8-bit hacking

I think we can replace a lot of legacy code with an SkRasterPipeline
backend that works in 8-bit and stays interleaved.  Think of this as a
"lowerp" replacement for lowp.

I'm having some trouble getting ARMv8 working.
ARMv7 should be fine, but I want to turn it on separately from x86.
I haven't looked at 32-bit x86 yet, but that's also on the todo list.

Open questions to follow up on:
  - is it better to fold every multiply back down to 8-bit
    (as seen here), or to allow intermediates to accumulate
    in 16-bit and divide by 255 when done/needed?
  - is it better to pass tightly packed 8-bit vectors between stages
    (as seen here), or to keep the 8-bit values unpacked in 16-bit lanes?
  - should we make V wider than 1 register?
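
To make the first open question concrete, here is a scalar sketch of the
two rounding strategies applied to a lerp.  It is not the code in this
change: div255, mul8, lerp_folded and lerp_wide are hypothetical names
used only for illustration, and the real stages work on vectors rather
than single channels.

```cpp
#include <cassert>
#include <cstdint>

// Rounding divide by 255, exact for 0 <= x <= 255*255.
inline uint16_t div255(uint16_t x) {
    uint16_t t = static_cast<uint16_t>(x + 128);
    return static_cast<uint16_t>((t + (t >> 8)) >> 8);
}

// Option 1: fold every multiply straight back down to 8 bits.
inline uint8_t mul8(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>(div255(static_cast<uint16_t>(a * b)));
}

// Lerp with two separately rounded 8-bit products.
inline uint8_t lerp_folded(uint8_t src, uint8_t dst, uint8_t t) {
    return static_cast<uint8_t>(mul8(src, t) +
                                mul8(dst, static_cast<uint8_t>(255 - t)));
}

// Option 2: accumulate both products in 16 bits and divide by 255 once.
inline uint8_t lerp_wide(uint8_t src, uint8_t dst, uint8_t t) {
    int sum = src * t + dst * (255 - t);
    assert(sum <= 255 * 255);  // the weights sum to 255, so this fits in 16 bits
    return static_cast<uint8_t>(div255(static_cast<uint16_t>(sum)));
}
```

With src=1, dst=2, t=100 the folded version rounds each product down and
returns 1, while the wide version rounds once and returns 2.  That single
extra rounding step per multiply is the trade-off the question is about.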

GMs look good.  All diffs invisible and plausibly due to the 15->8 bit
precision drop.  A quick bench run showed this running in about 0.75x
the time of the existing lowp backend.

Change-Id: I24aa46ff1d19c0b9b8dc192d5b1821cab0b8843c
Reviewed-on: https://skia-review.googlesource.com/29886
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Florin Malita <fmalita@chromium.org>
6 files changed