grand unifried lowp stages

I have text_16_AA_FF -> 8888 (forcing RP) faster than head now on my
laptop.  I'm feeling confident that we can make this perform well.

After looking at performance a bit more today, it looks like everything
is within what I'd consider comparable in performance, especially on
ARM.  On x86-64 it looks like big bulk blits get a little slower and
small mask blits get a little faster.

Quality looks good, and maybe improved for 565.

There are fewer platform-specific differences now in _lowp, and I think
they're few enough now that we could even consider completing the
unification by folding the 8-bit and float code together.  Rename
"div255()" to "rebias()", slap on a few coats of paint...

Guarded for Chrome with SK_JUMPER_LEGACY_LOWP.

Change-Id: I36309c07cf736f3cb31952cca66030ad56026318
Reviewed-on: https://skia-review.googlesource.com/45982
Reviewed-by: Herb Derby <herb@google.com>
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
9 files changed