grand unified lowp stages

I have text_16_AA_FF -> 8888 (forcing RP) faster than head now on my
laptop.  I'm feeling confident that we can make this perform well.

After looking at performance a bit more today, everything looks
comparable to me, especially on ARM.  On x86-64, big bulk blits get a
little slower and small mask blits get a little faster.

Quality looks good, and maybe improved for 565.

There are fewer platform-specific differences now in _lowp, and I think
they're few enough now that we could even consider completing the
unification by folding the 8-bit and float code together.  Rename
"div255()" to "rebias()", slap on a few coats of paint...
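
For readers unfamiliar with the div255() step mentioned above, here is a
minimal sketch of what such a helper typically does in 8-bit fixed-point
math: after multiplying two 0..255 values, the product is divided by 255
to bring it back into the 0..255 range. This is an illustration only, not
the code in this change; the names div255_exact, div255_shift and
mul_unorm8 are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reference: round-to-nearest division by 255.
inline uint32_t div255_exact(uint32_t v) {
    return (v + 127) / 255;
}

// Hypothetical shift-only form with no integer divide.
// Matches div255_exact for 0 <= v <= 255*255, the range of a product
// of two 8-bit values.
inline uint32_t div255_shift(uint32_t v) {
    uint32_t t = v + 128;
    return (t + (t >> 8)) >> 8;
}

// Multiply two 8-bit values that each represent a fraction in [0, 1].
inline uint8_t mul_unorm8(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>(div255_shift(uint32_t(a) * uint32_t(b)));
}
```

A float pipeline needs no such step because its values are already stored
in [0, 1], which is presumably why a more neutral name is suggested for a
merged code path.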

Guarded for Chrome with SK_JUMPER_LEGACY_LOWP.

Change-Id: I36309c07cf736f3cb31952cca66030ad56026318
Reviewed-on: https://skia-review.googlesource.com/45982
Reviewed-by: Herb Derby <herb@google.com>
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
diff --git a/BUILD.gn b/BUILD.gn
index ef2ac07..9b46480 100644
--- a/BUILD.gn
+++ b/BUILD.gn
@@ -1804,6 +1804,7 @@
     inputs = [
       "src/jumper/SkJumper_stages.cpp",
       "src/jumper/SkJumper_stages_8bit.cpp",
+      "src/jumper/SkJumper_stages_lowp.cpp",
     ]
 
     # GN insists its outputs should go somewhere underneath target_out_dir, so we trick it.