3-15% speedup to HardLight / Overlay xfermodes.

While investigating my bug (skia:4052) I saw this TODO and figured resolving
it would make me feel better about an otherwise unsuccessful investigation.

This speeds up HardLight and Overlay (same code) by about 15% with SSE, mostly
by rewriting the logic from 1 cheap comparison and 2 expensive div255() calls
to 2 cheap comparisons and 1 expensive div255() call.
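
A rough scalar sketch of the idea, for one premultiplied color channel. This is
not the Sk4px code from this change: the function names and the rounding
div255() helper are hypothetical stand-ins, and the formula is the usual
premultiplied HardLight. The old shape divides both candidate results and then
selects one; the new shape selects first and divides once. In the SIMD code the
select presumably has to happen at the wider intermediate width, which would
account for the extra comparison.

```cpp
#include <cstdint>

// Hypothetical stand-in for the expensive step: rounding divide by 255.
static inline int div255(int x) { return (x + 127) / 255; }

// Old shape: both candidates are divided by 255, then one comparison picks a result.
// s, d are premultiplied channel values; sa, da are the source and destination alphas.
static inline int hardlight_two_divs(int s, int sa, int d, int da) {
    int both = s * (255 - da) + d * (255 - sa);
    int dark = 2 * s * d;
    int lite = sa * da - 2 * (da - d) * (sa - s);
    int ifDark = div255(dark + both);   // first expensive divide
    int ifLite = div255(lite + both);   // second expensive divide
    return (2 * s <= sa) ? ifDark : ifLite;
}

// New shape: pick the undivided term first, then divide the sum once.
static inline int hardlight_one_div(int s, int sa, int d, int da) {
    int both = s * (255 - da) + d * (255 - sa);
    int dark = 2 * s * d;
    int lite = sa * da - 2 * (da - d) * (sa - s);
    int picked = (2 * s <= sa) ? dark : lite;
    return div255(both + picked);       // the only expensive divide
}
```

Both functions return the same value for any valid premultiplied input
(s <= sa, d <= da); only the number of divides differs.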

NEON speeds up by a more modest ~3%.

BUG=skia:

Review URL: https://codereview.chromium.org/1230663005
diff --git a/tests/SkNxTest.cpp b/tests/SkNxTest.cpp
index 5893214..4005d25 100644
--- a/tests/SkNxTest.cpp
+++ b/tests/SkNxTest.cpp
@@ -192,3 +192,19 @@
     }
     }
 }
+
+DEF_TEST(Sk4px_widening, r) {
+    SkPMColor colors[] = {
+        SkPreMultiplyColor(0xff00ff00),
+        SkPreMultiplyColor(0x40008000),
+        SkPreMultiplyColor(0x7f020406),
+        SkPreMultiplyColor(0x00000000),
+    };
+    auto packed = Sk4px::Load4(colors);
+
+    auto wideLo = packed.widenLo(),
+         wideHi = packed.widenHi(),
+         wideLoHi    = packed.widenLoHi(),
+         wideLoHiAlt = wideLo + wideHi;
+    REPORTER_ASSERT(r, 0 == memcmp(&wideLoHi, &wideLoHiAlt, sizeof(wideLoHi)));
+}