3-15% speedup to HardLight / Overlay xfermodes.

While investigating my bug (skia:4052) I saw this TODO and figured
it'd make me feel better about an otherwise unsuccessful investigation.

This speeds up HardLight and Overlay (same code) by about 15% with SSE, mostly
by rewriting the logic from 1 cheap comparison and 2 expensive div255() calls
to 2 cheap comparisons and 1 expensive div255().

NEON speeds up by a more modest ~3%.

BUG=skia:

Review URL: https://codereview.chromium.org/1230663005
8 files changed