Rework SSE and NEON Color32 algorithms to be more correct and faster.

This algorithm changes the blend math, guarded by SK_LEGACY_COLOR32_MATH.  The new math is more correct: it's never off by more than 1, and correct in all the interesting 0x00 and 0xFF edge cases, where the old math was never off by more than 2, and not always correct on the edges.

If you look at tests/BlendTest.cpp, the old code was using the `blend_256_plus1_trunc` algorithm, while the new code uses `blend_256_round_alt`.  Neither uses `blend_perfect`, which is about ~35% slower than `blend_256_round_alt`.

This will require an unfathomable number of rebaselines, first to Skia, then to Blink when I remove the guard.

I plan to follow up with some integer SIMD abstractions that can unify these two implementations into a single algorithm.  This was originally what I was working on here, but the correctness gains seem to be quite compelling.  The only places these two algorithms really differ greatly now is the kernel function, and even there they can really both be expressed abstractly as:
  - multiply 8-bits and 8-bits producing 16-bits
  - add 16-bits to 16-bits, returning the top 8 bits.
All the constants are the same, except SSE is a little faster to keep 8 16-bit inverse alphas, NEON's a little faster to keep 8 8-bit inverse alphas.  I may need to take this small speed win back to unify the two.

We should expect a ~25% speedup on Intel (mostly from unrolling to 8 pixels) and a ~20% speedup on ARM (mostly from using vaddhn to add `color`, round, and narrow back down to 8-bit all into one instruction.

(I am probably missing several more related bugs here.)
BUG=skia:3738,skia:420,chromium:111470

Review URL: https://codereview.chromium.org/1092433002
3 files changed