SkPx: new approach to fixed-point SIMD

SkPx is like Sk4px, except each platform implementation of SkPx can declare
a different sweet spot of N pixels, with extra loads and stores to handle the
ragged edge of 0<n<N pixels.
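
A minimal sketch of that shape, in plain C++ rather than real intrinsics.
The names Px4, Load, store, and apply_row are hypothetical stand-ins chosen
for illustration, not the API in this CL: the body of the loop runs at the
implementation's sweet spot N, and a single partial load/store pair covers
the 0<n<N tail.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for one platform's implementation: a plain-C++
// "vector" of N = 4 packed 32-bit pixels. Not the real SkPx interface.
struct Px4 {
    static const int N = 4;
    uint32_t px[N];

    // Full-width load of exactly N pixels.
    static Px4 Load(const uint32_t* src) {
        Px4 v;
        std::memcpy(v.px, src, sizeof(v.px));
        return v;
    }
    // Ragged-edge load of 0 < n < N pixels; unused lanes are zeroed.
    static Px4 Load(const uint32_t* src, int n) {
        Px4 v = {};
        std::memcpy(v.px, src, n * sizeof(uint32_t));
        return v;
    }
    // Full-width store of exactly N pixels.
    void store(uint32_t* dst) const { std::memcpy(dst, px, sizeof(px)); }
    // Ragged-edge store: writes only the first n pixels.
    void store(uint32_t* dst, int n) const {
        std::memcpy(dst, px, n * sizeof(uint32_t));
    }
};

// Generic driver: N pixels at a time, then one partial step for the tail.
template <typename Px, typename Fn>
void apply_row(uint32_t* dst, const uint32_t* src, int n, Fn fn) {
    const int N = Px::N;
    while (n >= N) {
        fn(Px::Load(src)).store(dst);
        src += N; dst += N; n -= N;
    }
    if (n > 0) {
        fn(Px::Load(src, n)).store(dst, n);
    }
}

// Arbitrary example operation: flip the low 24 bits of every pixel.
inline Px4 flip_low24(const Px4& v) {
    Px4 r;
    for (int i = 0; i < Px4::N; i++) {
        r.px[i] = v.px[i] ^ 0x00FFFFFFu;
    }
    return r;
}

inline void flip_row(uint32_t* dst, const uint32_t* src, int n) {
    apply_row<Px4>(dst, src, n, flip_low24);
}
```

Because the driver only reads Px::N and calls Load/store, swapping in an
implementation with a different N should not require touching the caller.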

In this case, _sse's sweet spot remains 4 pixels.  _neon jumps up to 8 so
we can now use NEON's transposing loads and stores, and _none is just 1.
This makes operations involving alpha considerably more efficient on NEON,
as alpha is its own distinct 8x8-bit plane that's easy to toss around.
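
To illustrate why a separate alpha plane is convenient, here is a scalar
emulation of the planar layout that a transposing load such as NEON's
vld4_u8 produces.  This is a sketch, not the code in this CL: Planes8,
load_planar, store_planar, and premul8 are hypothetical names, premultiply
is just an example alpha operation, and alpha is assumed to be byte 3 of
each pixel in memory.

```cpp
#include <cstdint>

// Eight pixels split into four planes of eight bytes each.
// Assumption: alpha is byte 3 of every 4-byte pixel.
struct Planes8 {
    uint8_t c0[8], c1[8], c2[8], a[8];
};

// Deinterleave 32 bytes (8 pixels) into planes, as a transposing load would.
inline Planes8 load_planar(const uint8_t* bytes) {
    Planes8 p;
    for (int i = 0; i < 8; i++) {
        p.c0[i] = bytes[4 * i + 0];
        p.c1[i] = bytes[4 * i + 1];
        p.c2[i] = bytes[4 * i + 2];
        p.a[i]  = bytes[4 * i + 3];
    }
    return p;
}

// Re-interleave the planes back into 8 packed pixels.
inline void store_planar(const Planes8& p, uint8_t* bytes) {
    for (int i = 0; i < 8; i++) {
        bytes[4 * i + 0] = p.c0[i];
        bytes[4 * i + 1] = p.c1[i];
        bytes[4 * i + 2] = p.c2[i];
        bytes[4 * i + 3] = p.a[i];
    }
}

// Rounded (x * y) / 255 for 8-bit inputs.
inline uint8_t mul_div255(uint8_t x, uint8_t y) {
    unsigned prod = x * y + 128;
    return static_cast<uint8_t>((prod + (prod >> 8)) >> 8);
}

// Premultiply 8 pixels in place.  With alpha already in its own plane,
// each color plane is scaled lane-by-lane with no shuffling needed to
// broadcast alpha across a pixel's channels.
inline void premul8(uint8_t* bytes) {
    Planes8 p = load_planar(bytes);
    for (int i = 0; i < 8; i++) {
        p.c0[i] = mul_div255(p.c0[i], p.a[i]);
        p.c1[i] = mul_div255(p.c1[i], p.a[i]);
        p.c2[i] = mul_div255(p.c2[i], p.a[i]);
    }
    store_planar(p, bytes);
}
```

With interleaved pixels, as in a 4-pixel register of packed bytes, the same
operation first has to replicate each alpha byte across its pixel's lanes.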

This incorporates a few other improvements I've been wanting:
  - no requirement that we're dealing with SkPMColor.  SkColor works too.
  - no anonymous namespace hack to differentiate implementations.

Codegen and perf look good on Clang/x86-64 and GCC/ARMv7.
The NEON code looks very similar to the old NEON code, as intended.
No .skp or GM diffs on my laptop, and I don't expect any elsewhere.

I intend this to replace Sk4px.  Plan after landing:
  - port SkXfermode_opts.h
  - port Color32 in SkBlitRow_D32.cpp (and move to SkBlitRow_opts.h like other
    SkOpts code)
  - delete all Sk4px-related code
  - clean up evolutionary dead ends in SkNx (Sk16b, Sk16h, Sk4i, Sk4d, etc.)
    leaving Sk2f, Sk4f (and Sk2s, Sk4s).
  - find a machine with AVX2 to work on, write SkPx_avx2.h handling 8 pixels
    at a time.

In the end we'll have Sk4f for float pixels, SkPx for fixed-point pixels.

BUG=skia:4117

Committed: https://skia.googlesource.com/skia/+/82c93b45ed6ac0b628adb8375389c202d1f586f9

CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.compile:Build-Mac10.8-Clang-Arm7-Debug-Android-Trybot

Review URL: https://codereview.chromium.org/1317233005
5 files changed