Make gather() look and work more like load().

They're really similar, so let's make them look that way.
Finally use mask load, mask store, and gather instructions for 8888.

We avoid mask load and store when tail == 0.  It's faster (one memory load instead of two) and a cheap test.
For gather, the intrinsics make it look like we could do the same, but it really all boils down to the same masked instruction in the end.

There's probably a better way to implement mask() with math instead of memory loads, but this works for now.

CQ_INCLUDE_TRYBOTS=skia.primary:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD

Change-Id: I578f47d4562ea19d983057bf2f4c3e21d0ab9a0e
Reviewed-on: https://skia-review.googlesource.com/5234
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
1 file changed