gather8/16 JIT support

The basic strategy is one at a time, inserting 8- or 16-bit values
into an Xmm register, then expanding to 32-bit in a Ymm at the end
using vpmovzx{b,w}d instructions.

Somewhat annoyingly we can only pull indices from an Xmm register,
so we grab the first four then shift down the top before the rest.

Added a unit test to get coverage where the indices are reused and
not consumed directly by the gather instruction.  It's an important
case, needing to find another register for accum that can't just be
dst(), but there's no natural coverage of that anywhere.

Change-Id: I8189ead2364060f10537a2f9364d63338a7e596f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/284311
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
3 files changed