improve scalar gather32

This loads 32 bits instead of gathering 256 in the tail part of loops.

To make it work, add a vmovd with SIB addressing.

I also remembered that the mysterious 0b100 is actually a signal that
the instruction uses SIB addressing, and is usually denoted by `rsp`.

(SIB addressing may be something we'd want to generalize over like we
did recently with YmmOrLabel, but I'll leave that for Future Me.)

Slight rewording where "scratch" is mentioned to keep it focused on
scratch GP registers, not "tmp" ymm registers.  Not a hugely important
distinction but helps when I'm grepping through code.

Change-Id: I39a6ab1a76ea0c103ae7d3ebc97a1b7d4b530e73
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/264376
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
3 files changed