SkJumper: use AVX2 mask loads and stores for U32

SkRasterPipeline_f16:  63 -> 58  (8888+f16 loads, f16 store)
SkRasterPipeline_srgb: 96 -> 84  (2x 8888 loads, 8888 store)

PS3 has a simpler way to build the mask, in a uint64_t.
Timing is still roughly the same.
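
For readers unfamiliar with the trick, here is a minimal, portable sketch of building a tail mask in a uint64_t. It is an illustration under stated assumptions, not the code in this change. All names (kLanes, tail_mask_bytes, expand_mask, masked_load, masked_store) are hypothetical, and the lane-by-lane loops only emulate in scalar code what AVX2 would do with _mm256_cvtepi8_epi32, _mm256_maskload_epi32 and _mm256_maskstore_epi32. It assumes 8 lanes of 32 bits and 1 <= tail <= 8.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 8 x 32-bit lanes, as in one 256-bit AVX2 register.
constexpr std::size_t kLanes = 8;
using Lanes = std::array<std::uint32_t, kLanes>;

// Build one byte per lane (0xff = active, 0x00 = inactive) in a uint64_t.
// Start with every lane on, then shift away the top (kLanes - tail) bytes.
inline std::uint64_t tail_mask_bytes(std::size_t tail) {
    assert(tail >= 1 && tail <= kLanes);  // a shift by 64 would be undefined
    return ~std::uint64_t{0} >> (8 * (kLanes - tail));
}

// Widen each mask byte to a full 32-bit lane, 0x00000000 or 0xffffffff.
// Scalar stand-in for sign-extending 8-bit lanes to 32-bit lanes.
inline Lanes expand_mask(std::uint64_t bytes) {
    Lanes mask{};
    for (std::size_t i = 0; i < kLanes; i++) {
        std::uint8_t b = static_cast<std::uint8_t>(bytes >> (8 * i));
        mask[i] = b ? 0xffffffffu : 0u;
    }
    return mask;
}

// Emulated mask load: inactive lanes read as zero and never touch memory.
inline Lanes masked_load(const std::uint32_t* src, const Lanes& mask) {
    Lanes v{};
    for (std::size_t i = 0; i < kLanes; i++) {
        if (mask[i] & 0x80000000u) v[i] = src[i];
    }
    return v;
}

// Emulated mask store: only active lanes are written to memory.
inline void masked_store(std::uint32_t* dst, const Lanes& mask, const Lanes& v) {
    for (std::size_t i = 0; i < kLanes; i++) {
        if (mask[i] & 0x80000000u) dst[i] = v[i];
    }
}
```

With tail = 3 the mask bytes are 0x0000000000ffffff, so only lanes 0 to 2 are read or written and memory past the third pixel is left alone.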

Change-Id: Ie278611dff02281e5a0f3a57185050bbe852bff0
Reviewed-on: https://skia-review.googlesource.com/9165
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Herb Derby <herb@google.com>
3 files changed