SkRasterPipeline: 8x pipelines, attempt 2

Original review here: https://skia-review.googlesource.com/c/2990/

Changes since:
  - simpler implementations of load_tail() / store_tail(): slower, but more obviously correct to all compilers
  - fleshed out math ops on Sk8i and Sk8u to make unit tests happy on -Fast bot (where we always have AVX2)
  - now storing stage functions as void(*)() to avoid undefined behavior and/or linker problems; this restores the 32-bit Windows build
  - all AVX2 Sk8x methods are marked always-inline, to avoid linking the "wrong" version on Debug builds
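
A rough sketch of two of the changes above, not the code in this CL: stage functions are kept as void(*)() and cast back to their real type before being called, and load_tail() / store_tail() use plain per-lane loops. Pipeline, Generic, StageFn, add_one, twice and the float[8] lane layout are made up for illustration; only load_tail, store_tail and void(*)() come from the list above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is an illustration, not Skia's code.
using Generic = void (*)();        // type-erased function pointer used for storage
using StageFn = float (*)(float);  // assumed "real" stage signature

inline float add_one(float x) { return x + 1.0f; }
inline float twice(float x)   { return x * 2.0f; }

struct Pipeline {
    std::vector<Generic> stages;

    void append(StageFn fn) {
        // Casting one function-pointer type to another and back yields the
        // original pointer, so storing as void(*)() loses nothing.
        stages.push_back(reinterpret_cast<Generic>(fn));
    }

    float run(float x) const {
        for (Generic g : stages) {
            // Always cast back to the original type before calling.
            x = reinterpret_cast<StageFn>(g)(x);
        }
        return x;
    }
};

// Simple tail load: copy the first n (n <= 8) floats, zero the remaining lanes.
inline void load_tail(const float* src, std::size_t n, float lanes[8]) {
    assert(n <= 8);
    for (std::size_t i = 0; i < 8; i++) {
        lanes[i] = (i < n) ? src[i] : 0.0f;
    }
}

// Simple tail store: write back only the first n (n <= 8) lanes.
inline void store_tail(const float lanes[8], std::size_t n, float* dst) {
    assert(n <= 8);
    for (std::size_t i = 0; i < n; i++) {
        dst[i] = lanes[i];
    }
}
```

With this sketch, appending add_one and then twice and running the pipeline on 3.0f gives 8.0f. The per-lane loops never read or write past n elements, which is the "obviously correct" property traded for speed.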

CQ_INCLUDE_TRYBOTS=master.client.skia:Perf-Ubuntu-Clang-GCE-CPU-AVX2-x86_64-Debug-ASAN-Trybot,Perf-Ubuntu-Clang-GCE-CPU-AVX2-x86_64-Debug-GN,Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot,Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Fast-Trybot;master.client.skia.compile:Build-Win-MSVC-x86_64-Debug-Trybot

GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=3064

Change-Id: Id0ba250037e271a9475fe2f0989d64f0aa909bae
Reviewed-on: https://skia-review.googlesource.com/3064
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
11 files changed