Rearrange SkRasterPipeline scanline tail handling.

We used to step at a 4-pixel stride as long as possible, then run up to 3 times, one pixel at a time.  Now replace those 1-at-a-time runs with a single tail stamp if there are 1-3 remaining pixels.

This style is simply more efficient: e.g. we'll blend and lerp once for 3 pixels instead of 3 times.  This should make short blits significantly more efficient.  It's also more future-oriented... AVX+ on Intel and SVE on ARM support masked loads and stores, so we can do the entire tail in one direct step.

This also makes it possible to re-arrange the code a bit to encapsulate each stage better.  I think generally this code reads more clearly than the old code, but YMMV.  I've arranged things so you write one function, but it's compiled into two specializations, one for tail=0 (Body) and one for tail>0 (Tail).  It's pretty tidy.

For now I've just burned a register to pass around tail.  It's 2 bits now, maybe soon 3 with AVX, and capped at 4 for even the craziest new toys, so there are plenty of places we can pack it if we want to get clever.

BUG=skia:

GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2717

Change-Id: I45852a3e5d4c5b5e9315302c46601aee0d32265f
Reviewed-on: https://skia-review.googlesource.com/2717
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
7 files changed