SkJumper: handle the <kStride tail in AVX+ mode.

We have plenty of general-purpose registers to spare on x86-64,
so the cheapest thing to do is use one to hold the usual 'tail'.
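
As a rough illustration of the idea (not the actual SkJumper code), the
sketch below runs full strides with tail == 0 and then makes one extra
call for the leftover pixels. The names scale_stage and run_pipeline, the
kStride value of 8, and the convention that tail == 0 means "full stride"
are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstring>

// Assumed for this sketch: 8 float lanes, as in one AVX register.
constexpr size_t kStride = 8;

// Hypothetical stage that doubles each value.
// tail == 0 means a full stride; otherwise only the first `tail` lanes
// are backed by real memory, so we must not touch anything past them.
static void scale_stage(float* px, size_t tail) {
    float lanes[kStride] = {0};
    const size_t n = tail ? tail : kStride;

    std::memcpy(lanes, px, n * sizeof(float));   // partial load
    for (size_t i = 0; i < kStride; i++) {
        lanes[i] *= 2.0f;                        // work on all lanes
    }
    std::memcpy(px, lanes, n * sizeof(float));   // partial store
}

// Hypothetical driver: full strides first, then one call for the tail.
static void run_pipeline(float* px, size_t count) {
    size_t x = 0;
    for (; x + kStride <= count; x += kStride) {
        scale_stage(px + x, 0);
    }
    if (size_t tail = count - x) {
        scale_stage(px + x, tail);               // 0 < tail < kStride
    }
}
```

Passing tail as an ordinary integer argument is what lets it live in a
spare general-purpose register across stages.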

Speedups on HSW:
    SkRasterPipeline_srgb: 292 -> 170
    SkRasterPipeline_f16:  122 ->  90

There's plenty more room to improve here, e.g. using masked loads and
stores, but this seems to be enough to get things working reasonably.

BUG=skia:6289

Change-Id: I8c0ed325391822e9f36636500350205e93942111
Reviewed-on: https://skia-review.googlesource.com/9110
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
4 files changed