double pump 8-bit stages

This effectively unrolls every loop by two, so each stride handles
twice as many pixels.  Stages now pass around 4 native registers
instead of just 2.
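
A minimal sketch of the double-pumping idea, not the code in this
change.  Reg, V2, swap_rb and swap_rb_row are hypothetical names; Reg is
a portable stand-in for one native register holding 4 packed 8888
pixels.  A second V2 for the destination would give the 4 registers
mentioned above, but that reading is an assumption.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for one native SIMD register: 4 packed 8888 pixels.
using Reg = std::array<uint32_t, 4>;

// Double-pumped vector: two native registers treated as one logical value.
struct V2 { Reg lo, hi; };

// Swap the low byte with the third byte of one pixel (R <-> B in 8888).
static uint32_t swap_rb(uint32_t p) {
    return (p & 0xff00ff00u) | ((p & 0xffu) << 16) | ((p >> 16) & 0xffu);
}

// The same operation applied across one register.
static Reg swap_rb(const Reg& r) {
    Reg out{};
    for (size_t i = 0; i < r.size(); i++) {
        out[i] = swap_rb(r[i]);
    }
    return out;
}

// Process a row in place.  The stride is two registers (8 pixels), i.e.
// the single-register loop body written out twice per iteration.
static void swap_rb_row(uint32_t* px, size_t n) {
    constexpr size_t kPerReg = 4;
    constexpr size_t kStride = 2 * kPerReg;

    size_t i = 0;
    for (; i + kStride <= n; i += kStride) {
        V2 v;
        std::memcpy(v.lo.data(), px + i,           sizeof(Reg));
        std::memcpy(v.hi.data(), px + i + kPerReg, sizeof(Reg));

        v.lo = swap_rb(v.lo);
        v.hi = swap_rb(v.hi);

        std::memcpy(px + i,           v.lo.data(), sizeof(Reg));
        std::memcpy(px + i + kPerReg, v.hi.data(), sizeof(Reg));
    }

    // Tail of fewer than kStride pixels, handled one pixel at a time
    // here rather than with masked loads and stores.
    for (; i < n; i++) {
        px[i] = swap_rb(px[i]);
    }
}
```

The scalar tail in this sketch stands in for whatever partial load and
store path the real stages use when fewer than a full stride remains.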

I've temporarily disabled AVX2 mask loads and stores.  It shouldn't be
hard to turn them back on, but I'd want to test on AVX2 hardware first.

Change-Id: I0907070f086a0650167456c149a479c1d96b8a2d
Reviewed-on: https://skia-review.googlesource.com/33361
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
3 files changed