AVX 2 SrcOver blits: color32, blitmask.

As a follow-up to the SSE 4.1 CL, this should look pretty familiar.

I've made some organizational changes to how we load, store, pack, and unpack data that I think make things clearer and more orthogonal, and that will make it easier to try out a pmaddubsw lerp. I have backported these changes to the SSE 4.1 code, and I hope to get a lot of this code templated later so the two can share it.
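
For readers who haven't seen the SSE 4.1 CL, here is a rough scalar sketch of the kind of load/unpack, blend, pack/store split described above. It is not the AVX 2 code from this CL. All names (Wide, unpack, pack, div255, scale, srcover, blit_color32, blit_mask) are made up for illustration, and the alpha-in-top-byte layout and the rounding in div255 are assumptions that may not match what the real code does. Pixels are assumed to be premultiplied.

```cpp
#include <cstddef>
#include <cstdint>

// Four 16-bit lanes per pixel, so an 8-bit x 8-bit product fits without overflow.
// (Hypothetical type; stands in for half of a SIMD register.)
struct Wide { uint16_t lane[4]; };

// Unpack: widen each 8-bit channel of a packed 32-bit pixel to 16 bits.
inline Wide unpack(uint32_t px) {
    Wide w;
    for (int i = 0; i < 4; i++) w.lane[i] = static_cast<uint16_t>((px >> (8 * i)) & 0xFF);
    return w;
}

// Pack: narrow four 16-bit lanes (each assumed <= 255) back into one 32-bit pixel.
inline uint32_t pack(const Wide& w) {
    uint32_t px = 0;
    for (int i = 0; i < 4; i++) px |= static_cast<uint32_t>(w.lane[i] & 0xFF) << (8 * i);
    return px;
}

// Rounded x/255 for x in [0, 255*255]. Assumption: the real code may round differently.
inline uint16_t div255(uint32_t x) {
    uint32_t t = x + 128;
    return static_cast<uint16_t>((t + (t >> 8)) >> 8);
}

// Scale every channel of an unpacked pixel by a/255 (used to apply mask coverage).
inline Wide scale(const Wide& w, uint16_t a) {
    Wide r;
    for (int i = 0; i < 4; i++) r.lane[i] = div255(static_cast<uint32_t>(w.lane[i]) * a);
    return r;
}

// SrcOver on unpacked premultiplied pixels: s + d * (255 - sa) / 255.
// Assumption: alpha lives in lane 3 (the top byte of the packed pixel).
inline Wide srcover(const Wide& s, const Wide& d) {
    const uint16_t inv = static_cast<uint16_t>(255 - s.lane[3]);
    Wide r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = static_cast<uint16_t>(s.lane[i] + div255(static_cast<uint32_t>(d.lane[i]) * inv));
    }
    return r;
}

// color32-style blit: blend one constant color over n destination pixels.
inline void blit_color32(uint32_t* dst, size_t n, uint32_t color) {
    const Wide s = unpack(color);  // unpack the source once, outside the loop
    for (size_t i = 0; i < n; i++) dst[i] = pack(srcover(s, unpack(dst[i])));
}

// blitmask-style blit: same, but the color is first scaled by per-pixel coverage.
inline void blit_mask(uint32_t* dst, const uint8_t* mask, size_t n, uint32_t color) {
    const Wide s = unpack(color);
    for (size_t i = 0; i < n; i++) dst[i] = pack(srcover(scale(s, mask[i]), unpack(dst[i])));
}
```

Keeping unpack/pack separate from the blend math is what makes it cheap to swap in a different blend (a pmaddubsw-based lerp, say) without touching the load/store side. A vector version would do the same steps on many pixels per register instead of one.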

Perf changes (relative to SSE 4.1):
Xfermode_SrcOver:      1650 -> 1180  (0.71x)  // large opaque blit
Xfermode_SrcOver_aa:   1794 -> 1653  (0.92x)  // large opaque + small transparent
text_16_AA_{FF,BK,WT}: 1.72 -> 1.59  (0.92x)  // small opaque blit
text_16_AA_88:         1.83 -> 1.77  (0.97x)  // small transparent blit

This should be a big throughput win, and a small latency win.
This should all be pixel-exact to the previous SSE 4.1 code.

GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1532613002
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.compile:Build-Ubuntu-GCC-x86_64-Release-CMake-Trybot,Build-Mac10.9-Clang-x86_64-Release-CMake-Trybot

Review URL: https://codereview.chromium.org/1532613002
5 files changed