SSE4 opaque blend using intrinsics instead of assembly.

Since we had such a hard time with the assembly versions of this blit (to the
point that we have them completely disabled everywhere), I thought I'd take
a shot at writing a version of the blit using intrinsics.

The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
to skip the blend when the 16 src pixels we consider each loop are all opaque
or all transparent.  _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
all those alphas.
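
For illustration only, here is a minimal sketch of the ptest idea. It is not the code in this CL: the names blit_row_srcover and srcover_1 are made up, it assumes GCC or Clang on x86, it assumes premultiplied 8888 pixels with alpha in the top byte, and it tests alphas through a constant mask rather than extracting them with _mm_shuffle_epi8. The mixed case falls back to a scalar blend to keep the sketch short.

```cpp
#include <cstddef>
#include <cstdint>
#include <smmintrin.h>  // SSE4.1: _mm_testz_si128, _mm_testc_si128 (pulls in SSE2 too)

// Hypothetical helper: scalar premultiplied src-over for one pixel, using the
// common "multiply by (256 - alpha), shift right by 8" approximation.
static inline uint32_t srcover_1(uint32_t s, uint32_t d) {
    const uint32_t scale = 256 - (s >> 24);
    const uint32_t rb = (((d & 0x00FF00FFu) * scale) >> 8) & 0x00FF00FFu;
    const uint32_t ag = (((d >> 8) & 0x00FF00FFu) * scale) & 0xFF00FF00u;
    return s + (rb | ag);
}

// Hypothetical sketch: look at 16 src pixels per iteration and skip the blend
// when they are all transparent (dst unchanged) or all opaque (plain copy).
__attribute__((target("sse4.1")))
void blit_row_srcover(uint32_t* dst, const uint32_t* src, size_t count) {
    const __m128i alphaMask = _mm_set1_epi32(static_cast<int>(0xFF000000u));
    while (count >= 16) {
        const __m128i s0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 0));
        const __m128i s1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 4));
        const __m128i s2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 8));
        const __m128i s3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 12));
        const __m128i ored  = _mm_or_si128(_mm_or_si128(s0, s1), _mm_or_si128(s2, s3));
        const __m128i anded = _mm_and_si128(_mm_and_si128(s0, s1), _mm_and_si128(s2, s3));

        if (_mm_testz_si128(ored, alphaMask)) {
            // (ored & alphaMask) == 0: every alpha is 0, so dst is left alone.
        } else if (_mm_testc_si128(anded, alphaMask)) {
            // (~anded & alphaMask) == 0: every alpha is 0xFF, so just copy src.
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 0), s0);
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 4), s1);
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 8), s2);
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 12), s3);
        } else {
            // Mixed alphas: do the real blend (scalar here for brevity).
            for (int i = 0; i < 16; i++) {
                dst[i] = srcover_1(src[i], dst[i]);
            }
        }
        src += 16;
        dst += 16;
        count -= 16;
    }
    // Tail: fewer than 16 pixels left, blend one at a time.
    for (size_t i = 0; i < count; i++) {
        dst[i] = srcover_1(src[i], dst[i]);
    }
}
```

The OR of the four registers is enough to prove "all transparent" and the AND is enough to prove "all opaque", so each case costs one ptest per 16 pixels.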

It's worth looking to see if we can backport this type of logic to SSE2 using
_mm_movemask_epi8, or extend it to 32 pixels at a time using AVX.
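
As a rough idea of what an SSE2 backport of the alpha test might look like (an untested-in-Skia sketch, not part of this CL): compare the masked alphas bytewise and check that _mm_movemask_epi8 reports all 16 bytes equal. The function names are hypothetical, and the sketch assumes x86-64 (where SSE2 is baseline) and alpha in the top byte of each pixel.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Hypothetical helper: unaligned load of 4 pixels.
static inline __m128i load4(const uint32_t* p) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
}

// True if all 16 pixels at src have alpha == 0.
bool all_transparent_sse2(const uint32_t* src) {
    const __m128i alphaMask = _mm_set1_epi32(static_cast<int>(0xFF000000u));
    const __m128i ored = _mm_or_si128(_mm_or_si128(load4(src + 0), load4(src + 4)),
                                      _mm_or_si128(load4(src + 8), load4(src + 12)));
    const __m128i alphas = _mm_and_si128(ored, alphaMask);
    // cmpeq gives 0xFF for every byte that is zero; movemask gathers the top
    // bit of each byte, so 0xFFFF means all 16 bytes matched.
    return _mm_movemask_epi8(_mm_cmpeq_epi8(alphas, _mm_setzero_si128())) == 0xFFFF;
}

// True if all 16 pixels at src have alpha == 0xFF.
bool all_opaque_sse2(const uint32_t* src) {
    const __m128i alphaMask = _mm_set1_epi32(static_cast<int>(0xFF000000u));
    const __m128i anded = _mm_and_si128(_mm_and_si128(load4(src + 0), load4(src + 4)),
                                        _mm_and_si128(load4(src + 8), load4(src + 12)));
    const __m128i alphas = _mm_and_si128(anded, alphaMask);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(alphas, alphaMask)) == 0xFFFF;
}
```

This costs a mask, a compare, and a movemask per test instead of a single ptest, so whether it pays off on SSE2 would need measuring.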

My local performance testing doesn't show this to be an unambiguous win
(there are probably microbenchmarks and SKPs where we'd be better off just
powering through the blend rather than looking at alphas), but the potential
does seem tantalizing enough to let skiaperf vet it on the bots.  (In the
ratios below, < 1.0x is a win.)

DM says it draws pixel-perfect compared to the old code.

Microbenchmarks:
               bitmap_RGBA_8888_A_source_stripes_two	  14us -> 14.4us	1.03x
             bitmap_RGBA_8888_A_source_stripes_three	14.3us -> 14.5us	1.01x
                       bitmap_RGBA_8888_scale_bilerp	61.9us -> 62.2us	1.01x
bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp	 102us ->  101us	0.99x
                bitmap_RGBA_8888_scale_rotate_bilerp	 103us ->  101us	0.99x
                              bitmap_RGBA_8888_scale	18.4us -> 18.2us	0.99x
             bitmap_RGBA_8888_A_scale_rotate_bicubic	  71us ->   70us	0.99x
         bitmap_RGBA_8888_update_scale_rotate_bilerp	 103us ->  101us	0.99x
              bitmap_RGBA_8888_A_scale_rotate_bilerp	 112us ->  109us	0.98x
                    bitmap_RGBA_8888_update_volatile	5.72us -> 5.58us	0.98x
                                    bitmap_RGBA_8888	5.73us -> 5.58us	0.97x
                             bitmap_RGBA_8888_update	5.78us ->  5.6us	0.97x
                     bitmap_RGBA_8888_A_scale_bilerp	70.7us ->   68us	0.96x
                    bitmap_RGBA_8888_A_scale_bicubic	23.7us -> 21.8us	0.92x
                                  bitmap_RGBA_8888_A	13.9us -> 10.9us	0.78x
                    bitmap_RGBA_8888_A_source_opaque	  14us -> 6.29us	0.45x
               bitmap_RGBA_8888_A_source_transparent	  14us -> 3.65us	0.26x

Running over our ~70 SKP web page captures, this looks like we spend 0.7x
the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
be a decent predictor of real-world impact.

BUG=chromium:399842

Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785

CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot

Review URL: https://codereview.chromium.org/874863002
7 files changed