Add SSE4 optimization of S32A_Opaque_Blitrow

Adds optimization of Skia S32A_Opaque_Blitrow blitter using SSE4.2 SIMD
instruction set. Special case for when alpha is zero or opaque.

Performance increase of 10%-400% compared to the existing SSE2
optimization (measured on Silvermont architecture).
Noticeable in ~25 different skia bench subtests, especially in
bitmap_8888_*, repeatTile_*, and morph_*.

bitmap_8888_A - 100% faster
bitmap_8888_A_source_transparent - 250% faster
bitmap_8888_A_source_opaque - 25% faster
bitmap_8888_A_scale_bicubic - 75% faster

Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>

Committed: https://skia.googlesource.com/skia/+/e2527b147679b0c43019fae7d59cc3777d2d097e

R=reed@google.com, mtklein@google.com, tomhudson@google.com, djsollen@google.com, joakim.landberg@intel.com

Author: henrik.smiding@intel.com

Review URL: https://codereview.chromium.org/289473009
6 files changed