Replace SSE optimization of Color32A_D565

Adds an SSE2 version of the Color32A_D565 function, to replace
the existing SSE4 version. Also does some minor cleanup.

Performance improvement in the following Skia benchmarks.
Measured on Atom Silvermont:
  Xfermode_SrcOver       - x3
  luma_colorfilter_large - x4.6
  luma_colorfilter_small - x2
  tablebench             - ~15%
  chart_bw               - ~10%

Measured on Corei7 Haswell:
luma_colorfilter_large running SSE2 - x2
luma_colorfilter_large running SSE4 - x2.3

Also improves performance in WPS Office application and 2D subtest of 0xbenchmark on Android.

Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>

Review URL: https://codereview.chromium.org/923523002
8 files changed