Add SSE4.1 optimization of Color32A_D565

Adds an SSE4.1 version of the Color32A_D565 function.

Performance improvement in the following benchmarks:
  Xfermode_SrcOver       - ~100%
  luma_colorfilter_large - ~150%
  luma_colorfilter_small - ~60%
  tablebench             - ~10%
  chart_bw               - ~10%
(Measured on an Atom Silvermont core)

Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>

Review URL: https://codereview.chromium.org/892623002
3 files changed