SSE2 implementation of S32A_D565_Opaque

A microbenchmark of S32A_D565_Opaque() shows a 3x speedup from the SSE2 optimization across various count values on an i7-3770.

BUG=
R=mtklein@google.com, reed@google.com

Author: qiankun.miao@intel.com

Review URL: https://codereview.chromium.org/138163013

git-svn-id: http://skia.googlecode.com/svn/trunk@13495 2bbb7eff-a529-9590-31e7-b0007b416f81
diff --git a/src/opts/SkBlitRow_opts_SSE2.h b/src/opts/SkBlitRow_opts_SSE2.h
index b443ec7..66bc95a 100644
--- a/src/opts/SkBlitRow_opts_SSE2.h
+++ b/src/opts/SkBlitRow_opts_SSE2.h
@@ -28,3 +28,7 @@
                          SkColor color, int width, SkPMColor);
 void SkBlitLCD16OpaqueRow_SSE2(SkPMColor dst[], const uint16_t src[],
                                SkColor color, int width, SkPMColor opaqueDst);
+
+void S32A_D565_Opaque_SSE2(uint16_t* SK_RESTRICT dst,
+                           const SkPMColor* SK_RESTRICT src,
+                           int count, U8CPU alpha, int /*x*/, int /*y*/);