SSE2 implementation of memcpy32

With the SSE2 version of memcpy32, S32_Opaque_BlitRow32() in
SkBlitRow_D32.cpp is about 30% faster. Here is the benchmark data from a
desktop i7-3770.
before:
       bitmap_scale_filter_90_90   8888:  cmsecs =      2.01
      bitmaprect_FF_filter_trans   8888:  cmsecs =      3.61
    bitmaprect_FF_nofilter_trans   8888:  cmsecs =      3.57
   bitmaprect_FF_filter_identity   8888:  cmsecs =      3.53
 bitmaprect_FF_nofilter_identity   8888:  cmsecs =      3.53
              bitmap_4444_update   8888:  cmsecs =      4.84
     bitmap_4444_update_volatile   8888:  cmsecs =      4.81
                     bitmap_4444   8888:  cmsecs =      4.81
after:
       bitmap_scale_filter_90_90   8888:  cmsecs =      1.83
      bitmaprect_FF_filter_trans   8888:  cmsecs =      2.36
    bitmaprect_FF_nofilter_trans   8888:  cmsecs =      2.36
   bitmaprect_FF_filter_identity   8888:  cmsecs =      2.60
 bitmaprect_FF_nofilter_identity   8888:  cmsecs =      2.63
              bitmap_4444_update   8888:  cmsecs =      3.30
     bitmap_4444_update_volatile   8888:  cmsecs =      3.30
                     bitmap_4444   8888:  cmsecs =      3.29

BUG=skia:
R=mtklein@google.com, reed@google.com, bsalomon@google.com

Author: qiankun.miao@intel.com

Review URL: https://codereview.chromium.org/285313002

git-svn-id: http://skia.googlecode.com/svn/trunk@14822 2bbb7eff-a529-9590-31e7-b0007b416f81
8 files changed