Xfermode: SSE2 implementation of overlay_modeproc

The SSE2 implementation cuts Xfermode_Overlay bench time by about 36%
(8888) and 40% (565) on a desktop i7-3770. Bench results:
before:
Xfermode_Overlay   8888:  cmsecs =     44.17   565:  cmsecs =     59.27
after:
Xfermode_Overlay   8888:  cmsecs =     28.30   565:  cmsecs =     35.84
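
For reference, a minimal scalar sketch of the per-channel overlay math that
an SSE2 version would vectorize. It assumes premultiplied 8-bit channels and
the standard overlay formula. The names overlay_channel and
clamp_div255_round are hypothetical, chosen for illustration, and are not
necessarily the ones used in this change.

```cpp
#include <cassert>

// Hypothetical helper: rounded division by 255, clamped to [0, 255].
static inline int clamp_div255_round(int prod) {
    if (prod <= 0) return 0;
    if (prod >= 255 * 255) return 255;
    prod += 128;                       // rounding bias
    return (prod + (prod >> 8)) >> 8;  // exact for 0 <= prod <= 65025
}

// Overlay for one premultiplied color channel (illustrative sketch).
// sc/dc: source/destination channel, sa/da: source/destination alpha.
static inline int overlay_channel(int sc, int dc, int sa, int da) {
    // Premultiplied input: a channel never exceeds its alpha.
    assert(0 <= sc && sc <= sa && sa <= 255);
    assert(0 <= dc && dc <= da && da <= 255);

    // Contribution where only one of the two layers covers the pixel.
    int outside = sc * (255 - da) + dc * (255 - sa);

    // Dark destination: multiply. Light destination: screen.
    int blend = (2 * dc <= da)
                    ? 2 * sc * dc
                    : sa * da - 2 * (da - dc) * (sa - sc);

    return clamp_div255_round(blend + outside);
}
```

The branch on 2 * dc <= da is what a SIMD version has to turn into a
compare-and-select, since the lanes of one register can take different paths.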

BUG=skia:
R=mtklein@google.com

Author: qiankun.miao@intel.com

Review URL: https://codereview.chromium.org/232783002

git-svn-id: http://skia.googlecode.com/svn/trunk@14370 2bbb7eff-a529-9590-31e7-b0007b416f81
2 files changed