Xfermode: SSE2 implementation of difference_modeproc

With the SSE2 implementation, the Xfermode_Difference bench time drops by
about 56% to 59% on a desktop i7-3770. Bench data:
before:
Xfermode_Difference   8888:  cmsecs =  51.10   565:  cmsecs =  66.39
after:
Xfermode_Difference   8888:  cmsecs =  21.10   565:  cmsecs =  29.33
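
For reference, a minimal scalar sketch of the per-pixel math that a
difference mode proc computes, assuming premultiplied colors packed as
0xAARRGGBB and the usual difference blend formula
Sc + Dc - 2 * min(Sc * Da, Dc * Sa). The helper names below are
hypothetical and for illustration only; they are not the Skia functions
touched by this change. An SSE2 version would presumably do the same
arithmetic on several pixels at once.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: rounded division by 255 of a product of two bytes.
static inline int div255_round(int prod) {
    prod += 128;
    return (prod + (prod >> 8)) >> 8;
}

// Hypothetical helper: clamp an int to the byte range [0, 255].
static inline int clamp_byte(int v) {
    return std::min(std::max(v, 0), 255);
}

// Difference blend for one premultiplied color channel:
// Sc + Dc - 2 * min(Sc * Da, Dc * Sa), with the product scaled back by 255.
static inline int difference_channel(int sc, int dc, int sa, int da) {
    int m = std::min(sc * da, dc * sa);
    return clamp_byte(sc + dc - 2 * div255_round(m));
}

// Blend one pixel. Assumes premultiplied 0xAARRGGBB packing (an assumption
// of this sketch, not necessarily the layout used on every platform).
static inline uint32_t difference_pixel(uint32_t src, uint32_t dst) {
    int sa = static_cast<int>(src >> 24);
    int da = static_cast<int>(dst >> 24);
    // Result alpha follows src-over: Sa + Da - Sa * Da.
    int a = sa + da - div255_round(sa * da);
    int r = difference_channel((src >> 16) & 0xFF, (dst >> 16) & 0xFF, sa, da);
    int g = difference_channel((src >> 8) & 0xFF, (dst >> 8) & 0xFF, sa, da);
    int b = difference_channel(src & 0xFF, dst & 0xFF, sa, da);
    return (static_cast<uint32_t>(a) << 24) | (static_cast<uint32_t>(r) << 16) |
           (static_cast<uint32_t>(g) << 8) | static_cast<uint32_t>(b);
}
```

For two opaque pixels this reduces to the absolute per-channel difference;
a fully transparent source leaves the destination unchanged.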

BUG=skia:
R=mtklein@google.com

Author: qiankun.miao@intel.com

Review URL: https://codereview.chromium.org/234433003

git-svn-id: http://skia.googlecode.com/svn/trunk@14375 2bbb7eff-a529-9590-31e7-b0007b416f81
1 file changed