Really use SSE4 (and SSSE3) in SkBlurImage_SSE4

We don't seem to be making good use of the available instruction set.
SSE4.1 gives us an easy way to unpack a pixel into an __m128i, and
SSSE3 gives us an easy way to do the reverse.
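
For illustration only, here is a minimal sketch of what the unpack and
repack steps can look like. It assumes the instructions in question are
pmovzxbd (_mm_cvtepu8_epi32) and pshufb (_mm_shuffle_epi8). The helper
names expand_pixel, repack_pixel and average_pixels are hypothetical and
are not taken from SkBlurImage_SSE4.

```cpp
#include <immintrin.h>
#include <cstdint>

// Let GCC/Clang build this without -msse4.1; other compilers are assumed
// to expose the intrinsics unconditionally.
#if defined(__GNUC__)
#define SSE41_TARGET __attribute__((target("sse4.1")))
#else
#define SSE41_TARGET
#endif

// Hypothetical helper: zero-extend the four 8-bit channels of one 32-bit
// pixel into four 32-bit lanes (SSE4.1 pmovzxbd).
SSE41_TARGET static inline __m128i expand_pixel(uint32_t px) {
    return _mm_cvtepu8_epi32(_mm_cvtsi32_si128(static_cast<int>(px)));
}

// Hypothetical helper: gather the low byte of each 32-bit lane back into
// one 32-bit pixel (SSSE3 pshufb). A mask byte of -1 zeroes that output byte.
SSE41_TARGET static inline uint32_t repack_pixel(__m128i lanes) {
    const __m128i mask = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1,
                                      -1, -1, -1, -1, 12, 8, 4, 0);
    return static_cast<uint32_t>(_mm_cvtsi128_si32(_mm_shuffle_epi8(lanes, mask)));
}

// Hypothetical usage: per-channel average of two pixels, computed in
// 32-bit lanes so the intermediate sum cannot overflow a byte.
SSE41_TARGET uint32_t average_pixels(uint32_t a, uint32_t b) {
    __m128i sum = _mm_add_epi32(expand_pixel(a), expand_pixel(b));
    return repack_pixel(_mm_srli_epi32(sum, 1));
}
```

The repack step keeps only the low byte of each lane, so it relies on
every lane already being in the range 0..255.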

The output should be bit-identical to before, and the blur about 10% faster.

BUG=skia:

Review URL: https://codereview.chromium.org/1123263003
1 file changed