Add SSSE3 acceleration for S32_D16_filter_DX

This CL improves the related nanobench results for the 565 config
(time before -> after, with the after/before ratio):
         bitmap_BGRA_8888_update_scale_bilerp   76.1us -> 46.7us        0.61x
                bitmap_BGRA_8888_scale_bilerp   78.7us ->   47us        0.6x
bitmap_BGRA_8888_update_volatile_scale_bilerp   82.7us -> 46.9us        0.57x
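
For context, here is a minimal scalar sketch of the per-pixel work that a
32-bit-source, 565-destination bilinear filter has to do, which is the kind of
arithmetic a SIMD path can batch. It is not the code from this CL. The helper
names, the 4-bit sub-pixel weights, the A/R/G/B bit layout, and the truncating
565 pack are all assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical helper (not from the CL): bilinear blend of one 8-bit channel.
// subX and subY are 4-bit fractional weights in [0, 15] (an assumption).
static inline unsigned bilerp_channel(unsigned c00, unsigned c01,
                                      unsigned c10, unsigned c11,
                                      unsigned subX, unsigned subY) {
    unsigned top = c00 * (16 - subX) + c01 * subX;   // blend along X, top row
    unsigned bot = c10 * (16 - subX) + c11 * subX;   // blend along X, bottom row
    return (top * (16 - subY) + bot * subY) >> 8;    // blend along Y, drop 8 fraction bits
}

// Hypothetical helper: pack 8-bit R, G, B into RGB565 by truncation.
static inline uint16_t pack_565(unsigned r, unsigned g, unsigned b) {
    return static_cast<uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

// Filter four neighboring 32-bit pixels into one 565 pixel.
// Assumed layout for illustration: A in bits 24-31, R 16-23, G 8-15, B 0-7.
// Alpha is ignored because the destination has no alpha channel.
static inline uint16_t bilerp_to_565(uint32_t p00, uint32_t p01,
                                     uint32_t p10, uint32_t p11,
                                     unsigned subX, unsigned subY) {
    unsigned r = bilerp_channel((p00 >> 16) & 0xFF, (p01 >> 16) & 0xFF,
                                (p10 >> 16) & 0xFF, (p11 >> 16) & 0xFF, subX, subY);
    unsigned g = bilerp_channel((p00 >> 8) & 0xFF, (p01 >> 8) & 0xFF,
                                (p10 >> 8) & 0xFF, (p11 >> 8) & 0xFF, subX, subY);
    unsigned b = bilerp_channel(p00 & 0xFF, p01 & 0xFF,
                                p10 & 0xFF, p11 & 0xFF, subX, subY);
    return pack_565(r, g, b);
}
```

In a filter_DX-style loop the Y weight and the two source rows stay fixed for
the whole span, so only the X neighbors and subX change per pixel. That makes
the multiply-and-add steps above regular enough to process several pixels per
instruction.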

BUG=skia:

Review URL: https://codereview.chromium.org/788853002
3 files changed