Optimize SSE2 opaque blend

Backport optimization from https://codereview.chromium.org/874863002/.
Microbenchmark data compared to the previous SSE2 implementation
(old -> new; ratio = new / old, so values below 1x are faster):
             bitmap_BGRA_8888_A_source_stripes_three    7.52us -> 8.67us        1.15x
               bitmap_BGRA_8888_A_source_stripes_two    7.48us -> 8.56us        1.15x
         bitmap_BGRA_8888_update_scale_rotate_bilerp    63.4us ->   64us        1.01x
                    bitmap_BGRA_8888_update_volatile    3.31us -> 3.33us        1.01x
                              bitmap_BGRA_8888_scale    11.1us -> 11.2us        1x
                       bitmap_BGRA_8888_scale_bilerp    35.8us -> 35.9us        1x
                                    bitmap_BGRA_8888    3.33us -> 3.33us        1x
             bitmap_BGRA_8888_A_scale_rotate_bicubic    66.7us -> 66.5us        1x
bitmap_BGRA_8888_update_volatile_scale_rotate_bilerp    65.1us ->   64us        0.98x
                bitmap_BGRA_8888_scale_rotate_bilerp    65.1us ->   64us        0.98x
                    bitmap_BGRA_8888_A_scale_bicubic    30.6us -> 29.9us        0.98x
                     bitmap_BGRA_8888_A_scale_bilerp    42.7us -> 41.4us        0.97x
              bitmap_BGRA_8888_A_scale_rotate_bilerp      71us -> 67.7us        0.95x
                                  bitmap_BGRA_8888_A    7.44us ->  5.7us        0.77x
                    bitmap_BGRA_8888_A_source_opaque    7.46us -> 3.72us        0.5x
               bitmap_BGRA_8888_A_source_transparent    7.46us -> 1.96us        0.26x
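
The large wins on the source_opaque and source_transparent benches are consistent with per-group fast paths on the source alpha. The following is only a minimal sketch of that idea, not the code in this change: blend_one and blend_row are hypothetical names, pixels are assumed to be premultiplied 32-bit values with alpha in the top byte, and the mixed-alpha case falls back to a scalar blend here, whereas a real SSE2 implementation would blend all four pixels in vector registers.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: premultiplied src-over for a single 32-bit pixel,
// using the usual "scale dst by (256 - srcAlpha) / 256" approximation.
static inline uint32_t blend_one(uint32_t src, uint32_t dst) {
    uint32_t scale = 256 - (src >> 24);
    uint32_t rb = (((dst & 0x00FF00FFu) * scale) >> 8) & 0x00FF00FFu;
    uint32_t ag = (((dst >> 8) & 0x00FF00FFu) * scale) & 0xFF00FF00u;
    return src + (rb | ag);
}

// Hypothetical row blitter: inspects four source pixels at a time.
void blend_row(uint32_t* dst, const uint32_t* src, int count) {
    const __m128i alpha_mask = _mm_set1_epi32(static_cast<int>(0xFF000000u));
    const __m128i zero = _mm_setzero_si128();
    while (count >= 4) {
        __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
        __m128i a = _mm_and_si128(s, alpha_mask);
        if (_mm_movemask_epi8(_mm_cmpeq_epi32(a, zero)) == 0xFFFF) {
            // All four pixels fully transparent: dst is unchanged
            // (valid because premultiplied alpha 0 implies the pixel is 0).
        } else if (_mm_movemask_epi8(_mm_cmpeq_epi32(a, alpha_mask)) == 0xFFFF) {
            // All four pixels fully opaque: plain copy, no blending math.
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), s);
        } else {
            // Mixed alpha: scalar fallback to keep the sketch short.
            for (int i = 0; i < 4; ++i) {
                dst[i] = blend_one(src[i], dst[i]);
            }
        }
        src += 4;
        dst += 4;
        count -= 4;
    }
    // Leftover pixels when count is not a multiple of four.
    while (count-- > 0) {
        *dst = blend_one(*src, *dst);
        ++src;
        ++dst;
    }
}
```

Under these assumptions a fully transparent group costs one load and one compare, and a fully opaque group one extra store. The extra branches in this sketch add work for sources that alternate between opaque and transparent runs, which would match the direction of the stripes results.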

BUG=skia:

Review URL: https://codereview.chromium.org/886403002
1 file changed