Use vmulq_n_u32(..., 0x01010101) to distribute alphas.

This seems to make alphas() faster while leaving Load[24]Alphas() no slower.
The gain is most noticeable on xfermodes that call alphas() twice (once on
src and once on dst), which see a 10-12% speedup:
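A minimal sketch of the trick, assuming 32-bit pixels with alpha in the top byte (bits 24..31). `distribute_alphas` is a hypothetical helper for illustration, not the function touched by this change. Each alpha is at most 0xFF, so multiplying it by 0x01010101 copies it into all four bytes with no carry between them; vmulq_n_u32 does that multiply on four lanes at once.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: broadcast each pixel's alpha byte into all four bytes.
 * Assumes alpha lives in bits 24..31 of each 32-bit pixel. */
static void distribute_alphas(const uint32_t px[4], uint32_t out[4]) {
#if defined(__ARM_NEON)
    uint32x4_t a = vshrq_n_u32(vld1q_u32(px), 24);  /* 0x000000AA per lane */
    vst1q_u32(out, vmulq_n_u32(a, 0x01010101));     /* 0xAAAAAAAA per lane */
#else
    /* Portable fallback: the same multiply, one pixel at a time. */
    for (int i = 0; i < 4; i++) {
        uint32_t a = px[i] >> 24;
        out[i] = a * 0x01010101u;
        /* Since a <= 0xFF, the multiply equals the shift-and-OR form. */
        assert(out[i] == (a | (a << 8) | (a << 16) | (a << 24)));
    }
#endif
}
```

For example, a pixel 0x80ABCDEF yields 0x80808080, and 0x00FFFFFF yields 0x00000000.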

Xfermode_Difference_aa	  29ms -> 28.4ms	0.98x
   Xfermode_DstATop_aa	27.2ms -> 26.7ms	0.98x
       Xfermode_Xor_aa	27.2ms -> 26.5ms	0.98x
      Xfermode_DstOver	23.6ms -> 22.9ms	0.97x
   Xfermode_DstOver_aa	27.8ms -> 26.8ms	0.96x
       Xfermode_DstOut	22.6ms -> 21.7ms	0.96x
  Xfermode_Multiply_aa	  30ms -> 28.5ms	0.95x
    Xfermode_DstOut_aa	26.1ms -> 24.8ms	0.95x
     Xfermode_DstIn_aa	25.4ms -> 24.1ms	0.95x
      Xfermode_DstATop	28.7ms ->   26ms	0.9x
     Xfermode_Multiply	35.5ms -> 31.3ms	0.88x
   Xfermode_Difference	31.8ms -> 27.7ms	0.87x
          Xfermode_Xor	30.1ms -> 26.1ms	0.87x

BUG=skia:

Review URL: https://codereview.chromium.org/1203513002