sk4px the rest of the easy xfermodes.

Adds and uses fastMulDiv255Round() where possible,
which approximates x*y/255 as (x*y+x)/256.  Seems like a sizeable
speedup, as seen below on Exclusion, Screen, and Modulate.  The
existing NEON code uses this approximation for
{Src,Dst}x{In,Out,Over}, and without it we'd regress speed there.

This will require rebaselines whether or not we use this
approximation: the x86 bots change if we do, the ARM bots change
if we don't.  None of the diffs are significant.

Desktop:
   Xfermode_Screen_aa	5.82ms -> 5.54ms	0.95x
 Xfermode_Modulate_aa	5.67ms -> 5.36ms	0.95x
Xfermode_Exclusion_aa	6.18ms -> 5.81ms	0.94x
   Xfermode_Exclusion	5.03ms -> 4.24ms	0.84x
      Xfermode_Screen	4.51ms -> 3.59ms	0.8x
    Xfermode_Modulate	 4.2ms -> 3.19ms	0.76x
     Xfermode_DstOver	6.73ms -> 3.88ms	0.58x
      Xfermode_SrcOut	6.47ms -> 3.48ms	0.54x
       Xfermode_SrcIn	6.46ms -> 3.46ms	0.54x
      Xfermode_DstOut	6.49ms -> 3.41ms	0.52x
       Xfermode_DstIn	 6.5ms -> 3.32ms	0.51x
      Xfermode_Src_aa	9.53ms -> 4.75ms	0.5x
    Xfermode_Clear_aa	9.65ms ->  4.8ms	0.5x
    Xfermode_DstIn_aa	11.5ms -> 5.57ms	0.49x
  Xfermode_DstOver_aa	11.6ms -> 5.63ms	0.49x
   Xfermode_SrcOut_aa	11.6ms ->  5.5ms	0.47x
    Xfermode_SrcIn_aa	11.7ms -> 5.51ms	0.47x
   Xfermode_DstOut_aa	11.7ms ->  5.4ms	0.46x

N7 performance is close enough to 1x that I'm not sure whether
this is a net win, net loss, or truly neutral.  I figure the bots will
show that.

I experimented with another approximation,
(x*(255-y))/255 ≈ (x*(256-y))/256.  This was inconclusive, so I'm
leaving it out for now.

The remaining modes are the complicated conditional ones.

BUG=skia:

Review URL: https://codereview.chromium.org/1141953004
2 files changed