cut a multiply in SSE2 bilerp

Today the left two pixels and the right two pixels are each
multiplied by the allY weights, then added together:

    (L * (16-wx) * allY) + (R * wx * allY)

We can trivially refactor that, delaying the allY multiply
until it only needs to be done once:

    allY * ( L*(16-wx) + R*(wx) )

This cuts a multiply off the per-pixel cost.
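
To make the equivalence concrete, here is a scalar sketch of the two
forms for a single lane. The function names and the plain uint32_t
arithmetic are assumptions for illustration only; the real code works
on SSE2 vectors and is not reproduced here.

```cpp
#include <cstdint>

// Hypothetical scalar model of one lane (not the actual Skia code).
// L, R: left and right pixel values (0..255), wx: X weight (0..16),
// allY: the Y weight applied to this lane (0..16).

// Before: allY is multiplied into both the left and the right term.
inline uint32_t bilerp_before(uint32_t L, uint32_t R, uint32_t wx, uint32_t allY) {
    return (L * (16 - wx) * allY) + (R * wx * allY);
}

// After: allY is factored out, so it is multiplied in only once.
inline uint32_t bilerp_after(uint32_t L, uint32_t R, uint32_t wx, uint32_t allY) {
    return allY * (L * (16 - wx) + R * wx);
}
```

The largest possible value is 255 * 16 * 16 = 65280, so neither form
overflows in this model.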

As I write this CL description, I think the obvious next thing to try is

    allY * ( (R-L)*wx + L*16 )

as that L*16 can become a super cheap shift.
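
A scalar sketch of that follow-up idea, under the same caveats: the
names are hypothetical and this only models one lane. Signed integers
are used here because R-L can be negative.

```cpp
#include <cstdint>

// Hypothetical scalar model of the X interpolation for one lane.
// L, R: left and right pixel values (0..255), wx: X weight (0..16).

// Current inner term: two multiplies.
inline int32_t lerp_x_two_multiplies(int32_t L, int32_t R, int32_t wx) {
    return L * (16 - wx) + R * wx;
}

// Proposed inner term: one multiply, with L*16 written as a shift.
inline int32_t lerp_x_one_multiply(int32_t L, int32_t R, int32_t wx) {
    return (R - L) * wx + (L << 4);
}
```

Expanding L*(16-wx) + R*wx gives L*16 - L*wx + R*wx, which is the
same as (R-L)*wx + L*16.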

Change-Id: Id683801105834468a04d05854d7d494867168ef2
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/244236
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>