tweak mul/mad_unorm8

The approximation (x*y+x)/256 is slightly faster than
(x*y+255)/256 and keeps the property that the result is
never off by more than one (the low bit) from the
exact x*y/255.

In the interpreter this saves ~0.1 ns/px wherever these
ops are used. It's also nice for JITting because the new
form doesn't need any constant registers.

(x*y+y)/256 works just as well, of course.
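
A brute-force check of the error bound, as a minimal
sketch. The function names are hypothetical, not Skia's,
and the reference is assumed to be x*y/255 rounded to
nearest:

```cpp
#include <cassert>

// Assumed reference: exact x*y/255, rounded to nearest.
static int mul_exact(int x, int y) { return (x*y + 127) / 255; }

// The three approximations discussed above.
static int mul_old(int x, int y) { return (x*y + 255) / 256; }
static int mul_new(int x, int y) { return (x*y + x  ) / 256; }
static int mul_alt(int x, int y) { return (x*y + y  ) / 256; }

// Largest absolute difference between an approximation and
// the reference over all 256*256 pairs of unorm8 inputs.
static int max_error(int (*approx)(int, int)) {
    int worst = 0;
    for (int x = 0; x < 256; x++) {
        for (int y = 0; y < 256; y++) {
            int err = approx(x, y) - mul_exact(x, y);
            if (err < 0) { err = -err; }
            if (err > worst) { worst = err; }
        }
    }
    return worst;
}

// Each form stays within one of the reference, and the new
// form still maps 255*255 to 255.
static void check_bounds() {
    assert(max_error(mul_old) <= 1);
    assert(max_error(mul_new) <= 1);
    assert(max_error(mul_alt) <= 1);
    assert(mul_new(255, 255) == 255);
}
```

x*y+x is a single multiply-add that reuses an input as the
addend, which is why no 255 constant has to be kept around.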

Change-Id: Ic946e26f0d22c602dfa7e8fa0d64bf87db5505ac
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/219917
Commit-Queue: Mike Klein <mtklein@google.com>
Commit-Queue: Brian Osman <brianosman@google.com>
Auto-Submit: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
2 files changed