add mul_unorm8 instruction

Another way for an interpreter to go faster
is to provide better instructions.

mul_unorm8, which multiplies two 8-bit unorm values, is one we use all
the time.
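
For readers unfamiliar with the operation, here is a rough sketch of
what multiplying two unorm8 values means. It is illustrative only, not
the code in this change: the function names are hypothetical, and the
shift-based rounding is one common approximation, assumed here rather
than taken from the implementation. Without a dedicated instruction, an
interpreter would typically spend several primitive ops (multiply, add,
shift) on this per pixel.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not the names used in this change.

// Exact product of two unorm8 values: round(x*y / 255).
// Treats 0..255 as 0.0..1.0, so 255 acts as the multiplicative identity.
constexpr uint8_t mul_unorm8_exact(uint8_t x, uint8_t y) {
    return static_cast<uint8_t>((x * y + 127) / 255);
}

// A common cheaper approximation (assumed, not confirmed by the source):
// (x*y + 255) / 256, which replaces the divide by 255 with a shift.
constexpr uint8_t mul_unorm8_approx(uint8_t x, uint8_t y) {
    return static_cast<uint8_t>((x * y + 255) >> 8);
}
```

Under these assumptions the approximation still maps 0 to 0 and keeps
255 as the identity, and it differs from the exact result by at most 1.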

Drops the _I32 bench from ~3.6ns/px to ~2.6ns/px.

Change-Id: I9d08914c114048b79075796af9ec802236b35706
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218236
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
5 files changed