Convert Color32 code to perfect blend.

Before we commit to blend_256_round_alt, let's make sure blend_perfect is
really slower in practice (i.e. regresses on perf.skia.org).

blend_perfect is really the most desirable algorithm if we can afford it.  Not
only is it correct, but it's easy to think about and break into correct pieces:
for instance, its div255() doesn't require any coordination with the multiply.
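
To illustrate that last point, here is a minimal sketch of a standalone
rounding divide-by-255 and a per-channel blend built on it.  This is not the
code in this change: the names div255_round, blend_perfect_sketch and
count_div255_mismatches are hypothetical, and the sketch assumes premultiplied
src and a product in [0, 255*255].

```cpp
#include <cstdint>

// Rounded division by 255, exact for x in [0, 255*255].
// It stands alone: the caller's multiply needs no bias or adjustment.
static inline uint32_t div255_round(uint32_t x) {
    x += 128;
    return (x + (x >> 8)) >> 8;
}

// Hypothetical per-channel blend of premultiplied src over dst.
// Assumes src <= srcAlpha, so the sum cannot exceed 255.
static inline uint8_t blend_perfect_sketch(uint8_t dst, uint8_t src,
                                           uint8_t srcAlpha) {
    return static_cast<uint8_t>(src + div255_round(dst * (255u - srcAlpha)));
}

// Counts inputs where div255_round differs from round-to-nearest x/255,
// computed with plain integer division as the reference.
static int count_div255_mismatches() {
    int bad = 0;
    for (uint32_t x = 0; x <= 255u * 255u; ++x) {
        uint32_t expected = (2 * x + 255) / 510;  // floor(x/255 + 1/2)
        if (div255_round(x) != expected) ++bad;
    }
    return bad;
}
```

Because the divide is exact on its own, it can be checked exhaustively in
isolation, which is what makes the algorithm easy to split into pieces.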

Microbenches show this as roughly a 30% hit.  That said, microbenches also
said my previous change would be a 20-25% perf improvement, and it didn't end
up showing a significant effect at a high level.

As for correctness, I see a bunch of off-by-1 differences compared to
blend_256_round_alt (exactly what we'd expect), and one off-by-3 in a GM that
looks like it has a lot of overdraw.

BUG=skia:

Review URL: https://codereview.chromium.org/1098913002
3 files changed