sketch hooking into PNG_FILTER_OPTIMIZATIONS

Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9.

bpp=4 is mainly an easy demo.  This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel.  Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve.

Everything here can be trivially downgraded to MMX, supporting bpp up to 8.  It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on.

I've tried four approaches:
  - this way;
  - doing things naively in 16-bit;
  - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580) );
  - a mostly 8-bit version of the same.

They're all fine, but this one is consistently the fastest I've measured.
I'd be happy to settle on the naive 16-bit version too, which would have a very clear implementation that's only minorly slower than this version.  The other two are way more complicated, and would require us to draw some serious ASCII diagrams to explain.

I have learned that the .skp serialization tests (serialize-8888) have a nice side effect of testing the correctness of these filters!

(Since writing the description above, I've bumped things up to {Paeth,Sub,Avg} x { 3 bpp, 4 bpp }.)

BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1573943002

Review URL: https://codereview.chromium.org/1573943002
5 files changed