sketch hooking into PNG_FILTER_OPTIMIZATIONS
Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9.
bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve.
Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on.
I've tried four approaches:
- this way;
- doing things naively in 16-bit;
- a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580) );
- a mostly 8-bit version of the same.
They're all fine, but this one is consistently the fastest I've measured.
I'd be happy to settle on the naive 16-bit version too, which would have a very clear implementation that's only minorly slower than this version. The other two are way more complicated, and would require us to draw some serious ASCII diagrams to explain.
I have learned that the .skp serialization tests (serialize-8888) have a nice side effect of testing the correctness of these filters!
(Since writing the description above, I've bumped things up to {Paeth,Sub,Avg} x { 3 bpp, 4 bpp }.)
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1573943002
Review URL: https://codereview.chromium.org/1573943002
5 files changed