Look beyond SSE2 for Paeth

You can break this CL down into three steps.  Steps 2 and 3 depend on 1.

    Step 1: go to a 16-bit impl.  Speed ~unaffected.
    Step 2: use SSSE3 16-bit abs.  ~20% speedup to Paeth.
    Step 3: use SSE4.1 blendv.  ~25% total speedup to Paeth.
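
For readers unfamiliar with the filter, here is a portable scalar sketch of what one lane of the steps above computes. It is not the code in this CL: the helper names (blend16, le_mask16, paeth_16bit) are made up for illustration, and the real change uses SSE intrinsics on whole vectors. The point is only to show why 16-bit lanes are needed (pc can reach 510, which does not fit in 8 bits), where an abs is needed, and how a mask-driven select (the job blendv does per byte) can replace branches.

```cpp
#include <cstdint>
#include <cstdlib>

// Reference Paeth predictor, written the way the PNG specification describes it.
static uint8_t paeth_reference(uint8_t a, uint8_t b, uint8_t c) {
    int p  = a + b - c;
    int pa = std::abs(p - a);
    int pb = std::abs(p - b);
    int pc = std::abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

// Hypothetical helper: all-ones mask when x <= y, zero otherwise.
static int16_t le_mask16(int16_t x, int16_t y) {
    return static_cast<int16_t>(-static_cast<int>(x <= y));
}

// Hypothetical helper: select t where mask is all-ones, f where it is zero.
// This is the per-lane effect a blendv-style instruction provides.
static int16_t blend16(int16_t mask, int16_t t, int16_t f) {
    return static_cast<int16_t>((t & mask) | (f & ~mask));
}

// Branch-free Paeth in 16-bit arithmetic (one "lane" of a vector version).
static uint8_t paeth_16bit(uint8_t a8, uint8_t b8, uint8_t c8) {
    // Widen to 16 bits: the differences below span [-510, 510].
    int16_t a = a8, b = b8, c = c8;

    // p - a == b - c, p - b == a - c, p - c == (a - c) + (b - c).
    int16_t pa = static_cast<int16_t>(std::abs(b - c));
    int16_t pb = static_cast<int16_t>(std::abs(a - c));
    int16_t pc = static_cast<int16_t>(std::abs((a - c) + (b - c)));

    // Same tie-breaking order as the reference: a, then b, then c.
    int16_t a_wins = static_cast<int16_t>(le_mask16(pa, pb) & le_mask16(pa, pc));
    int16_t b_wins = le_mask16(pb, pc);

    int16_t b_or_c = blend16(b_wins, b, c);
    return static_cast<uint8_t>(blend16(a_wins, a, b_or_c));
}
```

Calling paeth_16bit(a, b, c) should agree with paeth_reference(a, b, c) for every byte triple; a vector version would apply the same arithmetic to eight 16-bit lanes at once.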

Overall this can improve PNG decoding by around 8% end-to-end.

I would feel most comfortable landing this only after we have a bot exercising the SSE4.1 code, either by moving the SSE4.1 code behind a function pointer (simulating Chrome/Clank) or by adding a builder that compiles with at least SSE4.1 enabled (simulating an Android system build).  We've already got plenty of bots building with SSSE3 enabled at compile time, so that path is tested.
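
A rough sketch of the function-pointer option, to make the idea concrete. Every name here (PaethFn, paeth_portable, paeth_sse41_standin, choose_paeth, paeth_dispatch) is hypothetical and not taken from Skia, and the "SSE4.1" function is only a scalar stand-in so the sketch compiles anywhere. The CPU check assumes the GCC/Clang x86 builtins __builtin_cpu_init and __builtin_cpu_supports; other toolchains would need their own check.

```cpp
#include <cstdint>
#include <cstdlib>

using PaethFn = uint8_t (*)(uint8_t a, uint8_t b, uint8_t c);

// Baseline path: plain scalar Paeth predictor.
static uint8_t paeth_portable(uint8_t a, uint8_t b, uint8_t c) {
    int p  = a + b - c;
    int pa = std::abs(p - a);
    int pb = std::abs(p - b);
    int pc = std::abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

// Stand-in for the SSE4.1 path.  In a real build this would live in a file
// compiled with SSE4.1 enabled; here it forwards to the scalar code.
static uint8_t paeth_sse41_standin(uint8_t a, uint8_t b, uint8_t c) {
    return paeth_portable(a, b, c);
}

// Runtime CPU check (assumption: GCC or Clang targeting x86).
static bool cpu_has_sse41() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.1") != 0;
#else
    return false;
#endif
}

// Taking the feature flag as a parameter lets a test force either path,
// regardless of the machine the test happens to run on.
static PaethFn choose_paeth(bool has_sse41) {
    return has_sse41 ? paeth_sse41_standin : paeth_portable;
}

// Picks the implementation once, on first use.
static PaethFn paeth_dispatch() {
    static const PaethFn fn = choose_paeth(cpu_has_sse41());
    return fn;
}
```

Callers would go through paeth_dispatch()(a, b, c). With this shape, a bot on SSE4.1-capable hardware exercises the SSE4.1 path without needing SSE4.1 enabled for the whole build.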

BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dimage&master=false&issue=1657503002

Review URL: https://codereview.chromium.org/1657503002
1 file changed