Add SSSE3 Optimizations for premul and swap

Improves deocde performance for RGBA pngs.

Swizzler Time on z620 (clang):
SwapPremul 0.24x
Premul     0.24x
Swap       0.37x
Decode Time on z620 (clang):
Premul   ZeroInit Decodes 0.88x
Unpremul ZeroInit Decodes 0.94x
Premul   Regular  Decodes 0.91x
Unpremul Regular  Decodes 0.98x

Swizzler Time in Dell Venue 8 (gcc):
SwapPremul 0.14x
Premul     0.14x
Swap       0.08x
Decode Time on Dell Venus 8 (gcc):
Premul   ZeroInit Decodes 0.79x
Premul   Regular  Decodes 0.77x

Note:
ZeroInit means memory is zero initialized, and we do not write to
memory for large sections of zero pixels (memory use opt for Android).

BUG=skia:4767
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1601883002
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Review URL: https://codereview.chromium.org/1601883002
2 files changed