Optimized premultiplying swizzles for NEON

Improves decode performance for RGBA encoded PNGs.

Swizzle Time on Nexus 9 (with clang):
SwapPremul 0.44x
Premul     0.44x
Decode Time On Nexus 9 (with clang):
ZeroInit Decodes 0.85x
Regular  Decodes 0.86x

Swizzle Time on Nexus 6P (with clang)
SwapPremul 0.14x
Premul     0.14x
Decode Time On Nexus 6P (with clang):
ZeroInit Decodes 0.93x
Regular  Decodes 0.95x

Notes:
ZeroInit means memory is zero initialized, and we do not write to
memory for large sections of zero pixels (memory use opt for Android).

A profile on Nexus 9 shows that the premultiplication step of PNG
decoding is now ~5% of decode time (down from ~20%).

BUG=skia:4767
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1577703006
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Review URL: https://codereview.chromium.org/1577703006
3 files changed