Faster 4f gradient premul path

Similar to https://codereview.chromium.org/2409583003/, perform the
premul in 4f.  It turns out it's even faster to avoid the 255 load
multiplication in this case.

Also includes some template plumbing because DstTraits<>::load now needs
to be premul-aware (previously it wasn't).

R=reed@google.com
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2416233002

Review-Url: https://codereview.chromium.org/2416233002
5 files changed