Add Sk4f::ToBytes(uint8_t[16], Sk4f, Sk4f, Sk4f, Sk4f)

This is a big speedup for float -> byte conversion.  E.g. gradient_linear_clamp_3color:
 x86-64 147µs -> 103µs  (Broadwell MBP)
 arm64 2.03ms -> 648µs  (Galaxy S6)
 armv7 1.12ms -> 489µs  (Galaxy S6, same device!)
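
For reference, a rough scalar sketch of what converting four 4-float
vectors into 16 bytes in one call looks like.  Float4 and to_bytes are
hypothetical stand-ins, not Skia's Sk4f or its SIMD code paths, and the
clamp-to-[0,255]-then-truncate behavior is an assumption made only for
this illustration:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for a 4-lane float vector; this is not Skia's Sk4f.
struct Float4 {
    float v[4];
};

// Converts 4 vectors x 4 lanes into 16 consecutive bytes.
// Assumption: each lane is clamped to [0, 255] and then truncated toward
// zero.  NaN inputs are not handled in this sketch.
static void to_bytes(uint8_t bytes[16],
                     const Float4& a, const Float4& b,
                     const Float4& c, const Float4& d) {
    const Float4* in[4] = { &a, &b, &c, &d };
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float f = std::min(std::max(in[i]->v[j], 0.0f), 255.0f);
            bytes[4 * i + j] = static_cast<uint8_t>(f);
        }
    }
}
```

Taking all four vectors at once is what lets a SIMD version narrow 16
floats and write them with a single 16-byte store, rather than
converting and storing each vector separately.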

BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot

Review URL: https://codereview.chromium.org/1483953002
5 files changed