Unify some SkNx code

 - one generic base case and one N=1 case, instead of two of each (or three counting doubles)
 - use SkNx_cast instead of FromBytes/toBytes
 - 4-at-a-time Sk4f::ToBytes becomes a special standalone Sk4f_ToBytes
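
For reviewers unfamiliar with the new spelling, here is a rough scalar model of what `SkNx_cast<uint8_t>(dist).store(fi)` is expected to do at the call sites in the diff below. This is only a sketch: `cast_and_store_u8` and `cast_and_store_u8_example` are hypothetical names, not Skia APIs, and the sketch assumes every lane is already in [0, 255] and that the conversion truncates toward zero like a plain `static_cast`. The SIMD backends may treat out-of-range lanes differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for SkNx_cast<uint8_t>(v).store(out).
// Not Skia's implementation; it only models the per-lane behavior.
// Assumption: each lane is in [0, 255]. Converting an out-of-range float
// to uint8_t is undefined behavior in C++, so callers must clamp first
// (the gradient code does this with Sk4f::Min(..., max)).
template <std::size_t N>
void cast_and_store_u8(const float (&lanes)[N], uint8_t (&out)[N]) {
    for (std::size_t i = 0; i < N; i++) {
        out[i] = static_cast<uint8_t>(lanes[i]);  // truncates toward zero
    }
}

// Mirrors the shape of the call sites: four float distances become four
// byte indices into a lookup table.
inline void cast_and_store_u8_example() {
    const float dist[4] = {0.0f, 1.9f, 128.5f, 255.0f};
    uint8_t fi[4];
    cast_and_store_u8(dist, fi);
    assert(fi[0] == 0 && fi[1] == 1 && fi[2] == 128 && fi[3] == 255);
}
```

The point of the change is that the conversion and the store become two separate, general steps, so no float-to-byte special case needs to live on `Sk4f` itself.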

If I did everything right, this should be perf- and pixel-neutral.

https://gold.skia.org/search2?issue=1526523003&unt=true&query=source_type%3Dgm&master=false

BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Review URL: https://codereview.chromium.org/1526523003
diff --git a/src/effects/gradients/SkRadialGradient.cpp b/src/effects/gradients/SkRadialGradient.cpp
index 52d0639..d734aa0 100644
--- a/src/effects/gradients/SkRadialGradient.cpp
+++ b/src/effects/gradients/SkRadialGradient.cpp
@@ -307,7 +307,7 @@
             dR = dR + ddR;
 
             uint8_t fi[4];
-            dist.toBytes(fi);
+            SkNx_cast<uint8_t>(dist).store(fi);
 
             for (int i = 0; i < 4; i++) {
                 *dstC++ = cache[toggle + fi[i]];
@@ -319,7 +319,7 @@
             Sk4f dist = Sk4f::Min(fast_sqrt(R), max);
 
             uint8_t fi[4];
-            dist.toBytes(fi);
+            SkNx_cast<uint8_t>(dist).store(fi);
             for (int i = 0; i < count; i++) {
                 *dstC++ = cache[toggle + fi[i]];
                 toggle = next_dither_toggle(toggle);