NEON f32 <-> f16 and f32 <-> u16

Adds ARMv7 and ARMv8 NEON code for f32 <-> f16 conversion.
Also adds NEON f32 <-> u16 code to make the comparison fair.
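
As a usage sketch (my own illustration, not part of the patch; it assumes the
usual SkNx_cast<D>(src) call form from SkNx.h and a hypothetical helper name),
the new u16 casts round-trip like this:

    #include "SkNx.h"   // Sk4f, Sk4h, SkNx_cast; pulls in SkNx_neon.h on ARM builds

    // Hypothetical helper, not in the patch: round-trip four floats
    // through unsigned 16-bit lanes using the new casts.
    static Sk4f roundtrip_u16(const Sk4f& fs) {
        Sk4h u16s = SkNx_cast<uint16_t>(fs);   // f32 -> u32 (truncating), saturating narrow to u16
        return SkNx_cast<float>(u16s);         // widen u16 -> u32, convert to f32
    }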

The NDK's GCC does not support the ARMv8 NEON intrinsics needed for the fastest path, so we use a tiny amount of inline assembly instead.
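
For illustration, a minimal sketch of what such inline assembly can look like
on ARMv8 (my own sketch, not the code in this patch; it assumes an AArch64
target, stores halfs as raw uint16x4_t bit patterns, and uses made-up
function names):

    #include <arm_neon.h>

    #if defined(__aarch64__)
    // FCVTN narrows four f32 lanes to four f16 lanes in one instruction.
    static inline uint16x4_t f32_to_f16_sketch(float32x4_t fs) {
        uint16x4_t hs;
        asm ("fcvtn %0.4h, %1.4s" : "=w"(hs) : "w"(fs));
        return hs;
    }

    // FCVTL widens four f16 lanes back to f32.
    static inline float32x4_t f16_to_f32_sketch(uint16x4_t hs) {
        float32x4_t fs;
        asm ("fcvtl %0.4s, %1.4h" : "=w"(fs) : "w"(hs));
        return fs;
    }
    #endif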

The ARMv7 half -> float code is different enough from the SSE version that it does not make sense to express it with SkNx.
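
For context, here is a hedged sketch of the kind of NEON integer bit
manipulation a half -> float routine can use on ARMv7 (the well-known
shift/reinterpret/rescale trick for finite values; this is my own
illustration with a made-up function name, not the code in this patch):

    #include <arm_neon.h>

    // Sketch only: finite half -> float.  Half layout is 1 sign | 5 exponent
    // | 10 mantissa; float is 1 | 8 | 23.  ARMv7 NEON runs flush-to-zero, so
    // denormal halfs come out as 0, and inf/NaN are not handled here.
    static inline float32x4_t half_to_float_sketch(uint16x4_t hs) {
        uint32x4_t bits = vmovl_u16(hs);                          // widen to 32 bits
        uint32x4_t sign = vandq_u32(bits, vdupq_n_u32(0x8000));   // save the sign bit
        uint32x4_t em   = veorq_u32(bits, sign);                  // exponent + mantissa only

        // Shift exponent/mantissa into float position, reinterpret as float,
        // then multiply by 2^(127-15) to fix up the exponent bias.
        float32x4_t f = vreinterpretq_f32_u32(vshlq_n_u32(em, 13));
        f = vmulq_f32(f, vdupq_n_f32(0x1.0p+112f));

        // Put the sign bit back in bit 31.
        return vreinterpretq_f32_u32(vorrq_u32(vreinterpretq_u32_f32(f),
                                               vshlq_n_u32(sign, 16)));
    }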

Still TODO:
ARMv7 float -> half.  Naively translating the SSE version results in 0x0000 where we'd expect a denormal output.

BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1700473003
CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Review URL: https://codereview.chromium.org/1700473003
diff --git a/src/opts/SkNx_neon.h b/src/opts/SkNx_neon.h
index 641e9d2..be37baf 100644
--- a/src/opts/SkNx_neon.h
+++ b/src/opts/SkNx_neon.h
@@ -365,6 +365,14 @@
 #undef SHIFT16
 #undef SHIFT8
 
+template<> inline Sk4h SkNx_cast<uint16_t, float>(const Sk4f& src) {
+    return vqmovn_u32(vcvtq_u32_f32(src.fVec));
+}
+
+template<> inline Sk4f SkNx_cast<float, uint16_t>(const Sk4h& src) {
+    return vcvtq_f32_u32(vmovl_u16(src.fVec));
+}
+
 template<> inline Sk4b SkNx_cast<uint8_t, float>(const Sk4f& src) {
     uint32x4_t _32 = vcvtq_u32_f32(src.fVec);
     uint16x4_t _16 = vqmovn_u32(_32);