NEON f32 <-> f16 and f32 <-> u16
Adds ARMv7 and ARMv8 NEON code for f32 <-> f16 conversion.
Also adds NEON f32 <-> u16 code to make the comparison fair.
The NDK GCC does not support the ARMv8 NEON intrinsics needed for the fastest path, so we use a small amount of inline assembly.
The ARMv7 half -> float is different enough from the SSE version that it does not make sense to use SkNx.
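For reference, the per-lane behavior of the new f32 <-> u16 casts can be sketched in scalar C++. This is a minimal sketch, not code from this change: the names f32_to_u16_ref and u16_to_f32_ref are hypothetical. It assumes the usual NEON semantics, where vcvtq_u32_f32 truncates toward zero and saturates (NaN becomes 0) and vqmovn_u32 saturates to 0xFFFF when narrowing.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical scalar model of one lane of SkNx_cast<uint16_t, float>:
// vcvtq_u32_f32 (truncate toward zero, saturate, NaN -> 0) followed by
// vqmovn_u32 (saturating narrow from 32 to 16 bits).
static inline uint16_t f32_to_u16_ref(float x) {
    if (!(x > 0.0f)) {
        return 0;                      // negatives, zero, and NaN all map to 0
    }
    if (x >= 65535.0f) {
        return 0xFFFF;                 // saturate instead of wrapping
    }
    return static_cast<uint16_t>(x);   // in range: truncate toward zero
}

// Hypothetical scalar model of one lane of SkNx_cast<float, uint16_t>:
// vmovl_u16 (zero-extend) followed by vcvtq_f32_u32. Every uint16_t value
// is exactly representable as a float, so this direction is lossless.
static inline float u16_to_f32_ref(uint16_t x) {
    return static_cast<float>(x);
}
```

Note the truncation: 1.9f becomes 1, not 2, so callers that want rounding have to add 0.5f before the cast.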
Still TODO:
ARMv7 float -> half. Naively translating the SSE version results in 0x0000 where we'd expect a denormal output.
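To make the expected denormal output concrete, here is a hedged scalar sketch of a truncating float -> half conversion. It is not the SSE or NEON code from this change: float_to_half_trunc is a hypothetical reference helper. It rounds toward zero and collapses Inf and NaN inputs to a half infinity for brevity.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical scalar reference: truncating float -> half.
// Half denormals encode value = mantissa * 2^-24 with a zero exponent field.
static inline uint16_t float_to_half_trunc(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);

    const uint32_t sign = (bits >> 16) & 0x8000u;
    const int32_t  exp  = static_cast<int32_t>((bits >> 23) & 0xFF) - 127;  // unbiased
    uint32_t       mant = bits & 0x007FFFFFu;

    if (exp > 15) {
        // Too large for half (this also catches Inf/NaN inputs): signed infinity.
        return static_cast<uint16_t>(sign | 0x7C00u);
    }
    if (exp >= -14) {
        // Normal half: rebias the exponent, keep the top 10 mantissa bits.
        return static_cast<uint16_t>(sign | (static_cast<uint32_t>(exp + 15) << 10) | (mant >> 13));
    }
    if (exp >= -24) {
        // Denormal half: make the implicit 1 explicit, then shift it into place.
        mant |= 0x00800000u;
        const int shift = -exp - 1;    // 14 for exp == -15, 23 for exp == -24
        return static_cast<uint16_t>(sign | (mant >> shift));
    }
    // Too small even for a half denormal (includes float zero and float denormals).
    return static_cast<uint16_t>(sign);
}
```

For example, 2^-20 is below the smallest normal half (2^-14), so the expected result is the denormal 0x0010 (16 * 2^-24), not 0x0000.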
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1700473003
CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1700473003
diff --git a/src/opts/SkNx_neon.h b/src/opts/SkNx_neon.h
index 641e9d2..be37baf 100644
--- a/src/opts/SkNx_neon.h
+++ b/src/opts/SkNx_neon.h
@@ -365,6 +365,14 @@
#undef SHIFT16
#undef SHIFT8
+template<> inline Sk4h SkNx_cast<uint16_t, float>(const Sk4f& src) {
+    return vqmovn_u32(vcvtq_u32_f32(src.fVec));
+}
+
+template<> inline Sk4f SkNx_cast<float, uint16_t>(const Sk4h& src) {
+    return vcvtq_f32_u32(vmovl_u16(src.fVec));
+}
+
template<> inline Sk4b SkNx_cast<uint8_t, float>(const Sk4f& src) {
uint32x4_t _32 = vcvtq_u32_f32(src.fVec);
uint16x4_t _16 = vqmovn_u32(_32);