4x8 GEMM and IGEMM microkernels for Cortex-A55.  7.8% faster end-to-end on MobileNet v2.

Was f32_gemm_4x8__aarch64_neonfma_cortex_a53/mobilenet_v2/real_time      132632 us
Now f32_gemm_4x8__aarch64_neonfma_cortex_a55/mobilenet_v2/real_time      123029 us

Revision 1 of the Cortex-A55 can co-issue a 64-bit
vector load with each FMA, so rearrange the Cortex-A53
microkernel into groups of 3 FMAs paired with 2 loads and an INS.

PiperOrigin-RevId: 301202721
diff --git a/test/f32-gemminc.yaml b/test/f32-gemminc.yaml
index bc472f4..cb4670a 100644
--- a/test/f32-gemminc.yaml
+++ b/test/f32-gemminc.yaml
@@ -18,6 +18,10 @@
   k-block: 4
   pipelined: true
   assembly: true
+- name: xnn_f32_gemminc_ukernel_4x8__aarch64_neonfma_cortex_a55
+  k-block: 4
+  pipelined: true
+  assembly: true
 - name: xnn_f32_gemminc_ukernel_4x8__aarch64_neonfma_cortex_a57
   k-block: 8
   pipelined: true