Cortex-A7 microkernel based on LD64 with PLD added. 3.2% faster in end-to-end MobileNet v2.
PLD instructions moved to the end of the loop to improve VMLA performance.
pld_ld64 microkernel removed.
Was
MobileNetV2_F32/XNNPACK/T:1/real_time 511808 us 509497 us 14 FLOPS=1.17534G/s FPS=1.95386/s Freq=1.1904G
Now
MobileNetV2_F32/XNNPACK/T:1/real_time 496032 us 496007 us 14 FLOPS=1.21273G/s FPS=2.016/s Freq=1.1904G
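As a quick sanity check on the 3.2% figure, the following minimal sketch recomputes the gain from the two real_time values quoted above. The function names are illustrative only and are not part of XNNPACK or its benchmark harness; the sketch assumes the first time column is wall-clock time per iteration in microseconds.

```python
# Real-time figures (microseconds per iteration) copied from the
# "Was" and "Now" benchmark lines in this message.
before_us = 511808
after_us = 496032


def throughput_gain_percent(before, after):
    """Percent increase in inferences per second (hypothetical helper)."""
    return (before / after - 1.0) * 100.0


def time_saved_percent(before, after):
    """Percent reduction in time per inference (hypothetical helper)."""
    return (before - after) / before * 100.0


print(f"throughput gain: {throughput_gain_percent(before_us, after_us):.1f}%")
print(f"time saved:      {time_saved_percent(before_us, after_us):.1f}%")
```

The throughput gain matches the quoted 3.2% and agrees with the FPS counters (1.95386/s to 2.016/s); measured as time saved per inference, the same change is closer to 3.1%.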
PiperOrigin-RevId: 321691241
diff --git a/test/f32-igemm-minmax.yaml b/test/f32-igemm-minmax.yaml
index 8274224..acfb79d 100644
--- a/test/f32-igemm-minmax.yaml
+++ b/test/f32-igemm-minmax.yaml
@@ -34,7 +34,7 @@
k-block: 2
pipelined: false
assembly: true
-- name: xnn_f32_igemm_minmax_ukernel_4x8__aarch32_neon_pld_ld64
+- name: xnn_f32_igemm_minmax_ukernel_4x8__aarch32_neon_cortex_a7
k-block: 2
pipelined: false
assembly: true