Cortex-A7 microkernel based on LD64 with PLD added. 3.2% faster end to end on MobileNet v2.
PLD instructions moved to the end of the loop to improve VMLA performance.
The pld_ld64 microkernel is removed.

Was
MobileNetV2_F32/XNNPACK/T:1/real_time     511808 us       509497 us           14 FLOPS=1.17534G/s FPS=1.95386/s Freq=1.1904G

Now
MobileNetV2_F32/XNNPACK/T:1/real_time     496032 us       496007 us           14 FLOPS=1.21273G/s FPS=2.016/s Freq=1.1904G
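
The 3.2% figure can be checked against the two real_time values above. A minimal sketch, assuming the gain is the ratio of the old time to the new time minus one; the helper name speedup_percent is hypothetical and not part of XNNPACK or the benchmark tooling:

```python
def speedup_percent(before_us: float, after_us: float) -> float:
    """Throughput gain in percent when run time drops from before_us to after_us."""
    return (before_us / after_us - 1.0) * 100.0


# real_time values copied from the "Was" and "Now" benchmark lines.
was_real_time_us = 511808
now_real_time_us = 496032

gain = speedup_percent(was_real_time_us, now_real_time_us)
print(f"{gain:.1f}% faster")
```

The same ratio is obtained from the reported FLOPS values (1.21273G/s over 1.17534G/s), since the clock frequency is identical in both runs.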

PiperOrigin-RevId: 321691241
diff --git a/test/f32-igemm-minmax.yaml b/test/f32-igemm-minmax.yaml
index 8274224..acfb79d 100644
--- a/test/f32-igemm-minmax.yaml
+++ b/test/f32-igemm-minmax.yaml
@@ -34,7 +34,7 @@
   k-block: 2
   pipelined: false
   assembly: true
-- name: xnn_f32_igemm_minmax_ukernel_4x8__aarch32_neon_pld_ld64
+- name: xnn_f32_igemm_minmax_ukernel_4x8__aarch32_neon_cortex_a7
   k-block: 2
   pipelined: false
   assembly: true