QS8 NEON GEMM C8 microkernel using 8-bit multiply (mull) and vpadal to accumulate.

The C8 kernel computes partial sums with mull, multiplying 8-bit inputs into
16-bit products a full 64 bits (8 elements) at a time.
padal then adds adjacent pairs, lengthens them to 32 bits, and accumulates.
Each group of 4 int accumulators represents 1 byte in the final output, so
there is one accumulator vector for each output element.
The 4 ints are added together outside the loop.
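
As a rough illustration of the accumulation scheme described above, here is a
portable scalar C model of one output element. It is a sketch, not the
generated kernel: the helper names mull_s8, padal_s16 and dot_c8 are
hypothetical, the loops stand in for the vmull_s8 and vpadalq_s16
instructions, and kc is assumed to be a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for vmull_s8: 8 x (int8 * int8) -> 8 x int16.
 * The largest product, (-128) * (-128) = 16384, fits in int16. */
static void mull_s8(int16_t out[8], const int8_t a[8], const int8_t b[8]) {
  for (int i = 0; i < 8; i++) {
    out[i] = (int16_t) ((int16_t) a[i] * (int16_t) b[i]);
  }
}

/* Scalar stand-in for vpadalq_s16: add adjacent int16 pairs,
 * lengthen to int32, and accumulate into 4 int32 lanes. */
static void padal_s16(int32_t acc[4], const int16_t p[8]) {
  for (int i = 0; i < 4; i++) {
    acc[i] += (int32_t) p[2 * i] + (int32_t) p[2 * i + 1];
  }
}

/* Hypothetical helper: one output element of the GEMM, i.e. the dot
 * product of a row of A with a column of B. Assumes kc % 8 == 0. */
static int32_t dot_c8(const int8_t* a, const int8_t* b, size_t kc) {
  int32_t acc[4] = {0, 0, 0, 0};  /* one accumulator vector per output element */
  for (size_t k = 0; k < kc; k += 8) {
    int16_t prod[8];
    mull_s8(prod, a + k, b + k);  /* 8 bytes (64 bits) per step */
    padal_s16(acc, prod);
  }
  /* The 4 ints are added together outside the loop. */
  return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Deferring the 4-lane reduction until after the loop keeps the inner loop to
one multiply and one pairwise accumulate per 8 input bytes.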

PiperOrigin-RevId: 354631007
diff --git a/BUILD.bazel b/BUILD.bazel
index cb69953..78aa457 100644
--- a/BUILD.bazel
+++ b/BUILD.bazel
@@ -1722,6 +1722,14 @@
     "src/qs8-gemm/gen/4x16-minmax-neon-mlal-lane.c",
     "src/qs8-gemm/gen/4x16-minmax-neon-mull-addw-dup.c",
     "src/qs8-gemm/gen/4x16c2-minmax-neon-mull-padal-dup.c",
+    "src/qs8-gemm/gen/1x8c8-minmax-neon-mull-padal.c",
+    "src/qs8-gemm/gen/2x8c8-minmax-neon-mull-padal.c",
+    "src/qs8-gemm/gen/3x8c8-minmax-neon-mull-padal.c",
+    "src/qs8-gemm/gen/4x8c8-minmax-neon-mull-padal.c",
+    "src/qs8-gemm/gen/1x16c8-minmax-neon-mull-padal.c",
+    "src/qs8-gemm/gen/2x16c8-minmax-neon-mull-padal.c",
+    "src/qs8-gemm/gen/3x16c8-minmax-neon-mull-padal.c",
+    "src/qs8-gemm/gen/4x16c8-minmax-neon-mull-padal.c",
     "src/qs8-igemm/gen/1x8-minmax-neon-mlal-lane.c",
     "src/qs8-igemm/gen/1x16-minmax-neon-mlal-lane.c",
     "src/qs8-igemm/gen/2x8-minmax-neon-mlal-lane.c",