Support QC8 DWCONV microkernels

- Minimal set of 8-bit fixed-point microkernels with per-channel quantization
(QC8) optimized for AVX2.
- Extend packing functions to allow extra space after the kernel data.
Per-channel quantization parameters are later packed into that space.
- Extend DWConvMicrokernelTester to support unit testing of QC8 DWCONV.

PiperOrigin-RevId: 377620717
diff --git a/bench/f32-dwconv.cc b/bench/f32-dwconv.cc
index d4766c3..9dbcad7 100644
--- a/bench/f32-dwconv.cc
+++ b/bench/f32-dwconv.cc
@@ -83,7 +83,7 @@
   std::vector<float, AlignedAllocator<float, 32>> w(w_elements * num_buffers);
   std::fill(w.begin(), w.end(), 0.0f);
   xnn_pack_f32_dwconv_ghw_w(kernel_height, kernel_width, channels, cr,
-      k.data(), b.data(), w.data(), nullptr);
+      k.data(), b.data(), w.data(), 0 /* extra bytes */, nullptr);
   for (size_t n = 1; n < num_buffers; n++) {
     std::copy(w.cbegin(), w.cbegin() + w_elements, w.begin() + n * w_elements);
   }