Issue #13624: Write a specialized UTF-8 encoder to allow more optimization

The main bottleneck was the PyUnicode_READ() macro.
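
A rough standalone sketch of the problem (not the actual encoder from this
change; every type and function name below is invented for illustration):
PyUnicode_READ() must dispatch on the string's kind (1, 2 or 4 bytes per code
point) for every character it reads, whereas a loop specialized for a single
kind reads a plain array and leaves no per-character branch in the hot path.

/* Illustrative sketch only -- not the CPython implementation. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint8_t  ucs1;
typedef uint16_t ucs2;
typedef uint32_t ucs4;

enum kind { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Generic read: the switch on the kind runs once per character. */
static ucs4 read_generic(int kind, const void *data, size_t i)
{
    switch (kind) {
    case KIND_1BYTE: return ((const ucs1 *)data)[i];
    case KIND_2BYTE: return ((const ucs2 *)data)[i];
    default:         return ((const ucs4 *)data)[i];
    }
}

/* Counts ASCII code points through the generic reader. */
static size_t count_ascii_generic(int kind, const void *data, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (read_generic(kind, data, i) < 0x80);
    return n;
}

/* Specialized loop: the kind is fixed, so the body is a plain array scan
 * the compiler can vectorize; no per-character dispatch remains. */
static size_t count_ascii_ucs1(const ucs1 *data, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (data[i] < 0x80);
    return n;
}

int main(void)
{
    const char *txt = "hello, world";
    const ucs1 *s = (const ucs1 *)txt;
    size_t len = strlen(txt);
    printf("%zu %zu\n",
           count_ascii_generic(KIND_1BYTE, s, len),
           count_ascii_ucs1(s, len));
    return 0;
}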
diff --git a/Doc/whatsnew/3.3.rst b/Doc/whatsnew/3.3.rst
index 4f54b69..8ca94c9 100644
--- a/Doc/whatsnew/3.3.rst
+++ b/Doc/whatsnew/3.3.rst
@@ -712,7 +712,9 @@
   * the memory footprint is divided by 2 to 4 depending on the text
   * encode an ASCII string to UTF-8 doesn't need to encode characters anymore,
     the UTF-8 representation is shared with the ASCII representation
-  * getting a substring of a latin1 strings is 4 times faster
+  * the UTF-8 encoder has been optimized
+  * repeating a single ASCII letter and getting a substring of an ASCII
+    string are 4 times faster
 
 
 Build and C API Changes