Reland "Integrate SIMD optimisations for zlib"

This version uses a "pthread_once" implementation for Windows, built on
Windows synchronisation primitives and imported from tcmalloc.

Previous CLs:
https://codereview.chromium.org/677713002/
https://codereview.chromium.org/552123005

This version of the CL also runs fine on Windows Server 2003.

These optimisations have been published on the zlib mailing list and at
https://github.com/jtkukunas/zlib/

This change merges the following optimisation patches:
- "For x86, add CPUID check."
- "Adds SSE2 optimized hash shifting to fill_window."
- "add SSE4.2 optimized hash function"
- "add PCLMULQDQ optimized CRC folding"

The patches are by Jim Kukunas <james.t.kukunas@linux.intel.com>; this change
adapts them to the current zlib version in Chromium.

The optimisations are enabled at runtime only if all the necessary CPU features
are present. Because they require extra cflags so that the compiler can emit the
instructions, they are built in their own static library, with a stub
implementation to allow linking on other platforms.
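
A minimal sketch of the runtime gating described above, not the code in this
CL: every "sketch_" name is hypothetical, the dispatch target is a
placeholder, and the set of required features (SSE2, SSE4.2, PCLMULQDQ) is
inferred from the patch titles listed earlier. It assumes a GCC- or
Clang-style <cpuid.h> on x86 and plain POSIX pthread_once; on any other
target detection is skipped and the generic path is used.

```c
#include <pthread.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#else
#define SKETCH_HAVE_CPUID 0
#endif

/* Hypothetical flag: 1 only when every required feature is present. */
static int sketch_simd_enabled = 0;
static pthread_once_t sketch_cpu_once = PTHREAD_ONCE_INIT;

/* Runs exactly once, even if several threads race to call the checker. */
static void sketch_detect_cpu(void) {
#if SKETCH_HAVE_CPUID
    unsigned int eax, ebx, ecx, edx;
    /* CPUID leaf 1 reports the feature bits in ECX and EDX. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        int has_sse2      = (edx >> 26) & 1;  /* EDX bit 26 */
        int has_sse42     = (ecx >> 20) & 1;  /* ECX bit 20 */
        int has_pclmulqdq = (ecx >> 1) & 1;   /* ECX bit 1  */
        sketch_simd_enabled = has_sse2 && has_sse42 && has_pclmulqdq;
    }
#endif
    /* Without CPUID support the flag stays 0 (generic path). */
}

/* Returns 1 if the SIMD paths may be used, 0 otherwise. */
int sketch_check_features(void) {
    pthread_once(&sketch_cpu_once, sketch_detect_cpu);
    return sketch_simd_enabled;
}

/* Placeholder implementations standing in for the real routines. */
static const char *sketch_fill_window_generic(void) { return "generic"; }
static const char *sketch_fill_window_simd(void)    { return "simd"; }

/* Dispatch at the call site: take the SIMD path only when enabled. */
const char *sketch_fill_window(void) {
    return sketch_check_features() ? sketch_fill_window_simd()
                                   : sketch_fill_window_generic();
}
```

Guarding detection with a once primitive keeps the CPUID query out of the hot
path and makes the first call safe from multiple threads, which is presumably
why a pthread_once equivalent is needed on Windows.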

TEST=net_unittests(GZipUnitTest) passes, Chrome functions, and a performance
improvement is seen on the RoboHornet benchmark on Linux Desktop
BUG=401517

Review URL: https://codereview.chromium.org/678423002

Cr-Original-Commit-Position: refs/heads/master@{#302799}
Cr-Mirrored-From: https://chromium.googlesource.com/chromium/src
Cr-Mirrored-Commit: 02a95e3084f979084fa8586e1718a6e6dd4c22da
diff --git a/deflate.h b/deflate.h
index 2fe6fd6..d15f2b5 100644
--- a/deflate.h
+++ b/deflate.h
@@ -107,6 +107,8 @@
     Byte  method;        /* STORED (for zip only) or DEFLATED */
     int   last_flush;    /* value of flush param for previous deflate call */
 
+    unsigned zalign(16) crc0[4 * 5];
+
                 /* used by deflate.c: */
 
     uInt  w_size;        /* LZ77 window size (32K by default) */
@@ -344,4 +346,14 @@
               flush = _tr_tally(s, distance, length)
 #endif
 
+/* Functions that are SIMD optimised on x86 */
+void ZLIB_INTERNAL crc_fold_init(deflate_state* const s);
+void ZLIB_INTERNAL crc_fold_copy(deflate_state* const s,
+                                 unsigned char* dst,
+                                 const unsigned char* src,
+                                 long len);
+unsigned ZLIB_INTERNAL crc_fold_512to32(deflate_state* const s);
+
+void ZLIB_INTERNAL fill_window_sse(deflate_state* s);
+
 #endif /* DEFLATE_H */