Optimize ARM64 SIMD code for Cavium ThunderX

Per @ssvb:
ThunderX is an ARM64 chip that dedicates most of its transistor real
estate to providing 48 cores, so each core is not as fast as a result.
Each core is dual-issue & in-order for scalar instructions and has
only a single-issue half-width NEON unit, so the peak throughput is
one 128-bit instruction per 2 cycles.  So careful instruction
scheduling is important.  Furthermore, ThunderX has an extremely slow
implementation of ld2 and ld3, so this commit implements the equivalent
of those instructions using ld1.

Compression speedup relative to libjpeg-turbo 1.4.2:
48-core ThunderX (RunAbove ARM Cloud), Linux, 64-bit:  58-85% (avg. 74%)
  relative to jpeg-6b:  1.75-2.14x (avg. 1.95x)

Refer to #49 and #51 for discussion.

Closes #51.

This commit also wordsmiths the ChangeLog entry (the ARMv8 SIMD
implementation is "complete" only for compression-- it still lacks some
decompression algorithms, as does the ARMv7 implementation.)

Based on:
https://github.com/mayeut/libjpeg-turbo/commit/9405b5fd031558113bdfeae193a2b14baa589a75

which is based on:
https://github.com/libjpeg-turbo/libjpeg-turbo/commit/f561944ff70adef65bb36212913bd28e6a2926d6
https://github.com/libjpeg-turbo/libjpeg-turbo/commit/962c8ab21feb3d7fc2a7a1ec8d26f6b985bbb86f
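
The message above says that ld2 and ld3 are replaced by equivalent code
built on ld1.  As a rough illustration of what "equivalent" means for the
two-way case, the sketch below models the instructions in portable scalar C:
a de-interleaving ld2 on one side, and a contiguous ld1 followed by
uzp1/uzp2 unzips on the other.  All function names here are hypothetical,
and the ld1 + uzp pairing is only one possible sequence, assumed for
illustration.  The commit's real code is AArch64 assembly and may use a
different ld1-based sequence.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of "ld2 {v0.8b, v1.8b}, [src]": a structured load that
 * splits 16 interleaved bytes into even-indexed and odd-indexed elements. */
static void ld2_model(const uint8_t src[16], uint8_t even[8], uint8_t odd[8])
{
  for (int i = 0; i < 8; i++) {
    even[i] = src[2 * i];
    odd[i] = src[2 * i + 1];
  }
}

/* Scalar model of "uzp1 vd.8b, va.8b, vb.8b" (odd_part == 0) and
 * "uzp2 vd.8b, va.8b, vb.8b" (odd_part == 1): take the even- or
 * odd-numbered elements of the concatenation a:b. */
static void uzp_model(const uint8_t a[8], const uint8_t b[8], int odd_part,
                      uint8_t out[8])
{
  for (int i = 0; i < 4; i++) {
    out[i] = a[2 * i + odd_part];
    out[i + 4] = b[2 * i + odd_part];
  }
}

/* Assumed replacement sequence: a plain contiguous load of two 8-byte
 * registers ("ld1 {v0.8b, v1.8b}, [src]") followed by uzp1 and uzp2. */
void ld1_uzp_model(const uint8_t src[16], uint8_t even[8], uint8_t odd[8])
{
  uint8_t a[8], b[8];

  memcpy(a, src, 8);        /* first ld1 register:  bytes 0..7  */
  memcpy(b, src + 8, 8);    /* second ld1 register: bytes 8..15 */
  uzp_model(a, b, 0, even); /* uzp1 */
  uzp_model(a, b, 1, odd);  /* uzp2 */
}

/* Returns 1 if both models produce identical results for a test pattern. */
int deinterleave_models_agree(void)
{
  uint8_t src[16], e1[8], o1[8], e2[8], o2[8];

  for (int i = 0; i < 16; i++)
    src[i] = (uint8_t)(i * 7 + 3);
  ld2_model(src, e1, o1);
  ld1_uzp_model(src, e2, o2);
  return memcmp(e1, e2, 8) == 0 && memcmp(o1, o2, 8) == 0;
}
```

Calling deinterleave_models_agree() from a small main() is enough to check
the equivalence.  Whether a longer ld1-based sequence actually beats a
single ld2 depends on the core: the message above reports that it does on
ThunderX, and this model says nothing about timing.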

commit: d38b4f21ec5baabb448cd9ffa078fa9150d54af2
author: DRC <information@libjpeg-turbo.org> Sat Jan 16 01:53:32 2016 -0600
committer: DRC <information@libjpeg-turbo.org> Sat Jan 16 02:39:02 2016 -0600
tree: 013c35d9228fac97bde0d8c66a14d1d0d22d075d
parent: e8aa5fa9349016c2eb5e05d01e16a8c47f7b68c8
diff --git a/ChangeLog.txt b/ChangeLog.txt
index 6f1660a..cb59c2e 100644
--- a/ChangeLog.txt
+++ b/ChangeLog.txt

@@ -73,12 +73,11 @@
 SIMD-accelerated Huffman encoding can be disabled by setting the
 JSIMD_NOHUFFENC environment variable to 1.
 
-[14] Completed the ARM 64-bit (ARMv8) NEON SIMD implementation.  64-bit ARM
-now has SIMD coverage for all of the algorithms that are covered in the 32-bit
-(ARMv7) implementation, except for h2v1 (4:2:2) fancy upsampling.
-Additionally, the ARM 64-bit SIMD implementation now accelerates the slow
-integer forward DCT and h2v2 & h2v1 downsampling algorithms, which are not
-accelerated in the 32-bit implementation.
+[14] Added ARM 64-bit (ARMv8) NEON SIMD implementations of the commonly-used
+compression algorithms (including the slow integer forward DCT and h2v2 & h2v1
+downsampling algorithms, which are not accelerated in the 32-bit NEON
+implementation.)  This speeds up the overall 64-bit compression performance by
+about 2x on ARMv8 processors.
 
 
 1.4.2