arm_compute v19.05
diff --git a/docs/00_introduction.dox b/docs/00_introduction.dox
index 8f74a75..cbfd456 100644
--- a/docs/00_introduction.dox
+++ b/docs/00_introduction.dox
@@ -73,6 +73,7 @@
 
 	.
 	├── arm_compute --> All the arm_compute headers
+	│   ├── graph.h --> Includes all the Graph headers at once.
 	│   ├── core
 	│   │   ├── CL
 	│   │   │   ├── CLKernelLibrary.h --> Manages all the OpenCL kernels compilation and caching, provides accessors for the OpenCL Context.
@@ -163,7 +164,6 @@
 	│   ├── graph_*.cpp --> Graph examples
 	│   ├── neoncl_*.cpp --> NEON / OpenCL interoperability examples
 	│   └── neon_*.cpp --> NEON examples
-	├── graph.h --> Includes all the Graph headers at once.
 	├── include
 	│   ├── CL
 	│   │   └── Khronos OpenCL C headers and C++ wrapper
@@ -208,8 +208,6 @@
 	│   │   └── Datasets for all the validation / benchmark tests, layer configurations for various networks, etc.
 	│   ├── framework
 	│   │   └── Boiler plate code for both validation and benchmark test suites (Command line parsers, instruments, output loggers, etc.)
-	│   ├── networks
-	│   │   └── Examples of how to instantiate networks.
 	│   └── validation --> Sources for validation
 	│       ├── Validation specific files
 	│       ├── fixtures
@@ -238,6 +236,74 @@
 
 @subsection S2_2_changelog Changelog
 
+v19.05 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - New Neon kernels / functions:
+    - @ref NEBatchToSpaceLayerKernel / @ref NEBatchToSpaceLayer
+    - @ref NEComplexPixelWiseMultiplicationKernel / @ref NEComplexPixelWiseMultiplication
+    - @ref NECropKernel / @ref NECropResize
+    - @ref NEDepthwiseConvolutionAssemblyDispatch
+    - @ref NEFFTDigitReverseKernel
+    - @ref NEFFTRadixStageKernel
+    - @ref NEFFTScaleKernel
+    - @ref NEGEMMLowpOffsetContributionOutputStageKernel
+    - @ref NEHeightConcatenateLayerKernel
+    - @ref NESpaceToBatchLayerKernel / @ref NESpaceToBatchLayer
+    - @ref NEFFT1D
+    - @ref NEFFT2D
+    - @ref NEFFTConvolutionLayer
+ - New OpenCL kernels / functions:
+    - @ref CLComplexPixelWiseMultiplicationKernel / @ref CLComplexPixelWiseMultiplication
+    - @ref CLCropKernel / @ref CLCropResize
+    - @ref CLDeconvolutionReshapeOutputKernel
+    - @ref CLFFTDigitReverseKernel
+    - @ref CLFFTRadixStageKernel
+    - @ref CLFFTScaleKernel
+    - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
+    - @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel
+    - @ref CLHeightConcatenateLayerKernel
+    - @ref CLDirectDeconvolutionLayer
+    - @ref CLFFT1D
+    - @ref CLFFT2D
+    - @ref CLFFTConvolutionLayer
+    - @ref CLGEMMDeconvolutionLayer
+ - New OpenGLES kernels / functions:
+    - @ref GCConcatenateLayer
+ - Deprecated functions/interfaces
+    - @ref GCDepthConcatenateLayer
+    - @ref NEWidthConcatenateLayer
+    - @ref NEDepthConcatenateLayer
+    - @ref CLWidthConcatenateLayer
+    - @ref CLDepthConcatenateLayer
+    - @ref CLGEMMInterleave4x4
+    - @ref CLGEMMTranspose1xW
+ - Support different quantization info in CLConcatLayer.
+ - Add checks for cases where different input/output quantization info is not supported.
+ - Allow tensors to have different quantization information.
+ - Add FP16 support checks.
+ - Fix output quantization of CLDepthwiseConv3x3 when activation is fused.
+ - New graph examples:
+     - graph_convolution
+     - graph_fully_connected
+     - graph_depthwise_convolution
+     - Deepspeech v0.4.1
+ - Add support for QASYMM8 in NEArithmeticSubtractionKernel.
+ - Add support for QASYMM8 in NEPixelWiseMultiplicationKernel.
+ - Add support for QASYMM8 in NEDeconvolution.
+ - Add support for DequantizationLayer for NEON/CL.
+ - Add support for dilation in CLDepthwiseConvolution.
+ - Fuse offset contribution with the output stage when we use NEGEMMLowpMatrixMultiplyCore.
+ - Optimize CLDeconvolution.
+ - Add StackLayer to the graph API.
+ - Add support for "reflect" padding mode in NEPad.
+ - Winograd 7x7 NHWC on OpenCL.
+ - Rework CL ML layers to run exclusively on CL.
+ - Support different quantization info in PoolingLayer.
+ - Implement and test import memory interfaces.
+ - Add new tests and remove old ones.
+ - Various clang-tidy fixes.
+
 v19.02 Public major release
  - Various bug fixes.
  - Various optimisations.
@@ -1041,7 +1107,7 @@
 Here is a guide to <a href="https://developer.android.com/ndk/guides/standalone_toolchain.html">create your Android standalone toolchains from the NDK</a>
 
 - Download the NDK r17b from here: https://developer.android.com/ndk/downloads/index.html
-- Make sure you have Python 2 installed on your machine.
+- Make sure you have Python 2.7 installed on your machine.
 - Generate the 32 and/or 64 toolchains by running the following commands:
 
 
@@ -1253,7 +1319,7 @@
 
The OpenCL tuner, a.k.a. CLTuner, is a module of Arm Compute Library that can improve the performance of the OpenCL kernels by tuning the Local-Workgroup-Size (LWS).
 The optimal LWS for each unique OpenCL kernel configuration is stored in a table. This table can be either imported or exported from/to a file.
-The OpenCL tuner performs a brute-force approach: it runs the same OpenCL kernel for a range of local workgroup sizes and keep the local workgroup size of the fastest run to use in subsequent calls to the kernel.
+The OpenCL tuner runs the same OpenCL kernel for a range of local workgroup sizes and keeps the local workgroup size of the fastest run for use in subsequent calls to the kernel. It supports three tuning modes with different trade-offs between the time taken to tune and the kernel execution time achieved with the best LWS found. In Exhaustive mode, it searches all the supported LWS values; this mode takes the longest time to tune and is the most likely to find the optimal LWS. Normal mode searches a subset of LWS values to yield a good approximation of the optimal LWS, and takes less time to tune than Exhaustive mode. Rapid mode takes the shortest time to tune and finds an LWS value that is at least as good as the default LWS value. The mode affects only the search for the optimal LWS and has no effect when the LWS value is imported from a file.
 In order for the performance numbers to be meaningful you must disable the GPU power management and set it to a fixed frequency for the entire duration of the tuning phase.
 
If you wish to know more about LWS and its important role in improving GPU cache utilization, we suggest having a look at the presentation "Even Faster CNNs: Exploring the New Class of Winograd Algorithms" available at the following link:
diff --git a/docs/01_library.dox b/docs/01_library.dox
index 67adf9c..359ca47 100644
--- a/docs/01_library.dox
+++ b/docs/01_library.dox
@@ -461,7 +461,7 @@
 
 When the @ref CLTuner is enabled ( Target = 2 for the graph examples), the first time an OpenCL kernel is executed the Compute Library will try to run it with a variety of LWS values and will remember which one performed best for subsequent runs. At the end of the run the @ref graph::Graph will try to save these tuning parameters to a file.
 
-However this process takes quite a lot of time, which is why it cannot be enabled all the time.
+However, this process takes quite a lot of time, which is why it cannot be enabled all the time. @ref CLTuner supports three tuning modes with different trade-offs between the time taken to tune and the kernel execution time achieved with the best LWS found. In Exhaustive mode, it searches all the supported LWS values; this mode takes the longest time to tune and is the most likely to find the optimal LWS. Normal mode searches a subset of LWS values to yield a good approximation of the optimal LWS, and takes less time to tune than Exhaustive mode. Rapid mode takes the shortest time to tune and finds an LWS value that is at least as good as the default LWS value. The mode affects only the search for the optimal LWS and has no effect when the LWS value is imported from a file.
 
 But, when the @ref CLTuner is disabled ( Target = 1 for the graph examples), the @ref graph::Graph will try to reload the file containing the tuning parameters, then for each executed kernel the Compute Library will use the fine tuned LWS if it was present in the file or use a default LWS value if it's not.
 
diff --git a/docs/05_functions_list.dox b/docs/05_functions_list.dox
index e82e472..7d6728d 100644
--- a/docs/05_functions_list.dox
+++ b/docs/05_functions_list.dox
@@ -67,6 +67,7 @@
         - @ref NEAccumulateSquared
         - @ref NEAccumulateWeighted
         - @ref NEActivationLayer
+        - @ref NEBatchToSpaceLayer
         - @ref NEBitwiseAnd
         - @ref NEBitwiseNot
         - @ref NEBitwiseOr
@@ -103,13 +104,16 @@
     - @ref NEArgMinMaxLayer
     - @ref NEBatchNormalizationLayer
     - @ref NECannyEdge
+    - @ref NEComplexPixelWiseMultiplication
     - @ref NEConcatenateLayer
     - @ref NEConvertFullyConnectedWeights
     - @ref NEConvolutionLayer
     - @ref NEConvolutionLayerReshapeWeights
     - @ref NEConvolutionSquare &lt;matrix_size&gt;
+    - @ref NECropResize
     - @ref NEDeconvolutionLayer
     - @ref NEDepthConcatenateLayer
+    - @ref NEDepthwiseConvolutionAssemblyDispatch
     - @ref NEDepthwiseConvolutionLayer
     - @ref NEDepthwiseConvolutionLayer3x3
     - @ref NEDepthwiseSeparableConvolutionLayer
@@ -118,6 +122,9 @@
     - @ref NEDirectConvolutionLayer
     - @ref NEEqualizeHistogram
     - @ref NEFastCorners
+    - @ref NEFFT1D
+    - @ref NEFFT2D
+    - @ref NEFFTConvolutionLayer
     - @ref NEFillBorder
     - @ref NEFullyConnectedLayer
     - @ref NEFuseBatchNormalization
@@ -159,6 +166,7 @@
     - @ref NESobel5x5
     - @ref NESobel7x7
     - @ref NESoftmaxLayer
+    - @ref NESpaceToBatchLayer
     - @ref NESplit
     - @ref NEStackLayer
     - @ref NEUnstack
@@ -172,10 +180,12 @@
     - @ref CLBatchNormalizationLayer
     - @ref CLBatchToSpaceLayer
     - @ref CLCannyEdge
+    - @ref CLComplexPixelWiseMultiplication
     - @ref CLConcatenateLayer
     - @ref CLConvolutionLayer
     - @ref CLConvolutionLayerReshapeWeights
     - @ref CLConvolutionSquare &lt;matrix_size&gt;
+    - @ref CLCropResize
     - @ref CLDeconvolutionLayer
     - @ref CLDeconvolutionLayerUpsample
     - @ref CLDepthConcatenateLayer
@@ -184,8 +194,12 @@
     - @ref CLDepthwiseSeparableConvolutionLayer
     - @ref CLDequantizationLayer
     - @ref CLDirectConvolutionLayer
+    - @ref CLDirectDeconvolutionLayer
     - @ref CLEqualizeHistogram
     - @ref CLFastCorners
+    - @ref CLFFT1D
+    - @ref CLFFT2D
+    - @ref CLFFTConvolutionLayer
     - @ref CLFullyConnectedLayer
     - @ref CLFuseBatchNormalization
     - @ref CLGaussian5x5
@@ -194,6 +208,7 @@
         - @ref CLGaussianPyramidOrb
     - @ref CLGEMM
     - @ref CLGEMMConvolutionLayer
+    - @ref CLGEMMDeconvolutionLayer
     - @ref CLGEMMLowpMatrixMultiplyCore
     - @ref CLGenerateProposalsLayer
     - @ref CLHarrisCorners
@@ -311,6 +326,7 @@
 
 - @ref IFunction
     - @ref GCBatchNormalizationLayer
+    - @ref GCConcatenateLayer
     - @ref GCConvolutionLayer
     - @ref GCConvolutionLayerReshapeWeights
     - @ref GCDepthConcatenateLayer
diff --git a/docs/Doxyfile b/docs/Doxyfile
index 081d497..a349016 100644
--- a/docs/Doxyfile
+++ b/docs/Doxyfile
@@ -38,7 +38,7 @@
 # could be handy for archiving the generated documentation or if some version
 # control system is used.
 
-PROJECT_NUMBER         = 19.02
+PROJECT_NUMBER         = 19.05
 
 # Using the PROJECT_BRIEF tag one can provide an optional one line description
 # for a project that appears at the top of each page and should give viewer a