arm_compute v17.04
diff --git a/documentation/index.xhtml b/documentation/index.xhtml
index 7eaa9ce..c9f1bd7 100644
--- a/documentation/index.xhtml
+++ b/documentation/index.xhtml
@@ -40,7 +40,7 @@
  <tr style="height: 56px;">
   <td style="padding-left: 0.5em;">
    <div id="projectname">ARM Compute Library
-   &#160;<span id="projectnumber">17.03.1</span>
+   &#160;<span id="projectnumber">17.04</span>
    </div>
   </td>
  </tr>
@@ -115,7 +115,10 @@
 </ul>
 </li>
 <li class="level1"><a href="#S1_file_organisation">File organisation</a></li>
-<li class="level1"><a href="#S2_versions_changelog">Versions changelog</a></li>
+<li class="level1"><a href="#S2_versions_changelog">Release versions and changelog</a><ul><li class="level2"><a href="#S2_1_versions">Release versions</a></li>
+<li class="level2"><a href="#S2_2_changelog">Changelog</a></li>
+</ul>
+</li>
 <li class="level1"><a href="#S3_how_to_build">How to build the library and the examples</a><ul><li class="level2"><a href="#S3_1_build_options">Build options</a></li>
 <li class="level2"><a href="#S3_2_linux">Linux</a><ul><li class="level3"><a href="#S3_2_1_library">How to build the library ?</a></li>
 <li class="level3"><a href="#S3_2_2_examples">How to manually build the examples ?</a></li>
@@ -145,7 +148,7 @@
 </li>
 <li class="level3"><a href="#S4_6_2_tensors">Tensors</a></li>
 <li class="level3"><a href="#S4_6_3_description_conventions">Images and Tensors description conventions</a></li>
-<li class="level3"><a href="#S4_6_4_working_with_objects">Working with Images and Tensors</a></li>
+<li class="level3"><a href="#S4_6_4_working_with_objects">Working with Images and Tensors using iterators</a></li>
 </ul>
 </li>
 </ul>
@@ -233,66 +236,36 @@
 └── test_helpers --&gt; Boiler plate code used by examples
     └── Utils.h
 </pre><h1><a class="anchor" id="S2_versions_changelog"></a>
-Versions changelog</h1>
-<dl class="section note"><dt>Note</dt><dd>There will be one major public release with new features per quarter. All releases in between will only contain bug fixes.</dd></dl>
-<p>v16.12 (Binary release)</p><ul>
-<li>Original release</li>
-</ul>
-<p>v17.02 (Sources)</p><ul>
-<li>New OpenCL kernels / functions:<ul>
-<li><a class="el" href="classarm__compute_1_1_c_l_activation_layer_kernel.xhtml">CLActivationLayerKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_activation_layer.xhtml">CLActivationLayer</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_channel_combine_kernel.xhtml">CLChannelCombineKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_channel_combine.xhtml">CLChannelCombine</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_derivative_kernel.xhtml">CLDerivativeKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_channel_extract.xhtml">CLChannelExtract</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_fast_corners_kernel.xhtml">CLFastCornersKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_fast_corners.xhtml">CLFastCorners</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_mean_std_dev_kernel.xhtml">CLMeanStdDevKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_mean_std_dev.xhtml">CLMeanStdDev</a></li>
-</ul>
-</li>
-<li>New NEON kernels / functions:<ul>
-<li><a class="el" href="classarm__compute_1_1_h_o_g.xhtml" title="CPU implementation of HOG data-object. ">HOG</a> / SVM: <a class="el" href="classarm__compute_1_1_n_e_h_o_g_orientation_binning_kernel.xhtml">NEHOGOrientationBinningKernel</a>, <a class="el" href="classarm__compute_1_1_n_e_h_o_g_block_normalization_kernel.xhtml">NEHOGBlockNormalizationKernel</a>, <a class="el" href="classarm__compute_1_1_n_e_h_o_g_detector_kernel.xhtml">NEHOGDetectorKernel</a>, <a class="el" href="classarm__compute_1_1_n_e_h_o_g_non_maxima_suppression_kernel.xhtml">NEHOGNonMaximaSuppressionKernel</a> / <a class="el" href="classarm__compute_1_1_n_e_h_o_g_descriptor.xhtml">NEHOGDescriptor</a>, <a class="el" href="classarm__compute_1_1_n_e_h_o_g_detector.xhtml">NEHOGDetector</a>, <a class="el" href="classarm__compute_1_1_n_e_h_o_g_gradient.xhtml">NEHOGGradient</a>, <a class="el" href="classarm__compute_1_1_n_e_h_o_g_multi_detection.xhtml">NEHOGMultiDetection</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_non_linear_filter_kernel.xhtml">NENonLinearFilterKernel</a> / <a class="el" href="classarm__compute_1_1_n_e_non_linear_filter.xhtml">NENonLinearFilter</a></li>
-</ul>
-</li>
-<li>Introduced a <a class="el" href="classarm__compute_1_1_c_l_scheduler.xhtml" title="Provides global access to a CL context and command queue. ">CLScheduler</a> to manage the default context and command queue used by the runtime library and create synchronisation events.</li>
-<li>Switched all the kernels / functions to use tensors instead of images.</li>
-<li>Updated documentation to include instructions to build the library from sources.</li>
-</ul>
-<p>v17.02.1 (Sources)</p><ul>
-<li>New OpenCL kernels / functions:<ul>
-<li><a class="el" href="classarm__compute_1_1_c_l_logits1_d_max_kernel.xhtml">CLLogits1DMaxKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_logits1_d_shift_exp_sum_kernel.xhtml">CLLogits1DShiftExpSumKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_logits1_d_norm_kernel.xhtml">CLLogits1DNormKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_softmax_layer.xhtml">CLSoftmaxLayer</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_pooling_layer_kernel.xhtml">CLPoolingLayerKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_pooling_layer.xhtml">CLPoolingLayer</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_im2_col_kernel.xhtml">CLIm2ColKernel</a> <a class="el" href="classarm__compute_1_1_c_l_col2_im_kernel.xhtml">CLCol2ImKernel</a> <a class="el" href="classarm__compute_1_1_c_l_convolution_layer_weights_reshape_kernel.xhtml">CLConvolutionLayerWeightsReshapeKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_convolution_layer.xhtml">CLConvolutionLayer</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_remap_kernel.xhtml">CLRemapKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_remap.xhtml">CLRemap</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_gaussian_pyramid_hor_kernel.xhtml">CLGaussianPyramidHorKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_gaussian_pyramid_vert_kernel.xhtml">CLGaussianPyramidVertKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_gaussian_pyramid.xhtml">CLGaussianPyramid</a>, <a class="el" href="classarm__compute_1_1_c_l_gaussian_pyramid_half.xhtml">CLGaussianPyramidHalf</a>, <a class="el" href="classarm__compute_1_1_c_l_gaussian_pyramid_orb.xhtml">CLGaussianPyramidOrb</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_min_max_kernel.xhtml">CLMinMaxKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_min_max_location_kernel.xhtml">CLMinMaxLocationKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_min_max_location.xhtml">CLMinMaxLocation</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_non_linear_filter_kernel.xhtml">CLNonLinearFilterKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_non_linear_filter.xhtml">CLNonLinearFilter</a></li>
-</ul>
-</li>
-<li>New NEON FP16 kernels (Requires armv8.2 CPU)<ul>
-<li><a class="el" href="classarm__compute_1_1_n_e_accumulate_weighted_f_p16_kernel.xhtml">NEAccumulateWeightedFP16Kernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_box3x3_f_p16_kernel.xhtml">NEBox3x3FP16Kernel</a></li>
+Release versions and changelog</h1>
+<h2><a class="anchor" id="S2_1_versions"></a>
+Release versions</h2>
+<p>All releases are numbered vYY.MM, where YY is the last two digits of the year and MM is the month number. If there is more than one release in a month, an extra sequential number is appended at the end: </p><pre class="fragment">v17.03 (First release of March 2017)
+v17.03.1 (Second release of March 2017)
+v17.04 (First release of April 2017)
+</pre><dl class="section note"><dt>Note</dt><dd>We aim to publish one major public release with new features per quarter. All releases in between will contain bug fixes only.</dd></dl>
+<h2><a class="anchor" id="S2_2_changelog"></a>
+Changelog</h2>
+<p>v17.04 Public bug fix release. The following kernels have been ported to use the new accurate padding:</p><ul>
+<li><a class="el" href="classarm__compute_1_1_c_l_color_convert_kernel.xhtml">CLColorConvertKernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_edge_non_max_suppression_kernel.xhtml">CLEdgeNonMaxSuppressionKernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_edge_trace_kernel.xhtml">CLEdgeTraceKernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_gaussian_pyramid_hor_kernel.xhtml">CLGaussianPyramidHorKernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_gaussian_pyramid_vert_kernel.xhtml">CLGaussianPyramidVertKernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_gradient_kernel.xhtml">CLGradientKernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_channel_combine_kernel.xhtml">NEChannelCombineKernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_fill_array_kernel.xhtml">NEFillArrayKernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_gaussian_pyramid_hor_kernel.xhtml">NEGaussianPyramidHorKernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_gaussian_pyramid_vert_kernel.xhtml">NEGaussianPyramidVertKernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_harris_score_f_p16_kernel.xhtml">NEHarrisScoreFP16Kernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_harris_score_kernel.xhtml">NEHarrisScoreKernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_h_o_g_detector_kernel.xhtml">NEHOGDetectorKernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_logits1_d_max_kernel.xhtml">NELogits1DMaxKernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_logits1_d_shift_exp_sum_kernel.xhtml">NELogits1DShiftExpSumKernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_logits1_d_norm_kernel.xhtml">NELogits1DNormKernel</a></li>
 <li><a class="el" href="classarm__compute_1_1_n_e_non_maxima_suppression3x3_f_p16_kernel.xhtml">NENonMaximaSuppression3x3FP16Kernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_non_maxima_suppression3x3_kernel.xhtml">NENonMaximaSuppression3x3Kernel</a></li>
 </ul>
-</li>
-</ul>
-<p>v17.03 (Sources)</p><ul>
-<li>New OpenCL kernels / functions:<ul>
-<li><a class="el" href="classarm__compute_1_1_c_l_gradient_kernel.xhtml">CLGradientKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_edge_non_max_suppression_kernel.xhtml">CLEdgeNonMaxSuppressionKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_edge_trace_kernel.xhtml">CLEdgeTraceKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_canny_edge.xhtml">CLCannyEdge</a></li>
-<li>GEMM refactoring + FP16 support: <a class="el" href="classarm__compute_1_1_c_l_g_e_m_m_interleave4x4_kernel.xhtml">CLGEMMInterleave4x4Kernel</a>, <a class="el" href="classarm__compute_1_1_c_l_g_e_m_m_transpose1x_w_kernel.xhtml">CLGEMMTranspose1xWKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_g_e_m_m_matrix_multiply_kernel.xhtml">CLGEMMMatrixMultiplyKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_g_e_m_m_matrix_addition_kernel.xhtml">CLGEMMMatrixAdditionKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_g_e_m_m.xhtml">CLGEMM</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_g_e_m_m_matrix_accumulate_biases_kernel.xhtml">CLGEMMMatrixAccumulateBiasesKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_fully_connected_layer.xhtml">CLFullyConnectedLayer</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_transpose_kernel.xhtml">CLTransposeKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_transpose.xhtml">CLTranspose</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_l_k_tracker_init_kernel.xhtml">CLLKTrackerInitKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_l_k_tracker_stage0_kernel.xhtml">CLLKTrackerStage0Kernel</a>, <a class="el" href="classarm__compute_1_1_c_l_l_k_tracker_stage1_kernel.xhtml">CLLKTrackerStage1Kernel</a>, <a class="el" href="classarm__compute_1_1_c_l_l_k_tracker_finalize_kernel.xhtml">CLLKTrackerFinalizeKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_optical_flow.xhtml">CLOpticalFlow</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_normalization_layer_kernel.xhtml">CLNormalizationLayerKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_normalization_layer.xhtml">CLNormalizationLayer</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_laplacian_pyramid.xhtml">CLLaplacianPyramid</a>, <a class="el" href="classarm__compute_1_1_c_l_laplacian_reconstruct.xhtml">CLLaplacianReconstruct</a></li>
-</ul>
-</li>
-<li>New NEON kernels / functions:<ul>
-<li><a class="el" href="classarm__compute_1_1_n_e_activation_layer_kernel.xhtml">NEActivationLayerKernel</a> / <a class="el" href="classarm__compute_1_1_n_e_activation_layer.xhtml">NEActivationLayer</a></li>
-<li>GEMM refactoring + FP16 support (Requires armv8.2 CPU): <a class="el" href="classarm__compute_1_1_n_e_g_e_m_m_interleave4x4_kernel.xhtml">NEGEMMInterleave4x4Kernel</a>, <a class="el" href="classarm__compute_1_1_n_e_g_e_m_m_transpose1x_w_kernel.xhtml">NEGEMMTranspose1xWKernel</a>, <a class="el" href="classarm__compute_1_1_n_e_g_e_m_m_matrix_multiply_kernel.xhtml">NEGEMMMatrixMultiplyKernel</a>, <a class="el" href="classarm__compute_1_1_n_e_g_e_m_m_matrix_addition_kernel.xhtml">NEGEMMMatrixAdditionKernel</a> / <a class="el" href="classarm__compute_1_1_n_e_g_e_m_m.xhtml">NEGEMM</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_pooling_layer_kernel.xhtml">NEPoolingLayerKernel</a> / <a class="el" href="classarm__compute_1_1_n_e_pooling_layer.xhtml">NEPoolingLayer</a></li>
-</ul>
-</li>
-</ul>
-<p>v17.03.1 (Sources)</p><ul>
+<p>v17.03.1 First Major public release of the sources</p><ul>
 <li>Renamed the library to <a class="el" href="namespacearm__compute.xhtml">arm_compute</a></li>
 <li>New CPP target introduced for C++ kernels shared between NEON and CL functions.</li>
 <li>New padding calculation interface introduced and ported most kernels / functions to use it.</li>
@@ -310,6 +283,63 @@
 </ul>
 </li>
 </ul>
+<p>v17.03 Sources preview</p><ul>
+<li>New OpenCL kernels / functions:<ul>
+<li><a class="el" href="classarm__compute_1_1_c_l_gradient_kernel.xhtml">CLGradientKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_edge_non_max_suppression_kernel.xhtml">CLEdgeNonMaxSuppressionKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_edge_trace_kernel.xhtml">CLEdgeTraceKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_canny_edge.xhtml">CLCannyEdge</a></li>
+<li>GEMM refactoring + FP16 support: <a class="el" href="classarm__compute_1_1_c_l_g_e_m_m_interleave4x4_kernel.xhtml">CLGEMMInterleave4x4Kernel</a>, <a class="el" href="classarm__compute_1_1_c_l_g_e_m_m_transpose1x_w_kernel.xhtml">CLGEMMTranspose1xWKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_g_e_m_m_matrix_multiply_kernel.xhtml">CLGEMMMatrixMultiplyKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_g_e_m_m_matrix_addition_kernel.xhtml">CLGEMMMatrixAdditionKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_g_e_m_m.xhtml">CLGEMM</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_g_e_m_m_matrix_accumulate_biases_kernel.xhtml">CLGEMMMatrixAccumulateBiasesKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_fully_connected_layer.xhtml">CLFullyConnectedLayer</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_transpose_kernel.xhtml">CLTransposeKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_transpose.xhtml">CLTranspose</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_l_k_tracker_init_kernel.xhtml">CLLKTrackerInitKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_l_k_tracker_stage0_kernel.xhtml">CLLKTrackerStage0Kernel</a>, <a class="el" href="classarm__compute_1_1_c_l_l_k_tracker_stage1_kernel.xhtml">CLLKTrackerStage1Kernel</a>, <a class="el" href="classarm__compute_1_1_c_l_l_k_tracker_finalize_kernel.xhtml">CLLKTrackerFinalizeKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_optical_flow.xhtml">CLOpticalFlow</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_normalization_layer_kernel.xhtml">CLNormalizationLayerKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_normalization_layer.xhtml">CLNormalizationLayer</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_laplacian_pyramid.xhtml">CLLaplacianPyramid</a>, <a class="el" href="classarm__compute_1_1_c_l_laplacian_reconstruct.xhtml">CLLaplacianReconstruct</a></li>
+</ul>
+</li>
+<li>New NEON kernels / functions:<ul>
+<li><a class="el" href="classarm__compute_1_1_n_e_activation_layer_kernel.xhtml">NEActivationLayerKernel</a> / <a class="el" href="classarm__compute_1_1_n_e_activation_layer.xhtml">NEActivationLayer</a></li>
+<li>GEMM refactoring + FP16 support (Requires armv8.2 CPU): <a class="el" href="classarm__compute_1_1_n_e_g_e_m_m_interleave4x4_kernel.xhtml">NEGEMMInterleave4x4Kernel</a>, <a class="el" href="classarm__compute_1_1_n_e_g_e_m_m_transpose1x_w_kernel.xhtml">NEGEMMTranspose1xWKernel</a>, <a class="el" href="classarm__compute_1_1_n_e_g_e_m_m_matrix_multiply_kernel.xhtml">NEGEMMMatrixMultiplyKernel</a>, <a class="el" href="classarm__compute_1_1_n_e_g_e_m_m_matrix_addition_kernel.xhtml">NEGEMMMatrixAdditionKernel</a> / <a class="el" href="classarm__compute_1_1_n_e_g_e_m_m.xhtml">NEGEMM</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_pooling_layer_kernel.xhtml">NEPoolingLayerKernel</a> / <a class="el" href="classarm__compute_1_1_n_e_pooling_layer.xhtml">NEPoolingLayer</a></li>
+</ul>
+</li>
+</ul>
+<p>v17.02.1 Sources preview</p><ul>
+<li>New OpenCL kernels / functions:<ul>
+<li><a class="el" href="classarm__compute_1_1_c_l_logits1_d_max_kernel.xhtml">CLLogits1DMaxKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_logits1_d_shift_exp_sum_kernel.xhtml">CLLogits1DShiftExpSumKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_logits1_d_norm_kernel.xhtml">CLLogits1DNormKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_softmax_layer.xhtml">CLSoftmaxLayer</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_pooling_layer_kernel.xhtml">CLPoolingLayerKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_pooling_layer.xhtml">CLPoolingLayer</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_im2_col_kernel.xhtml">CLIm2ColKernel</a> <a class="el" href="classarm__compute_1_1_c_l_col2_im_kernel.xhtml">CLCol2ImKernel</a> <a class="el" href="classarm__compute_1_1_c_l_convolution_layer_weights_reshape_kernel.xhtml">CLConvolutionLayerWeightsReshapeKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_convolution_layer.xhtml">CLConvolutionLayer</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_remap_kernel.xhtml">CLRemapKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_remap.xhtml">CLRemap</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_gaussian_pyramid_hor_kernel.xhtml">CLGaussianPyramidHorKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_gaussian_pyramid_vert_kernel.xhtml">CLGaussianPyramidVertKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_gaussian_pyramid.xhtml">CLGaussianPyramid</a>, <a class="el" href="classarm__compute_1_1_c_l_gaussian_pyramid_half.xhtml">CLGaussianPyramidHalf</a>, <a class="el" href="classarm__compute_1_1_c_l_gaussian_pyramid_orb.xhtml">CLGaussianPyramidOrb</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_min_max_kernel.xhtml">CLMinMaxKernel</a>, <a class="el" href="classarm__compute_1_1_c_l_min_max_location_kernel.xhtml">CLMinMaxLocationKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_min_max_location.xhtml">CLMinMaxLocation</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_non_linear_filter_kernel.xhtml">CLNonLinearFilterKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_non_linear_filter.xhtml">CLNonLinearFilter</a></li>
+</ul>
+</li>
+<li>New NEON FP16 kernels (Requires armv8.2 CPU)<ul>
+<li><a class="el" href="classarm__compute_1_1_n_e_accumulate_weighted_f_p16_kernel.xhtml">NEAccumulateWeightedFP16Kernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_box3x3_f_p16_kernel.xhtml">NEBox3x3FP16Kernel</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_non_maxima_suppression3x3_f_p16_kernel.xhtml">NENonMaximaSuppression3x3FP16Kernel</a></li>
+</ul>
+</li>
+</ul>
+<p>v17.02 Sources preview</p><ul>
+<li>New OpenCL kernels / functions:<ul>
+<li><a class="el" href="classarm__compute_1_1_c_l_activation_layer_kernel.xhtml">CLActivationLayerKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_activation_layer.xhtml">CLActivationLayer</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_channel_combine_kernel.xhtml">CLChannelCombineKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_channel_combine.xhtml">CLChannelCombine</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_derivative_kernel.xhtml">CLDerivativeKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_channel_extract.xhtml">CLChannelExtract</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_fast_corners_kernel.xhtml">CLFastCornersKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_fast_corners.xhtml">CLFastCorners</a></li>
+<li><a class="el" href="classarm__compute_1_1_c_l_mean_std_dev_kernel.xhtml">CLMeanStdDevKernel</a> / <a class="el" href="classarm__compute_1_1_c_l_mean_std_dev.xhtml">CLMeanStdDev</a></li>
+</ul>
+</li>
+<li>New NEON kernels / functions:<ul>
+<li><a class="el" href="classarm__compute_1_1_h_o_g.xhtml" title="CPU implementation of HOG data-object. ">HOG</a> / SVM: <a class="el" href="classarm__compute_1_1_n_e_h_o_g_orientation_binning_kernel.xhtml">NEHOGOrientationBinningKernel</a>, <a class="el" href="classarm__compute_1_1_n_e_h_o_g_block_normalization_kernel.xhtml">NEHOGBlockNormalizationKernel</a>, <a class="el" href="classarm__compute_1_1_n_e_h_o_g_detector_kernel.xhtml">NEHOGDetectorKernel</a>, <a class="el" href="classarm__compute_1_1_n_e_h_o_g_non_maxima_suppression_kernel.xhtml">NEHOGNonMaximaSuppressionKernel</a> / <a class="el" href="classarm__compute_1_1_n_e_h_o_g_descriptor.xhtml">NEHOGDescriptor</a>, <a class="el" href="classarm__compute_1_1_n_e_h_o_g_detector.xhtml">NEHOGDetector</a>, <a class="el" href="classarm__compute_1_1_n_e_h_o_g_gradient.xhtml">NEHOGGradient</a>, <a class="el" href="classarm__compute_1_1_n_e_h_o_g_multi_detection.xhtml">NEHOGMultiDetection</a></li>
+<li><a class="el" href="classarm__compute_1_1_n_e_non_linear_filter_kernel.xhtml">NENonLinearFilterKernel</a> / <a class="el" href="classarm__compute_1_1_n_e_non_linear_filter.xhtml">NENonLinearFilter</a></li>
+</ul>
+</li>
+<li>Introduced a <a class="el" href="classarm__compute_1_1_c_l_scheduler.xhtml" title="Provides global access to a CL context and command queue. ">CLScheduler</a> to manage the default context and command queue used by the runtime library and create synchronisation events.</li>
+<li>Switched all the kernels / functions to use tensors instead of images.</li>
+<li>Updated documentation to include instructions to build the library from sources.</li>
+</ul>
+<p>v16.12 Binary preview release</p><ul>
+<li>Original release</li>
+</ul>
 <h1><a class="anchor" id="S3_how_to_build"></a>
 How to build the library and the examples</h1>
 <h2><a class="anchor" id="S3_1_build_options"></a>
@@ -322,7 +352,7 @@
     default: 0
     actual: 0
 
-arch: Target Architecture (default=armv7a) (armv7a|arm64-v8a|arm64-v8.2-a|x86)
+arch: Target Architecture (default=armv7a) (armv7a|arm64-v8a|arm64-v8.2-a|x86_32|x86_64)
     default: armv7a
     actual: armv7a
 
@@ -349,17 +379,32 @@
 embed_kernels: Embed OpenCL kernels in library binary(Default=0) (0|1)
     default: 0
     actual: 0
+
+scheduler: Scheduler backend(Default=cpp) (cpp|pthread|openmp)
+    default: cpp
+    actual: cpp
+
+set_soname: Set the library's soname and shlibversion (Requires SCons 2.4 or above) (yes|no)
+    default: 0
+    actual: False
+
+extra_cxx_flags: Extra CXX flags to be appended to the build command
+    default:
+    actual:
 </pre><p>Debug / asserts:</p><ul>
 <li>With debug=1 asserts are enabled, and the library is built with symbols and no optimisations enabled.</li>
 <li>With debug=0 and asserts=1: Optimisations are enabled and symbols are removed, however all the asserts are still present (This is about 20% slower than the release build)</li>
 <li>With debug=0 and asserts=0: All optimisations are enable and no validation is performed, if the application misuses the library it is likely to result in a crash. (Only use this mode once you are sure your application is working as expected).</li>
 </ul>
-<p>Architecture: The x86 target can only be used with neon=0 and opencl=1.</p>
+<p>Architecture: The x86_32 and x86_64 targets can only be used with neon=0 and opencl=1.</p>
 <p>OS: Choose the operating system you are targeting: Linux, Android or bare metal. </p><dl class="section note"><dt>Note</dt><dd>bare metal can only be used for NEON (not OpenCL), only static libraries get built and NEON's multi-threading support is disabled.</dd></dl>
 <p>Build type: you can either build directly on your device (native) or cross compile from your desktop machine (cross-compile). In both cases make sure the compiler is available in your path.</p>
 <p>Werror: If you are compiling using the same toolchains as the ones used in this guide then there shouldn't be any warning and therefore you should be able to keep Werror=1. If with a different compiler version the library fails to build because of warnings interpreted as errors then, if you are sure the warnings are not important, you might want to try to build with Werror=0 (But please do report the issue either on Github or by an email to <a href="#" onclick="location.href='mai'+'lto:'+'dev'+'el'+'ope'+'r@'+'arm'+'.c'+'om'; return false;">devel<span style="display: none;">.nosp@m.</span>oper<span style="display: none;">.nosp@m.</span>@arm.<span style="display: none;">.nosp@m.</span>com</a> so that the issue can be addressed).</p>
-<p>OpenCL / NEON: Choose which SIMD technology you are interested targeting. (NEON for ARM Cortex-A CPUs or OpenCL for ARM Mali GPUs)</p>
+<p>OpenCL / NEON: Choose which SIMD technology you want to target. (NEON for ARM Cortex-A CPUs or OpenCL for ARM Mali GPUs)</p>
 <p>embed_kernels: For OpenCL only: set embed_kernels=1 if you want the OpenCL kernels to be built in the library's binaries instead of being read from separate ".cl" files. If embed_kernels is set to 0 then the application can set the path to the folder containing the OpenCL kernel files by calling <a class="el" href="classarm__compute_1_1_c_l_kernel_library.xhtml#af353532ea782387df6bcb6d01894f4ae" title="Initialises the kernel library. ">CLKernelLibrary::init()</a>. By default the path is set to "./cl_kernels".</p>
+<p>set_soname: Set this option to build the versioned library. If enabled, the library will contain a SONAME and SHLIBVERSION, and symlinks will automatically be created between the objects. Example: libarm_compute_core.so -&gt; libarm_compute_core.so.1.0.0, libarm_compute_core.so.1 -&gt; libarm_compute_core.so.1.0.0, and the library file itself, libarm_compute_core.so.1.0.0.</p>
+<dl class="section note"><dt>Note</dt><dd>This option is disabled by default as it requires SCons version 2.4 or above.</dd></dl>
+<p>extra_cxx_flags: Custom CXX flags which will be appended to the end of the build command.</p>
 <h2><a class="anchor" id="S3_2_linux"></a>
 Linux</h2>
 <h3><a class="anchor" id="S3_2_1_library"></a>
@@ -368,7 +413,14 @@
 <dl class="section note"><dt>Note</dt><dd>If you are building with opencl=1 then scons will expect to find libOpenCL.so either in the current directory or in "build" (See the section below if you need a stub OpenCL library to link against)</dd></dl>
 <p>To cross-compile the library in debug mode, with NEON only support, for Linux 32bit: </p><pre class="fragment">scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=linux arch=armv7a
 </pre><p>To cross-compile the library in asserts mode, with OpenCL only support, for Linux 64bit: </p><pre class="fragment">scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a
-</pre><h3><a class="anchor" id="S3_2_2_examples"></a>
+</pre><p>You can also compile the library natively on an ARM device by using <b>build=native</b>: </p><pre class="fragment">scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=arm64-v8a build=native
+scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a build=native
+</pre><dl class="section note"><dt>Note</dt><dd>G++ for ARM is mono-arch, therefore if you want to compile for Linux 32bit on a Linux 64bit platform you will have to use a cross compiler.</dd></dl>
+<p>For example, on a 64bit Debian based system you would have to install <b>g++-arm-linux-gnueabihf</b>: </p><pre class="fragment">apt-get install g++-arm-linux-gnueabihf
+</pre><p>Then run </p><pre class="fragment">scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a build=cross_compile
+</pre><p>or simply remove the build parameter as build=cross_compile is the default value: </p><pre class="fragment">scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a
+</pre><dl class="section attention"><dt>Attention</dt><dd>To cross compile with opencl=1 you need to make sure to have a version of libOpenCL matching your target architecture.</dd></dl>
+<h3><a class="anchor" id="S3_2_2_examples"></a>
 How to manually build the examples ?</h3>
 <p>The examples get automatically built by scons as part of the build process of the library described above. This section just describes how you can build and link your own application against our library.</p>
 <dl class="section note"><dt>Note</dt><dd>The following command lines assume the <a class="el" href="namespacearm__compute.xhtml">arm_compute</a> binaries are present in the current directory or in the system library path.</dd></dl>
@@ -478,7 +530,8 @@
 <div class="fragment"><div class="line"><span class="comment">//Create a kernel object:</span></div><div class="line">MyKernel kernel;</div><div class="line"><span class="comment">// Initialize the kernel with the input/output and options you want to use:</span></div><div class="line">kernel.configure( input, output, option0, option1);</div><div class="line"><span class="comment">// Retrieve the execution window of the kernel:</span></div><div class="line"><span class="keyword">const</span> Window&amp; max_window = kernel.window();</div><div class="line"><span class="comment">// Run the whole kernel in the current thread:</span></div><div class="line">kernel.run( max_window ); <span class="comment">// Run the kernel on the full window</span></div></div><!-- fragment --><h3><a class="anchor" id="S4_2_3"></a>
 Multi-threading</h3>
 <p>The previous section shows how to run a NEON / CPP kernel in the current thread, however if your system has several CPU cores, you will probably want the kernel to use several cores. Here is how this can be done:</p>
-<div class="fragment"><div class="line">    <span class="keyword">const</span> Window &amp;max_window     = kernel-&gt;window();</div><div class="line">    <span class="keyword">const</span> <span class="keywordtype">int</span>     num_iterations = max_window.num_iterations(split_dimension);</div><div class="line">    <span class="keywordtype">int</span>           num_threads    = std::min(num_iterations, _num_threads);</div><div class="line"></div><div class="line">    <span class="keywordflow">if</span>(!kernel-&gt;is_parallelisable() || 1 == num_threads)</div><div class="line">    {</div><div class="line">        kernel-&gt;run(max_window);</div><div class="line">    }</div><div class="line"><span class="preprocessor">#ifndef NO_MULTI_THREADING</span></div><div class="line">    <span class="keywordflow">else</span></div><div class="line">    {</div><div class="line">        <span class="keywordflow">for</span>(<span class="keywordtype">int</span> t = 0; t &lt; num_threads; ++t)</div><div class="line">        {</div><div class="line">            Window win = max_window.split_window(split_dimension, t, num_threads);</div><div class="line">            win.set_thread_id(t);</div><div class="line">            win.set_num_threads(num_threads);</div><div class="line"></div><div class="line">            <span class="keywordflow">if</span>(t != num_threads - 1)</div><div class="line">            {</div><div class="line">                _threads[t].start(kernel, win);</div><div class="line">            }</div><div class="line">            <span class="keywordflow">else</span></div><div class="line">            {</div><div class="line">                kernel-&gt;run(win);</div><div class="line">            }</div><div class="line">        }</div><div class="line"></div><div class="line">        <span class="keywordflow">try</span></div><div class="line">        {</div><div class="line">            <span class="keywordflow">for</span>(<span class="keywordtype">int</span> t 
= 1; t &lt; num_threads; ++t)</div><div class="line">            {</div><div class="line">                _threads[t - 1].wait();</div><div class="line">            }</div><div class="line">        }</div><div class="line">        <span class="keywordflow">catch</span>(<span class="keyword">const</span> std::system_error &amp;e)</div><div class="line">        {</div><div class="line">            std::cout &lt;&lt; <span class="stringliteral">&quot;Caught system_error with code &quot;</span> &lt;&lt; e.code() &lt;&lt; <span class="stringliteral">&quot; meaning &quot;</span> &lt;&lt; e.what() &lt;&lt; <span class="charliteral">&#39;\n&#39;</span>;</div><div class="line">        }</div><div class="line">    }</div><div class="line"><span class="preprocessor">#endif </span><span class="comment">/* NO_MULTI_THREADING */</span><span class="preprocessor"></span></div></div><!-- fragment --><p> This is the very basic implementation used in the NEON runtime library by all the NEON functions, </p><dl class="section see"><dt>See also</dt><dd><a class="el" href="classarm__compute_1_1_c_p_p_scheduler.xhtml" title="Pool of threads to automatically split a kernel&#39;s execution among several threads. ">CPPScheduler</a>.</dd></dl>
+<div class="fragment"><div class="line">    <span class="keyword">const</span> Window &amp;max_window     = kernel-&gt;window();</div><div class="line">    <span class="keyword">const</span> <span class="keywordtype">int</span>     num_iterations = max_window.num_iterations(split_dimension);</div><div class="line">    <span class="keywordtype">int</span>           num_threads    = std::min(num_iterations, _num_threads);</div><div class="line"></div><div class="line">    <span class="keywordflow">if</span>(!kernel-&gt;is_parallelisable() || 1 == num_threads)</div><div class="line">    {</div><div class="line">        kernel-&gt;run(max_window);</div><div class="line">    }</div><div class="line"><span class="preprocessor">#ifndef NO_MULTI_THREADING</span></div><div class="line">    <span class="keywordflow">else</span></div><div class="line">    {</div><div class="line">        <span class="keywordflow">for</span>(<span class="keywordtype">int</span> t = 0; t &lt; num_threads; ++t)</div><div class="line">        {</div><div class="line">            Window win = max_window.split_window(split_dimension, t, num_threads);</div><div class="line">            win.set_thread_id(t);</div><div class="line">            win.set_num_threads(num_threads);</div><div class="line"></div><div class="line">            <span class="keywordflow">if</span>(t != num_threads - 1)</div><div class="line">            {</div><div class="line">                _threads[t].start(kernel, win);</div><div class="line">            }</div><div class="line">            <span class="keywordflow">else</span></div><div class="line">            {</div><div class="line">                kernel-&gt;run(win);</div><div class="line">            }</div><div class="line">        }</div><div class="line"></div><div class="line">        <span class="keywordflow">try</span></div><div class="line">        {</div><div class="line">            <span class="keywordflow">for</span>(<span class="keywordtype">int</span> t 
= 1; t &lt; num_threads; ++t)</div><div class="line">            {</div><div class="line">                _threads[t - 1].wait();</div><div class="line">            }</div><div class="line">        }</div><div class="line">        <span class="keywordflow">catch</span>(<span class="keyword">const</span> std::system_error &amp;e)</div><div class="line">        {</div><div class="line">            std::cout &lt;&lt; <span class="stringliteral">&quot;Caught system_error with code &quot;</span> &lt;&lt; e.code() &lt;&lt; <span class="stringliteral">&quot; meaning &quot;</span> &lt;&lt; e.what() &lt;&lt; <span class="charliteral">&#39;\n&#39;</span>;</div><div class="line">        }</div><div class="line">    }</div><div class="line"><span class="preprocessor">#endif </span><span class="comment">/* NO_MULTI_THREADING */</span><span class="preprocessor"></span></div></div><!-- fragment --><p> This is the very basic implementation used in the NEON runtime library by all the NEON functions.</p>
+<dl class="section see"><dt>See also</dt><dd><a class="el" href="classarm__compute_1_1_c_p_p_scheduler.xhtml" title="Pool of threads to automatically split a kernel&#39;s execution among several threads. ">CPPScheduler</a>.</dd></dl>
 <dl class="section note"><dt>Note</dt><dd>Some kernels like for example <a class="el" href="classarm__compute_1_1_n_e_histogram_kernel.xhtml">NEHistogramKernel</a> need some local temporary buffer to perform their calculations. In order to avoid memory corruption between threads, the local buffer must be of size: <code>memory_needed_per_thread * num_threads</code> and each subwindow must be initialised by calling <a class="el" href="classarm__compute_1_1_window.xhtml#a50ee380d076dd9ce06a35a76903f8b7b">Window::set_thread_id()</a> with a unique thread_id between 0 and num_threads.</dd></dl>
 <h3><a class="anchor" id="S4_2_4"></a>
 Functions</h3>
@@ -524,42 +577,23 @@
 </ul>
 <div class="fragment"><div class="line">    PPMLoader ppm;</div><div class="line">    <a class="code" href="struct_image.xhtml">Image</a>     src, tmp, dst;</div><div class="line"></div><div class="line">    <span class="keywordflow">if</span>(argc &lt; 2)</div><div class="line">    {</div><div class="line">        <span class="comment">// Print help</span></div><div class="line">        std::cout &lt;&lt; <span class="stringliteral">&quot;Usage: ./build/neon_convolution [input_image.ppm]\n\n&quot;</span>;</div><div class="line">        std::cout &lt;&lt; <span class="stringliteral">&quot;No input_image provided, creating a dummy 640x480 image\n&quot;</span>;</div><div class="line">        <span class="comment">// Initialize just the dimensions and format of your buffers:</span></div><div class="line">        src.allocator()-&gt;init(TensorInfo(640, 480, <a class="code" href="namespacearm__compute.xhtml#ab4e88c89b3b7ea1735996cc4def22d58a6669348b484e3008dca2bfa8e85e40b5">Format::U8</a>));</div><div class="line">    }</div><div class="line">    <span class="keywordflow">else</span></div><div class="line">    {</div><div class="line">        ppm.open(argv[1]);</div><div class="line">        <span class="comment">// Initialize just the dimensions and format of your buffers:</span></div><div class="line">        ppm.init_image(src, <a class="code" href="namespacearm__compute.xhtml#ab4e88c89b3b7ea1735996cc4def22d58a6669348b484e3008dca2bfa8e85e40b5">Format::U8</a>);</div><div class="line">    }</div><div class="line"></div><div class="line">    <span class="comment">// Initialize just the dimensions and format of the temporary and destination images:</span></div><div class="line">    tmp.allocator()-&gt;init(*src.info());</div><div class="line">    dst.allocator()-&gt;init(*src.info());</div><div class="line"></div><div class="line">    NEConvolution3x3 conv3x3;</div><div class="line">    NEConvolution5x5 conv5x5;</div><div class="line"></div><div class="line">    <span 
class="comment">// Apply a Gaussian 3x3 filter to the source image followed by a Gaussian 5x5:</span></div><div class="line">    <span class="comment">// The function will automatically update the padding information inside input and output to match its requirements</span></div><div class="line">    conv3x3.configure(&amp;src, &amp;tmp, <a class="code" href="cl__convolution_8cpp.xhtml#a741ba5321da40184f8653e0a50ace070">gaussian3x3</a>, 0 <span class="comment">/* Let arm_compute calculate the scale */</span>, <a class="code" href="namespacearm__compute.xhtml#a15a05537a472ee742404821851529327a0db45d2a4141101bdfe48e3314cfbca3">BorderMode::UNDEFINED</a>);</div><div class="line">    conv5x5.configure(&amp;tmp, &amp;dst, <a class="code" href="cl__convolution_8cpp.xhtml#a565013cf7e49a591bacd548571951f94">gaussian5x5</a>, 0 <span class="comment">/* Let arm_compute calculate the scale */</span>, <a class="code" href="namespacearm__compute.xhtml#a15a05537a472ee742404821851529327a0db45d2a4141101bdfe48e3314cfbca3">BorderMode::UNDEFINED</a>);</div><div class="line"></div><div class="line">    <span class="comment">// Now that the padding requirements are known we can allocate the images:</span></div><div class="line">    src.allocator()-&gt;allocate();</div><div class="line">    tmp.allocator()-&gt;allocate();</div><div class="line">    dst.allocator()-&gt;allocate();</div><div class="line"></div><div class="line">    <span class="comment">// Fill the input image with the content of the PPM image if a filename was provided:</span></div><div class="line">    <span class="keywordflow">if</span>(ppm.is_open())</div><div class="line">    {</div><div class="line">        ppm.fill_image(src);</div><div class="line">    }</div><div class="line"></div><div class="line">    <span class="comment">//Execute the functions:</span></div><div class="line">    conv3x3.run();</div><div class="line">    conv5x5.run();</div><div class="line"></div><div class="line">    <span class="comment">// 
Save the result to file:</span></div><div class="line">    <span class="keywordflow">if</span>(ppm.is_open())</div><div class="line">    {</div><div class="line">        <span class="keyword">const</span> std::string output_filename = std::string(argv[1]) + <span class="stringliteral">&quot;_out.ppm&quot;</span>;</div><div class="line">        <a class="code" href="namespacetest__helpers.xhtml#a5036a1b77bd7223a68954b5078c6545a">save_to_ppm</a>(dst, output_filename);</div><div class="line">    }</div></div><!-- fragment --> <dl class="section note"><dt>Note</dt><dd>It's important to call allocate <b>after</b> the function is configured: if the image / tensor is already allocated then the function will shrink its execution window instead of increasing the padding. (See below for more details).</dd></dl>
 <ul>
-<li>Manual padding / no padding / auto padding: You can allocate your images / tensors up front (before configuring your functions), in that case the function will use whatever padding is available and will shrink its execution window if there isn't enough padding available (Which will translates into a smaller valid region for the output <dl class="section see"><dt>See also</dt><dd>valid_region). If you don't want to manually set the padding but still want to allocate your objects upfront then you can use auto_padding.</dd></dl>
-<div class="fragment"><div class="line"><a class="code" href="struct_image.xhtml">Image</a>     src, dst;</div><div class="line"></div><div class="line"><span class="comment">// Use auto padding for the input:</span></div><div class="line">src.info()-&gt;init_auto_padding(TensorShape(640u,480u), <a class="code" href="namespacearm__compute.xhtml#ab4e88c89b3b7ea1735996cc4def22d58a6669348b484e3008dca2bfa8e85e40b5">Format::U8</a>);</div><div class="line"></div><div class="line"><span class="comment">// Use manual padding for the destination image</span></div><div class="line">dst.info()-&gt;init(src.info()-&gt;tensor_shape(), <a class="code" href="namespacearm__compute.xhtml#ab4e88c89b3b7ea1735996cc4def22d58a6669348b484e3008dca2bfa8e85e40b5">Format::U8</a>, strides_in_bytes, offset_first_element_in_bytes, total_size_in_bytes);</div><div class="line"></div><div class="line"><span class="comment">// Allocate all the images</span></div><div class="line">src.allocator()-&gt;allocate();</div><div class="line">dst.allocator()-&gt;allocate();</div><div class="line"><span class="comment">// Fill the input image with the content of the PPM image if a filename was provided:</span></div><div class="line">fill_image(src);</div><div class="line"></div><div class="line">NEGaussian3x3 gauss;</div><div class="line"></div><div class="line"><span class="comment">// Apply a Gaussian 3x3 filter to the source image (Note: if the padding provided is not enough then the execution window and valid region of the output will be shrunk)</span></div><div class="line">gauss.configure(&amp;src, &amp;dst, <a class="code" href="namespacearm__compute.xhtml#a15a05537a472ee742404821851529327a0db45d2a4141101bdfe48e3314cfbca3">BorderMode::UNDEFINED</a>);</div><div class="line"></div><div class="line"><span class="comment">//Execute the functions:</span></div><div class="line">gauss.run();</div></div><!-- fragment --></li>
+<li>Manual padding / no padding / auto padding: You can allocate your images / tensors up front (before configuring your functions). In that case the function will use whatever padding is available and will shrink its execution window if there isn't enough padding available (which will translate into a smaller valid region for the output; see also <a class="el" href="index.xhtml#valid_region">Valid regions</a>). If you don't want to manually set the padding but still want to allocate your objects up front, you can use auto_padding.</li>
 </ul>
-<dl class="section warning"><dt>Warning</dt><dd>Some kernels need up to 3 neighbour values to calculate the value of a given pixel, therefore to be safe we use a 4 pixels padding all around the image and some kernels read and write up to 32 pixels at the time, therefore we add an extra 32 pixels of padding at the end of each row to be safe. As a result auto padded buffers waste a lot of memory and are less cache friendly. It is therefore recommended to use accurate padding or manual padding wherever possible.</dd></dl>
+<div class="fragment"><div class="line"><a class="code" href="struct_image.xhtml">Image</a>     src, dst;</div><div class="line"></div><div class="line"><span class="comment">// Use auto padding for the input:</span></div><div class="line">src.info()-&gt;init_auto_padding(TensorShape(640u,480u), <a class="code" href="namespacearm__compute.xhtml#ab4e88c89b3b7ea1735996cc4def22d58a6669348b484e3008dca2bfa8e85e40b5">Format::U8</a>);</div><div class="line"></div><div class="line"><span class="comment">// Use manual padding for the destination image</span></div><div class="line">dst.info()-&gt;init(src.info()-&gt;tensor_shape(), <a class="code" href="namespacearm__compute.xhtml#ab4e88c89b3b7ea1735996cc4def22d58a6669348b484e3008dca2bfa8e85e40b5">Format::U8</a>, strides_in_bytes, offset_first_element_in_bytes, total_size_in_bytes);</div><div class="line"></div><div class="line"><span class="comment">// Allocate all the images</span></div><div class="line">src.allocator()-&gt;allocate();</div><div class="line">dst.allocator()-&gt;allocate();</div><div class="line"><span class="comment">// Fill the input image with the content of the PPM image if a filename was provided:</span></div><div class="line">fill_image(src);</div><div class="line"></div><div class="line">NEGaussian3x3 gauss;</div><div class="line"></div><div class="line"><span class="comment">// Apply a Gaussian 3x3 filter to the source image (Note: if the padding provided is not enough then the execution window and valid region of the output will be shrunk)</span></div><div class="line">gauss.configure(&amp;src, &amp;dst, <a class="code" href="namespacearm__compute.xhtml#a15a05537a472ee742404821851529327a0db45d2a4141101bdfe48e3314cfbca3">BorderMode::UNDEFINED</a>);</div><div class="line"></div><div class="line"><span class="comment">//Execute the functions:</span></div><div class="line">gauss.run();</div></div><!-- fragment --><dl class="section warning"><dt>Warning</dt><dd>Some kernels need up to 3 neighbour 
values to calculate the value of a given pixel, so to be safe we use a padding of 4 pixels all around the image. Some kernels also read and write up to 32 pixels at a time, so we add an extra 32 pixels of padding at the end of each row. As a result, auto-padded buffers waste a lot of memory and are less cache friendly. It is therefore recommended to use accurate padding or manual padding wherever possible.</dd></dl>
 <h4><a class="anchor" id="valid_region"></a>
 Valid regions</h4>
 <p>Some kernels (like edge detectors for example) need to read values of neighbouring pixels to calculate the value of a given pixel, it is therefore not possible to calculate the values of the pixels on the edges.</p>
 <p>Another case is: if a kernel processes 8 pixels per iteration then if the image's dimensions is not a multiple of 8 and not enough padding is available then the kernel will not be able to process the pixels near the right edge as a result these pixels will be left undefined.</p>
-<p>In order to know which pixels have been calculated, each kernel sets a valid region for each output image or tensor </p><dl class="section see"><dt>See also</dt><dd><a class="el" href="classarm__compute_1_1_tensor_info.xhtml#ac437ef0718add962a4059fb3b3084c34" title="Valid region of the tensor. ">TensorInfo::valid_region()</a>, <a class="el" href="structarm__compute_1_1_valid_region.xhtml">ValidRegion</a></dd></dl>
+<p>In order to know which pixels have been calculated, each kernel sets a valid region for each output image or tensor. See also <a class="el" href="classarm__compute_1_1_tensor_info.xhtml#ac437ef0718add962a4059fb3b3084c34">TensorInfo::valid_region()</a> and <a class="el" href="structarm__compute_1_1_valid_region.xhtml">ValidRegion</a>.</p>
 <dl class="section attention"><dt>Attention</dt><dd>Valid regions and accurate padding have only been introduced in the library recently therefore not all the kernels and functions have been ported to use them yet. All the non ported kernels will set the <a class="el" href="structarm__compute_1_1_valid_region.xhtml">ValidRegion</a> equal to the <a class="el" href="classarm__compute_1_1_tensor_shape.xhtml">TensorShape</a>.</dd></dl>
 <p>List of kernels which haven't been ported yet:</p>
 <ul>
-<li><a class="el" href="classarm__compute_1_1_c_l_color_convert_kernel.xhtml">CLColorConvertKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_edge_non_max_suppression_kernel.xhtml">CLEdgeNonMaxSuppressionKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_edge_trace_kernel.xhtml">CLEdgeTraceKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_gaussian_pyramid_hor_kernel.xhtml">CLGaussianPyramidHorKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_gaussian_pyramid_vert_kernel.xhtml">CLGaussianPyramidVertKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_c_l_gradient_kernel.xhtml">CLGradientKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_channel_combine_kernel.xhtml">NEChannelCombineKernel</a></li>
 <li><a class="el" href="classarm__compute_1_1_n_e_color_convert_kernel.xhtml">NEColorConvertKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_fill_array_kernel.xhtml">NEFillArrayKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_gaussian_pyramid_hor_kernel.xhtml">NEGaussianPyramidHorKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_gaussian_pyramid_vert_kernel.xhtml">NEGaussianPyramidVertKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_harris_score_f_p16_kernel.xhtml">NEHarrisScoreFP16Kernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_harris_score_kernel.xhtml">NEHarrisScoreKernel</a></li>
 <li><a class="el" href="classarm__compute_1_1_n_e_histogram_kernel.xhtml">NEHistogramKernel</a></li>
 <li><a class="el" href="classarm__compute_1_1_n_e_histogram_border_kernel.xhtml">NEHistogramBorderKernel</a></li>
 <li><a class="el" href="classarm__compute_1_1_n_e_h_o_g_block_normalization_kernel.xhtml">NEHOGBlockNormalizationKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_h_o_g_detector_kernel.xhtml">NEHOGDetectorKernel</a></li>
 <li><a class="el" href="classarm__compute_1_1_n_e_h_o_g_orientation_binning_kernel.xhtml">NEHOGOrientationBinningKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_logits1_d_max_kernel.xhtml">NELogits1DMaxKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_logits1_d_shift_exp_sum_kernel.xhtml">NELogits1DShiftExpSumKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_logits1_d_norm_kernel.xhtml">NELogits1DNormKernel</a></li>
 <li><a class="el" href="classarm__compute_1_1_n_e_l_k_tracker_kernel.xhtml">NELKTrackerKernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_non_maxima_suppression3x3_f_p16_kernel.xhtml">NENonMaximaSuppression3x3FP16Kernel</a></li>
-<li><a class="el" href="classarm__compute_1_1_n_e_non_maxima_suppression3x3_kernel.xhtml">NENonMaximaSuppression3x3Kernel</a></li>
 </ul>
 <h3><a class="anchor" id="S4_6_2_tensors"></a>
 Tensors</h3>
@@ -574,14 +608,20 @@
 <dl class="section note"><dt>Note</dt><dd>Unless specified otherwise in the kernel's or function's documentation all tensors and images parameters passed must have identical dimensions.</dd>
 <dd>
 Unless specified otherwise in the kernel's or function's documentation the number of channels for tensors is expected to be 1 (For images, the number of channels is inferred from the <a class="el" href="namespacearm__compute.xhtml#ab4e88c89b3b7ea1735996cc4def22d58">Format</a>).</dd></dl>
-<h3><a class="anchor" id="S4_6_4_working_with_objects"></a>
-Working with Images and Tensors</h3>
-<p>In the case that no padding exists in the Image/Tensor object you can linearize the object memory and directly copy to/from it. </p><div class="fragment"><div class="line"><span class="comment">// Create a tensor object</span></div><div class="line">Tensor tensor;</div><div class="line"><span class="comment">// Operate on tensor</span></div><div class="line">...</div><div class="line"><span class="comment">// Copy results</span></div><div class="line">unsigned <span class="keywordtype">char</span> *dst = ... <span class="comment">// Your unpadded destination buffer</span></div><div class="line"><span class="comment">// Copy tensor as a linear bulk of memory if no padding exists</span></div><div class="line"><span class="keywordflow">if</span>(!tensor.info()-&gt;has_padding())</div><div class="line">{</div><div class="line">    std::copy_n(tensor.buffer(), tensor.info()-&gt;total_size(), dst);</div><div class="line">}</div></div><!-- fragment --><p>On the other hand, in case of padding, each row should be carefully copied separately. </p><div class="fragment"><div class="line"><span class="comment">// Create an image object</span></div><div class="line"><a class="code" href="struct_image.xhtml">Image</a> img;</div><div class="line"><span class="comment">// Initialize image</span></div><div class="line"><span class="keyword">const</span> <span class="keywordtype">unsigned</span> <span class="keywordtype">char</span> *src = ... 
<span class="comment">// Your unpadded input buffer</span></div><div class="line"><span class="comment">// Initialize the Image object using an RGB source image</span></div><div class="line"><span class="keywordflow">for</span>(<span class="keywordtype">unsigned</span> <span class="keywordtype">int</span> y = 0; y &lt; height; ++y)</div><div class="line">{</div><div class="line">    <span class="comment">// Copy one RGB row at a time</span></div><div class="line">    std::copy_n(img.buffer() + img.info()-&gt;offset_element_in_bytes(Coordinates(0, y)), width * 3, src + (y * width) * 3);</div><div class="line">}</div></div><!-- fragment --> </div></div><!-- contents -->
+<dl class="section attention"><dt>Attention</dt><dd>Regardless of the <a class="el" href="namespacearm__compute.xhtml#ad8ed01ff3ff33333d8e19db4d2818bb6">DataType</a> used by a tensor the <a class="el" href="classarm__compute_1_1_i_tensor.xhtml#ab988210662dbd3bf32fd563c7dd1bdbf">ITensor::buffer()</a> method will always return a uint8_t pointer, and all the metadata in <a class="el" href="classarm__compute_1_1_tensor_info.xhtml">TensorInfo</a> will be expressed in bytes. It is the user's responsibility to cast the pointer to the correct type.</dd></dl>
+<p>For example, to read the element located at the coordinates (x,y) of a float tensor:</p>
+<div class="fragment"><div class="line"><span class="keywordtype">float</span> value = *<span class="keyword">reinterpret_cast&lt;</span><span class="keywordtype">float</span>*<span class="keyword">&gt;</span>(input.buffer() + input.info()-&gt;offset_element_in_bytes(Coordinates(x,y)));</div></div><!-- fragment --><h3><a class="anchor" id="S4_6_4_working_with_objects"></a>
+Working with Images and Tensors using iterators</h3>
+<p>The library provides some iterators to access objects' data. Iterators are created by associating a data object (an image or a tensor, for example) with an iteration window.</p>
+<p>Iteration windows are defined by an array of dimensions, each of which is made up of a start, an end and a step.</p>
+<p>The <a class="el" href="namespacearm__compute.xhtml#a78fd1c0056e9add7ab01b8e118c0038d">execute_window_loop</a> function takes an execution window, a lambda function and one or more iterators. It will iterate through every element of the execution window and for each element it will update the iterators accordingly and call the lambda function.</p>
+<p>Here are a couple of examples of how to use the iterators to fill / read tensors:</p>
+<div class="fragment"><div class="line">    constexpr <span class="keywordtype">unsigned</span> <span class="keywordtype">int</span> width  = 4;</div><div class="line">    constexpr <span class="keywordtype">unsigned</span> <span class="keywordtype">int</span> height = 3;</div><div class="line">    constexpr <span class="keywordtype">unsigned</span> <span class="keywordtype">int</span> batch  = 2;</div><div class="line"></div><div class="line">    <span class="keyword">auto</span> *src_data = <span class="keyword">new</span> <span class="keywordtype">float</span>[width * height * batch];</div><div class="line">    <span class="keyword">auto</span> *dst_data = <span class="keyword">new</span> <span class="keywordtype">float</span>[width * height * batch];</div><div class="line"></div><div class="line">    <span class="comment">// Fill src_data with dummy values:</span></div><div class="line">    <span class="keywordflow">for</span>(<span class="keywordtype">unsigned</span> <span class="keywordtype">int</span> b = 0; b &lt; batch; b++)</div><div class="line">    {</div><div class="line">        <span class="keywordflow">for</span>(<span class="keywordtype">unsigned</span> <span class="keywordtype">int</span> h = 0; h &lt; height; h++)</div><div class="line">        {</div><div class="line">            <span class="keywordflow">for</span>(<span class="keywordtype">unsigned</span> <span class="keywordtype">int</span> w = 0; w &lt; width; w++)</div><div class="line">            {</div><div class="line">                src_data[b * (width * height) + h * width + w] = static_cast&lt;float&gt;(100 * b + 10 * h + w);</div><div class="line">            }</div><div class="line">        }</div><div class="line">    }</div><div class="line"></div><div class="line">    Tensor         input, output;</div><div class="line">    NESoftmaxLayer softmax;</div><div class="line"></div><div class="line">    <span class="comment">// Initialize the tensors dimensions and 
type:</span></div><div class="line">    <span class="keyword">const</span> TensorShape shape(width, height, batch);</div><div class="line">    input.allocator()-&gt;init(TensorInfo(shape, 1, <a class="code" href="namespacearm__compute.xhtml#ab4e88c89b3b7ea1735996cc4def22d58a44ad4ef5a76e6aa6fb3e3fa079a54fda">DataType::F32</a>));</div><div class="line">    output.allocator()-&gt;init(TensorInfo(shape, 1, <a class="code" href="namespacearm__compute.xhtml#ab4e88c89b3b7ea1735996cc4def22d58a44ad4ef5a76e6aa6fb3e3fa079a54fda">DataType::F32</a>));</div><div class="line"></div><div class="line">    <span class="comment">// Configure softmax:</span></div><div class="line">    softmax.configure(&amp;input, &amp;output);</div><div class="line"></div><div class="line">    <span class="comment">// Allocate the input / output tensors:</span></div><div class="line">    input.allocator()-&gt;allocate();</div><div class="line">    output.allocator()-&gt;allocate();</div><div class="line"></div><div class="line">    <span class="comment">// Fill the input tensor:</span></div><div class="line">    <span class="comment">// Simplest way: create an iterator to iterate through each element of the input tensor:</span></div><div class="line">    Window input_window;</div><div class="line">    input_window.use_tensor_dimensions(input.info());</div><div class="line">    std::cout &lt;&lt; <span class="stringliteral">&quot; Dimensions of the input&#39;s iterator:\n&quot;</span>;</div><div class="line">    std::cout &lt;&lt; <span class="stringliteral">&quot; X = [start=&quot;</span> &lt;&lt; input_window.x().start() &lt;&lt; <span class="stringliteral">&quot;, end=&quot;</span> &lt;&lt; input_window.x().end() &lt;&lt; <span class="stringliteral">&quot;, step=&quot;</span> &lt;&lt; input_window.x().step() &lt;&lt; <span class="stringliteral">&quot;]\n&quot;</span>;</div><div class="line">    std::cout &lt;&lt; <span class="stringliteral">&quot; Y = [start=&quot;</span> &lt;&lt; 
input_window.y().start() &lt;&lt; <span class="stringliteral">&quot;, end=&quot;</span> &lt;&lt; input_window.y().end() &lt;&lt; <span class="stringliteral">&quot;, step=&quot;</span> &lt;&lt; input_window.y().step() &lt;&lt; <span class="stringliteral">&quot;]\n&quot;</span>;</div><div class="line">    std::cout &lt;&lt; <span class="stringliteral">&quot; Z = [start=&quot;</span> &lt;&lt; input_window.z().start() &lt;&lt; <span class="stringliteral">&quot;, end=&quot;</span> &lt;&lt; input_window.z().end() &lt;&lt; <span class="stringliteral">&quot;, step=&quot;</span> &lt;&lt; input_window.z().step() &lt;&lt; <span class="stringliteral">&quot;]\n&quot;</span>;</div><div class="line"></div><div class="line">    <span class="comment">// Create an iterator:</span></div><div class="line">    Iterator input_it(&amp;input, input_window);</div><div class="line"></div><div class="line">    <span class="comment">// Iterate through the elements of src_data and copy them one by one to the input tensor:</span></div><div class="line">    <span class="comment">// This is equivalent to:</span></div><div class="line">    <span class="comment">// for( unsigned int z = 0; z &lt; batch; ++z)</span></div><div class="line">    <span class="comment">// {</span></div><div class="line">    <span class="comment">//   for( unsigned int y = 0; y &lt; height; ++y)</span></div><div class="line">    <span class="comment">//   {</span></div><div class="line">    <span class="comment">//     for( unsigned int x = 0; x &lt; width; ++x)</span></div><div class="line">    <span class="comment">//     {</span></div><div class="line">    <span class="comment">//       *reinterpret_cast&lt;float*&gt;( input.buffer() + input.info()-&gt;offset_element_in_bytes(Coordinates(x,y,z))) = src_data[ z * (width*height) + y * width + x];</span></div><div class="line">    <span class="comment">//     }</span></div><div class="line">    <span class="comment">//   }</span></div><div class="line">    <span 
class="comment">// }</span></div><div class="line">    <span class="comment">// Except it works for an arbitrary number of dimensions</span></div><div class="line">    <a class="code" href="namespacearm__compute.xhtml#a78fd1c0056e9add7ab01b8e118c0038d">execute_window_loop</a>(input_window, [&amp;](<span class="keyword">const</span> Coordinates &amp; <span class="keywordtype">id</span>)</div><div class="line">    {</div><div class="line">        std::cout &lt;&lt; <span class="stringliteral">&quot;Setting item [&quot;</span> &lt;&lt; <span class="keywordtype">id</span>.x() &lt;&lt; <span class="stringliteral">&quot;,&quot;</span> &lt;&lt; <span class="keywordtype">id</span>.y() &lt;&lt; <span class="stringliteral">&quot;,&quot;</span> &lt;&lt; <span class="keywordtype">id</span>.z() &lt;&lt; <span class="stringliteral">&quot;]\n&quot;</span>;</div><div class="line">        *<span class="keyword">reinterpret_cast&lt;</span><span class="keywordtype">float</span> *<span class="keyword">&gt;</span>(input_it.ptr()) = src_data[<span class="keywordtype">id</span>.z() * (width * height) + <span class="keywordtype">id</span>.y() * width + <span class="keywordtype">id</span>.x()];</div><div class="line">    },</div><div class="line">    input_it);</div><div class="line"></div><div class="line">    <span class="comment">// Run NEON softmax:</span></div><div class="line">    softmax.run();</div><div class="line"></div><div class="line">    <span class="comment">// More efficient way: create an iterator to iterate through each row (instead of each element) of the output tensor:</span></div><div class="line">    Window output_window;</div><div class="line">    output_window.use_tensor_dimensions(output.info(), <span class="comment">/* first_dimension =*/</span><a class="code" href="classarm__compute_1_1_window.xhtml#ad2d402364fa822b0b7775081291eeca9">Window::DimY</a>); <span class="comment">// Iterate through the rows (not each element)</span></div><div class="line">    std::cout 
&lt;&lt; <span class="stringliteral">&quot; Dimensions of the output&#39;s iterator:\n&quot;</span>;</div><div class="line">    std::cout &lt;&lt; <span class="stringliteral">&quot; X = [start=&quot;</span> &lt;&lt; output_window.x().start() &lt;&lt; <span class="stringliteral">&quot;, end=&quot;</span> &lt;&lt; output_window.x().end() &lt;&lt; <span class="stringliteral">&quot;, step=&quot;</span> &lt;&lt; output_window.x().step() &lt;&lt; <span class="stringliteral">&quot;]\n&quot;</span>;</div><div class="line">    std::cout &lt;&lt; <span class="stringliteral">&quot; Y = [start=&quot;</span> &lt;&lt; output_window.y().start() &lt;&lt; <span class="stringliteral">&quot;, end=&quot;</span> &lt;&lt; output_window.y().end() &lt;&lt; <span class="stringliteral">&quot;, step=&quot;</span> &lt;&lt; output_window.y().step() &lt;&lt; <span class="stringliteral">&quot;]\n&quot;</span>;</div><div class="line">    std::cout &lt;&lt; <span class="stringliteral">&quot; Z = [start=&quot;</span> &lt;&lt; output_window.z().start() &lt;&lt; <span class="stringliteral">&quot;, end=&quot;</span> &lt;&lt; output_window.z().end() &lt;&lt; <span class="stringliteral">&quot;, step=&quot;</span> &lt;&lt; output_window.z().step() &lt;&lt; <span class="stringliteral">&quot;]\n&quot;</span>;</div><div class="line"></div><div class="line">    <span class="comment">// Create an iterator:</span></div><div class="line">    Iterator output_it(&amp;output, output_window);</div><div class="line"></div><div class="line">    <span class="comment">// Iterate through the rows of the output tensor and copy them to dst_data:</span></div><div class="line">    <span class="comment">// This is equivalent to:</span></div><div class="line">    <span class="comment">// for( unsigned int z = 0; z &lt; batch; ++z)</span></div><div class="line">    <span class="comment">// {</span></div><div class="line">    <span class="comment">//   for( unsigned int y = 0; y &lt; height; ++y)</span></div><div class="line">  
  <span class="comment">//   {</span></div><div class="line">    <span class="comment">//     memcpy( dst_data + z * (width*height) + y * width, output.buffer() + output.info()-&gt;offset_element_in_bytes(Coordinates(0,y,z)), width * sizeof(float));</span></div><div class="line">    <span class="comment">//   }</span></div><div class="line">    <span class="comment">// }</span></div><div class="line">    <span class="comment">// Except it works for an arbitrary number of dimensions</span></div><div class="line">    <a class="code" href="namespacearm__compute.xhtml#a78fd1c0056e9add7ab01b8e118c0038d">execute_window_loop</a>(output_window, [&amp;](<span class="keyword">const</span> Coordinates &amp; <span class="keywordtype">id</span>)</div><div class="line">    {</div><div class="line">        std::cout &lt;&lt; <span class="stringliteral">&quot;Copying one row starting from [&quot;</span> &lt;&lt; <span class="keywordtype">id</span>.x() &lt;&lt; <span class="stringliteral">&quot;,&quot;</span> &lt;&lt; <span class="keywordtype">id</span>.y() &lt;&lt; <span class="stringliteral">&quot;,&quot;</span> &lt;&lt; <span class="keywordtype">id</span>.z() &lt;&lt; <span class="stringliteral">&quot;]\n&quot;</span>;</div><div class="line">        <span class="comment">// Copy one whole row:</span></div><div class="line">        memcpy(dst_data + <span class="keywordtype">id</span>.z() * (width * height) + <span class="keywordtype">id</span>.y() * width, output_it.ptr(), width * <span class="keyword">sizeof</span>(float));</div><div class="line">    },</div><div class="line">    output_it);</div><div class="line"></div><div class="line">    <span class="keyword">delete</span>[] src_data;</div><div class="line">    <span class="keyword">delete</span>[] dst_data;</div></div><!-- fragment --></div></div><!-- contents -->
 </div><!-- doc-content -->
 <!-- start footer part -->
 <div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
   <ul>
-    <li class="footer">Generated on Fri Mar 24 2017 17:23:51 for ARM Compute Library by
+    <li class="footer">Generated on Wed Apr 12 2017 14:26:06 for ARM Compute Library by
     <a href="http://www.doxygen.org/index.html">
     <img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.11 </li>
   </ul>