Blame - llvm/docs/Vectorizers.rst - toolchain/llvm-project

blob: 5ec3da43ac7a660950f710810fca7cfb877ece8c [file] [log] [blame]

Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	1	==========================
				2	Auto-Vectorization in LLVM
				3	==========================
				4
Sean Silva	99e12f9	2012-12-20 22:42:20 +0000	[diff] [blame]	5	.. contents::
				6	:local:
				7
				8	LLVM has two vectorizers: The :ref:`Loop Vectorizer <loop-vectorizer>`,
				9	which operates on Loops, and the :ref:`Basic Block Vectorizer
				10	<bb-vectorizer>`, which optimizes straight-line code. These vectorizers
				11	focus on different optimization opportunities and use different techniques.
				12	The BB vectorizer merges multiple scalars that are found in the code into
				13	vectors while the Loop Vectorizer widens instructions in the original loop
				14	to operate on multiple consecutive loop iterations.
				15
				16	.. _loop-vectorizer:
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	17
				18	The Loop Vectorizer
				19	===================
				20
Nadav Rotem	0328f5e	2012-12-19 18:04:44 +0000	[diff] [blame]	21	Usage
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	22	-----
Nadav Rotem	0328f5e	2012-12-19 18:04:44 +0000	[diff] [blame]	23
Sean Silva	13ed79c	2012-12-20 02:23:25 +0000	[diff] [blame]	24	LLVM's Loop Vectorizer is now available and will be useful for many people.
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	25	It is not enabled by default, but can be enabled through clang using the
				26	command line flag:
				27
				28	.. code-block:: console
				29
Nadav Rotem	8f4a6cc	2012-12-19 18:02:36 +0000	[diff] [blame]	30	$ clang -fvectorize -O3 file.c
				31
				32	If the ``-fvectorize`` flag is used then the loop vectorizer will be enabled
				33	when running with ``-O3``, ``-O2``. When ``-Os`` is used, the loop vectorizer
				34	will only vectorize loops that do not require a major increase in code size.
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	35
				36	We plan to enable the Loop Vectorizer by default as part of the LLVM 3.3 release.
				37
				38	Features
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	39	--------
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	40
				41	The LLVM Loop Vectorizer has a number of features that allow it to vectorize
				42	complex loops.
				43
				44	Loops with unknown trip count
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	45	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	46
				47	The Loop Vectorizer supports loops with an unknown trip count.
				48	In the loop below, the iteration ``start`` and ``finish`` points are unknown,
				49	and the Loop Vectorizer has a mechanism to vectorize loops that do not start
Sean Silva	13ed79c	2012-12-20 02:23:25 +0000	[diff] [blame]	50	at zero. In this example, 'n' may not be a multiple of the vector width, and
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	51	the vectorizer has to execute the last few iterations as scalar code. Keeping
				52	a scalar copy of the loop increases the code size.
				53
				54	.. code-block:: c++
				55
				56	void bar(float A, float B, float K, int start, int end) {
Sean Silva	8c44a47	2012-12-20 22:47:41 +0000	[diff] [blame]	57	for (int i = start; i < end; ++i)
				58	A[i] *= B[i] + K;
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	59	}
				60
				61	Runtime Checks of Pointers
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	62	^^^^^^^^^^^^^^^^^^^^^^^^^^
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	63
				64	In the example below, if the pointers A and B point to consecutive addresses,
				65	then it is illegal to vectorize the code because some elements of A will be
				66	written before they are read from array B.
				67
				68	Some programmers use the 'restrict' keyword to notify the compiler that the
				69	pointers are disjointed, but in our example, the Loop Vectorizer has no way of
				70	knowing that the pointers A and B are unique. The Loop Vectorizer handles this
				71	loop by placing code that checks, at runtime, if the arrays A and B point to
				72	disjointed memory locations. If arrays A and B overlap, then the scalar version
Sean Silva	287e7d2	2012-12-20 22:59:36 +0000	[diff] [blame^]	73	of the loop is executed.
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	74
				75	.. code-block:: c++
				76
				77	void bar(float A, float B, float K, int n) {
Sean Silva	8c44a47	2012-12-20 22:47:41 +0000	[diff] [blame]	78	for (int i = 0; i < n; ++i)
				79	A[i] *= B[i] + K;
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	80	}
				81
				82
				83	Reductions
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	84	^^^^^^^^^^
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	85
Sean Silva	287e7d2	2012-12-20 22:59:36 +0000	[diff] [blame^]	86	In this example the ``sum`` variable is used by consecutive iterations of
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	87	the loop. Normally, this would prevent vectorization, but the vectorizer can
Sean Silva	13ed79c	2012-12-20 02:23:25 +0000	[diff] [blame]	88	detect that 'sum' is a reduction variable. The variable 'sum' becomes a vector
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	89	of integers, and at the end of the loop the elements of the array are added
Sean Silva	287e7d2	2012-12-20 22:59:36 +0000	[diff] [blame^]	90	together to create the correct result. We support a number of different
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	91	reduction operations, such as addition, multiplication, XOR, AND and OR.
				92
				93	.. code-block:: c++
				94
				95	int foo(int A, int B, int n) {
				96	unsigned sum = 0;
				97	for (int i = 0; i < n; ++i)
Sean Silva	287e7d2	2012-12-20 22:59:36 +0000	[diff] [blame^]	98	sum += A[i] + 5;
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	99	return sum;
				100	}
				101
				102	Inductions
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	103	^^^^^^^^^^
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	104
				105	In this example the value of the induction variable ``i`` is saved into an
				106	array. The Loop Vectorizer knows to vectorize induction variables.
				107
				108	.. code-block:: c++
				109
				110	void bar(float A, float B, float K, int n) {
Sean Silva	8c44a47	2012-12-20 22:47:41 +0000	[diff] [blame]	111	for (int i = 0; i < n; ++i)
				112	A[i] = i;
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	113	}
				114
				115	If Conversion
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	116	^^^^^^^^^^^^^
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	117
				118	The Loop Vectorizer is able to "flatten" the IF statement in the code and
				119	generate a single stream of instructions. The Loop Vectorizer supports any
				120	control flow in the innermost loop. The innermost loop may contain complex
				121	nesting of IFs, ELSEs and even GOTOs.
				122
				123	.. code-block:: c++
				124
				125	int foo(int A, int B, int n) {
				126	unsigned sum = 0;
				127	for (int i = 0; i < n; ++i)
				128	if (A[i] > B[i])
				129	sum += A[i] + 5;
				130	return sum;
				131	}
				132
				133	Pointer Induction Variables
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	134	^^^^^^^^^^^^^^^^^^^^^^^^^^^
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	135
				136	This example uses the "accumulate" function of the standard c++ library. This
				137	loop uses C++ iterators, which are pointers, and not integer indices.
				138	The Loop Vectorizer detects pointer induction variables and can vectorize
				139	this loop. This feature is important because many C++ programs use iterators.
				140
				141	.. code-block:: c++
				142
				143	int baz(int *A, int n) {
				144	return std::accumulate(A, A + n, 0);
				145	}
				146
				147	Reverse Iterators
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	148	^^^^^^^^^^^^^^^^^
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	149
				150	The Loop Vectorizer can vectorize loops that count backwards.
				151
				152	.. code-block:: c++
				153
				154	int foo(int A, int B, int n) {
				155	for (int i = n; i > 0; --i)
				156	A[i] +=1;
				157	}
				158
				159	Scatter / Gather
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	160	^^^^^^^^^^^^^^^^
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	161
Sean Silva	287e7d2	2012-12-20 22:59:36 +0000	[diff] [blame^]	162	The Loop Vectorizer can vectorize code that becomes scatter/gather
				163	memory accesses.
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	164
				165	.. code-block:: c++
				166
				167	int foo(int A, int B, int n, int k) {
Sean Silva	8c44a47	2012-12-20 22:47:41 +0000	[diff] [blame]	168	for (int i = 0; i < n; ++i)
Sean Silva	e140b2e	2012-12-20 22:49:13 +0000	[diff] [blame]	169	A[i7] += B[ik];
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	170	}
				171
Nadav Rotem	af14a3f	2012-12-19 07:36:35 +0000	[diff] [blame]	172	Vectorization of Mixed Types
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	173	^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	174
				175	The Loop Vectorizer can vectorize programs with mixed types. The Vectorizer
				176	cost model can estimate the cost of the type conversion and decide if
				177	vectorization is profitable.
				178
				179	.. code-block:: c++
				180
				181	int foo(int A, char B, int n, int k) {
Sean Silva	8c44a47	2012-12-20 22:47:41 +0000	[diff] [blame]	182	for (int i = 0; i < n; ++i)
Sean Silva	e140b2e	2012-12-20 22:49:13 +0000	[diff] [blame]	183	A[i] += 4 * B[i];
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	184	}
				185
				186	Vectorization of function calls
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	187	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	188
				189	The Loop Vectorize can vectorize intrinsic math functions.
				190	See the table below for a list of these functions.
				191
				192	+-----+-----+---------+
				193	\| pow \| exp \| exp2 \|
				194	+-----+-----+---------+
				195	\| sin \| cos \| sqrt \|
				196	+-----+-----+---------+
				197	\| log \|log2 \| log10 \|
				198	+-----+-----+---------+
				199	\|fabs \|floor\| ceil \|
				200	+-----+-----+---------+
				201	\|fma \|trunc\|nearbyint\|
				202	+-----+-----+---------+
				203
Nadav Rotem	15bdbbe	2012-12-19 08:28:24 +0000	[diff] [blame]	204	Performance
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	205	-----------
Nadav Rotem	15bdbbe	2012-12-19 08:28:24 +0000	[diff] [blame]	206
Sean Silva	287e7d2	2012-12-20 22:59:36 +0000	[diff] [blame^]	207	This section shows the the execution time of Clang on a simple benchmark:
Nadav Rotem	90c8b4b	2012-12-19 08:43:05 +0000	[diff] [blame]	208	`gcc-loops <http://llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/UnitTests/Vectorizer/>`_.
Sean Silva	287e7d2	2012-12-20 22:59:36 +0000	[diff] [blame^]	209	This benchmarks is a collection of loops from the GCC autovectorization
Nadav Rotem	8f4a6cc	2012-12-19 18:02:36 +0000	[diff] [blame]	210	`page <http://gcc.gnu.org/projects/tree-ssa/vectorization.html>`_ by Dorit Nuzman.
Nadav Rotem	15bdbbe	2012-12-19 08:28:24 +0000	[diff] [blame]	211
Nadav Rotem	12da396	2012-12-20 00:03:36 +0000	[diff] [blame]	212	The chart below compares GCC-4.7, ICC-13, and Clang-SVN with and without loop vectorization at -O3, tuned for "corei7-avx", running on a Sandybridge iMac.
Sean Silva	287e7d2	2012-12-20 22:59:36 +0000	[diff] [blame^]	213	The Y-axis shows the time in msec. Lower is better. The last column shows the geomean of all the kernels.
Nadav Rotem	15bdbbe	2012-12-19 08:28:24 +0000	[diff] [blame]	214
				215	.. image:: gcc-loops.png
Sean Silva	fd706f7	2012-12-20 22:24:37 +0000	[diff] [blame]	216	:width: 100%
Nadav Rotem	15bdbbe	2012-12-19 08:28:24 +0000	[diff] [blame]	217
Sean Silva	99e12f9	2012-12-20 22:42:20 +0000	[diff] [blame]	218	.. _bb-vectorizer:
				219
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	220	The Basic Block Vectorizer
				221	==========================
				222
Nadav Rotem	0328f5e	2012-12-19 18:04:44 +0000	[diff] [blame]	223	Usage
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	224	------
Nadav Rotem	0328f5e	2012-12-19 18:04:44 +0000	[diff] [blame]	225
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	226	The Basic Block Vectorizer is not enabled by default, but it can be enabled
				227	through clang using the command line flag:
				228
				229	.. code-block:: console
				230
Sean Silva	287e7d2	2012-12-20 22:59:36 +0000	[diff] [blame^]	231	$ clang -fslp-vectorize file.c
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	232
Nadav Rotem	0328f5e	2012-12-19 18:04:44 +0000	[diff] [blame]	233	Details
Sean Silva	08fd088	2012-12-20 02:40:45 +0000	[diff] [blame]	234	-------
Nadav Rotem	0328f5e	2012-12-19 18:04:44 +0000	[diff] [blame]	235
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	236	The goal of basic-block vectorization (a.k.a. superword-level parallelism) is
				237	to combine similar independent instructions within simple control-flow regions
				238	into vector instructions. Memory accesses, arithemetic operations, comparison
				239	operations and some math functions can all be vectorized using this technique
Sean Silva	287e7d2	2012-12-20 22:59:36 +0000	[diff] [blame^]	240	(subject to the capabilities of the target architecture).
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	241
				242	For example, the following function performs very similar operations on its
				243	inputs (a1, b1) and (a2, b2). The basic-block vectorizer may combine these
				244	into vector operations.
				245
				246	.. code-block:: c++
				247
				248	int foo(int a1, int a2, int b1, int b2) {
				249	int r1 = a1(a1 + b1)/b1 + 50b1/a1;
				250	int r2 = a2(a2 + b2)/b2 + 50b2/a2;
				251	return r1 + r2;
				252	}
				253
				254