Blame - llvm/docs/Vectorizers.rst - toolchain/llvm-project

blob: f585c630f94ef5bcde4747850ecbf4ab9bb24585 [file] [log] [blame]

Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	1	==========================
				2	Auto-Vectorization in LLVM
				3	==========================
				4
				5	LLVM has two vectorizers: The Loop Vectorizer, which operates on Loops,
				6	and the Basic Block Vectorizer, which optimizes straight-line code. These
				7	vectorizers focus on different optimization opportunities and use different
				8	techniques. The BB vectorizer merges multiple scalars that are found in the
				9	code into vectors while the Loop Vectorizer widens instructions in the
				10	original loop to operate on multiple consecutive loop iterations.
				11
				12	The Loop Vectorizer
				13	===================
				14
Nadav Rotem	0328f5e	2012-12-19 18:04:44 +0000	[diff] [blame^]	15	Usage
				16	^^^^^^
				17
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	18	LLVM’s Loop Vectorizer is now available and will be useful for many people.
				19	It is not enabled by default, but can be enabled through clang using the
				20	command line flag:
				21
				22	.. code-block:: console
				23
Nadav Rotem	8f4a6cc	2012-12-19 18:02:36 +0000	[diff] [blame]	24	$ clang -fvectorize -O3 file.c
				25
				26	If the ``-fvectorize`` flag is used then the loop vectorizer will be enabled
				27	when running with ``-O3``, ``-O2``. When ``-Os`` is used, the loop vectorizer
				28	will only vectorize loops that do not require a major increase in code size.
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	29
				30	We plan to enable the Loop Vectorizer by default as part of the LLVM 3.3 release.
				31
				32	Features
				33	^^^^^^^^^
				34
				35	The LLVM Loop Vectorizer has a number of features that allow it to vectorize
				36	complex loops.
				37
				38	Loops with unknown trip count
				39	------------------------------
				40
				41	The Loop Vectorizer supports loops with an unknown trip count.
				42	In the loop below, the iteration ``start`` and ``finish`` points are unknown,
				43	and the Loop Vectorizer has a mechanism to vectorize loops that do not start
				44	at zero. In this example, ‘n’ may not be a multiple of the vector width, and
				45	the vectorizer has to execute the last few iterations as scalar code. Keeping
				46	a scalar copy of the loop increases the code size.
				47
				48	.. code-block:: c++
				49
				50	void bar(float A, float B, float K, int start, int end) {
				51	for (int i = start; i < end; ++i)
				52	A[i] *= B[i] + K;
				53	}
				54
				55	Runtime Checks of Pointers
				56	--------------------------
				57
				58	In the example below, if the pointers A and B point to consecutive addresses,
				59	then it is illegal to vectorize the code because some elements of A will be
				60	written before they are read from array B.
				61
				62	Some programmers use the 'restrict' keyword to notify the compiler that the
				63	pointers are disjointed, but in our example, the Loop Vectorizer has no way of
				64	knowing that the pointers A and B are unique. The Loop Vectorizer handles this
				65	loop by placing code that checks, at runtime, if the arrays A and B point to
				66	disjointed memory locations. If arrays A and B overlap, then the scalar version
				67	of the loop is executed.
				68
				69	.. code-block:: c++
				70
				71	void bar(float A, float B, float K, int n) {
				72	for (int i = 0; i < n; ++i)
				73	A[i] *= B[i] + K;
				74	}
				75
				76
				77	Reductions
				78	--------------------------
				79
				80	In this example the ``sum`` variable is used by consecutive iterations of
				81	the loop. Normally, this would prevent vectorization, but the vectorizer can
				82	detect that ‘sum’ is a reduction variable. The variable ‘sum’ becomes a vector
				83	of integers, and at the end of the loop the elements of the array are added
				84	together to create the correct result. We support a number of different
				85	reduction operations, such as addition, multiplication, XOR, AND and OR.
				86
				87	.. code-block:: c++
				88
				89	int foo(int A, int B, int n) {
				90	unsigned sum = 0;
				91	for (int i = 0; i < n; ++i)
				92	sum += A[i] + 5;
				93	return sum;
				94	}
				95
				96	Inductions
				97	--------------------------
				98
				99	In this example the value of the induction variable ``i`` is saved into an
				100	array. The Loop Vectorizer knows to vectorize induction variables.
				101
				102	.. code-block:: c++
				103
				104	void bar(float A, float B, float K, int n) {
				105	for (int i = 0; i < n; ++i)
				106	A[i] = i;
				107	}
				108
				109	If Conversion
				110	--------------------------
				111
				112	The Loop Vectorizer is able to "flatten" the IF statement in the code and
				113	generate a single stream of instructions. The Loop Vectorizer supports any
				114	control flow in the innermost loop. The innermost loop may contain complex
				115	nesting of IFs, ELSEs and even GOTOs.
				116
				117	.. code-block:: c++
				118
				119	int foo(int A, int B, int n) {
				120	unsigned sum = 0;
				121	for (int i = 0; i < n; ++i)
				122	if (A[i] > B[i])
				123	sum += A[i] + 5;
				124	return sum;
				125	}
				126
				127	Pointer Induction Variables
Nadav Rotem	af14a3f	2012-12-19 07:36:35 +0000	[diff] [blame]	128	---------------------------
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	129
				130	This example uses the "accumulate" function of the standard c++ library. This
				131	loop uses C++ iterators, which are pointers, and not integer indices.
				132	The Loop Vectorizer detects pointer induction variables and can vectorize
				133	this loop. This feature is important because many C++ programs use iterators.
				134
				135	.. code-block:: c++
				136
				137	int baz(int *A, int n) {
				138	return std::accumulate(A, A + n, 0);
				139	}
				140
				141	Reverse Iterators
				142	--------------------------
				143
				144	The Loop Vectorizer can vectorize loops that count backwards.
				145
				146	.. code-block:: c++
				147
				148	int foo(int A, int B, int n) {
				149	for (int i = n; i > 0; --i)
				150	A[i] +=1;
				151	}
				152
				153	Scatter / Gather
Nadav Rotem	af14a3f	2012-12-19 07:36:35 +0000	[diff] [blame]	154	----------------
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	155
Nadav Rotem	af14a3f	2012-12-19 07:36:35 +0000	[diff] [blame]	156	The Loop Vectorizer can vectorize code that becomes scatter/gather
				157	memory accesses.
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	158
				159	.. code-block:: c++
				160
				161	int foo(int A, int B, int n, int k) {
				162	for (int i = 0; i < n; ++i)
				163	A[i7] += B[ik];
				164	}
				165
Nadav Rotem	af14a3f	2012-12-19 07:36:35 +0000	[diff] [blame]	166	Vectorization of Mixed Types
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	167	--------------------------
				168
				169	The Loop Vectorizer can vectorize programs with mixed types. The Vectorizer
				170	cost model can estimate the cost of the type conversion and decide if
				171	vectorization is profitable.
				172
				173	.. code-block:: c++
				174
				175	int foo(int A, char B, int n, int k) {
				176	for (int i = 0; i < n; ++i)
				177	A[i] += 4 * B[i];
				178	}
				179
				180	Vectorization of function calls
				181	--------------------------
				182
				183	The Loop Vectorize can vectorize intrinsic math functions.
				184	See the table below for a list of these functions.
				185
				186	+-----+-----+---------+
				187	\| pow \| exp \| exp2 \|
				188	+-----+-----+---------+
				189	\| sin \| cos \| sqrt \|
				190	+-----+-----+---------+
				191	\| log \|log2 \| log10 \|
				192	+-----+-----+---------+
				193	\|fabs \|floor\| ceil \|
				194	+-----+-----+---------+
				195	\|fma \|trunc\|nearbyint\|
				196	+-----+-----+---------+
				197
Nadav Rotem	15bdbbe	2012-12-19 08:28:24 +0000	[diff] [blame]	198	Performance
				199	^^^^^^^^^^^
				200
				201	This section shows the the execution time of Clang on a simple benchmark:
Nadav Rotem	90c8b4b	2012-12-19 08:43:05 +0000	[diff] [blame]	202	`gcc-loops <http://llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/UnitTests/Vectorizer/>`_.
Nadav Rotem	15bdbbe	2012-12-19 08:28:24 +0000	[diff] [blame]	203	This benchmarks is a collection of loops from the GCC autovectorization
Nadav Rotem	8f4a6cc	2012-12-19 18:02:36 +0000	[diff] [blame]	204	`page <http://gcc.gnu.org/projects/tree-ssa/vectorization.html>`_ by Dorit Nuzman.
Nadav Rotem	15bdbbe	2012-12-19 08:28:24 +0000	[diff] [blame]	205
				206	The chart below compares GCC-4.7, ICC-13, and Clang-SVN at -O3, running on a Sandybridge.
				207	The Y-axis shows time in msec. Lower is better.
				208
				209	.. image:: gcc-loops.png
				210
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	211	The Basic Block Vectorizer
				212	==========================
				213
Nadav Rotem	0328f5e	2012-12-19 18:04:44 +0000	[diff] [blame^]	214	Usage
				215	^^^^^^
				216
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	217	The Basic Block Vectorizer is not enabled by default, but it can be enabled
				218	through clang using the command line flag:
				219
				220	.. code-block:: console
				221
				222	$ clang -fslp-vectorize file.c
				223
Nadav Rotem	0328f5e	2012-12-19 18:04:44 +0000	[diff] [blame^]	224	Details
				225	^^^^^^^
				226
Nadav Rotem	c4efbb8	2012-12-19 07:22:24 +0000	[diff] [blame]	227	The goal of basic-block vectorization (a.k.a. superword-level parallelism) is
				228	to combine similar independent instructions within simple control-flow regions
				229	into vector instructions. Memory accesses, arithemetic operations, comparison
				230	operations and some math functions can all be vectorized using this technique
				231	(subject to the capabilities of the target architecture).
				232
				233	For example, the following function performs very similar operations on its
				234	inputs (a1, b1) and (a2, b2). The basic-block vectorizer may combine these
				235	into vector operations.
				236
				237	.. code-block:: c++
				238
				239	int foo(int a1, int a2, int b1, int b2) {
				240	int r1 = a1(a1 + b1)/b1 + 50b1/a1;
				241	int r2 = a2(a2 + b2)/b2 + 50b2/a2;
				242	return r1 + r2;
				243	}
				244
				245