blob: 9507fd6953bff43212987522957b9371e65314e7 [file] [log] [blame]
Nadav Rotemc4efbb82012-12-19 07:22:24 +00001==========================
2Auto-Vectorization in LLVM
3==========================
4
5LLVM has two vectorizers: The *Loop Vectorizer*, which operates on Loops,
6and the *Basic Block Vectorizer*, which optimizes straight-line code. These
7vectorizers focus on different optimization opportunities and use different
8techniques. The BB vectorizer merges multiple scalars that are found in the
9code into vectors while the Loop Vectorizer widens instructions in the
10original loop to operate on multiple consecutive loop iterations.
11
12The Loop Vectorizer
13===================
14
15LLVMs Loop Vectorizer is now available and will be useful for many people.
16It is not enabled by default, but can be enabled through clang using the
17command line flag:
18
19.. code-block:: console
20
Nadav Rotem8f4a6cc2012-12-19 18:02:36 +000021 $ clang -fvectorize -O3 file.c
22
23If the ``-fvectorize`` flag is used then the loop vectorizer will be enabled
24when running with ``-O3``, ``-O2``. When ``-Os`` is used, the loop vectorizer
25will only vectorize loops that do not require a major increase in code size.
Nadav Rotemc4efbb82012-12-19 07:22:24 +000026
27We plan to enable the Loop Vectorizer by default as part of the LLVM 3.3 release.
28
29Features
30^^^^^^^^^
31
32The LLVM Loop Vectorizer has a number of features that allow it to vectorize
33complex loops.
34
35Loops with unknown trip count
36------------------------------
37
38The Loop Vectorizer supports loops with an unknown trip count.
39In the loop below, the iteration ``start`` and ``finish`` points are unknown,
40and the Loop Vectorizer has a mechanism to vectorize loops that do not start
41at zero. In this example, n may not be a multiple of the vector width, and
42the vectorizer has to execute the last few iterations as scalar code. Keeping
43a scalar copy of the loop increases the code size.
44
45.. code-block:: c++
46
47 void bar(float *A, float* B, float K, int start, int end) {
48 for (int i = start; i < end; ++i)
49 A[i] *= B[i] + K;
50 }
51
52Runtime Checks of Pointers
53--------------------------
54
55In the example below, if the pointers A and B point to consecutive addresses,
56then it is illegal to vectorize the code because some elements of A will be
57written before they are read from array B.
58
59Some programmers use the 'restrict' keyword to notify the compiler that the
60pointers are disjointed, but in our example, the Loop Vectorizer has no way of
61knowing that the pointers A and B are unique. The Loop Vectorizer handles this
62loop by placing code that checks, at runtime, if the arrays A and B point to
63disjointed memory locations. If arrays A and B overlap, then the scalar version
64of the loop is executed.
65
66.. code-block:: c++
67
68 void bar(float *A, float* B, float K, int n) {
69 for (int i = 0; i < n; ++i)
70 A[i] *= B[i] + K;
71 }
72
73
74Reductions
75--------------------------
76
77In this example the ``sum`` variable is used by consecutive iterations of
78the loop. Normally, this would prevent vectorization, but the vectorizer can
79detect that sum is a reduction variable. The variable sum becomes a vector
80of integers, and at the end of the loop the elements of the array are added
81together to create the correct result. We support a number of different
82reduction operations, such as addition, multiplication, XOR, AND and OR.
83
84.. code-block:: c++
85
86 int foo(int *A, int *B, int n) {
87 unsigned sum = 0;
88 for (int i = 0; i < n; ++i)
89 sum += A[i] + 5;
90 return sum;
91 }
92
93Inductions
94--------------------------
95
96In this example the value of the induction variable ``i`` is saved into an
97array. The Loop Vectorizer knows to vectorize induction variables.
98
99.. code-block:: c++
100
101 void bar(float *A, float* B, float K, int n) {
102 for (int i = 0; i < n; ++i)
103 A[i] = i;
104 }
105
106If Conversion
107--------------------------
108
109The Loop Vectorizer is able to "flatten" the IF statement in the code and
110generate a single stream of instructions. The Loop Vectorizer supports any
111control flow in the innermost loop. The innermost loop may contain complex
112nesting of IFs, ELSEs and even GOTOs.
113
114.. code-block:: c++
115
116 int foo(int *A, int *B, int n) {
117 unsigned sum = 0;
118 for (int i = 0; i < n; ++i)
119 if (A[i] > B[i])
120 sum += A[i] + 5;
121 return sum;
122 }
123
124Pointer Induction Variables
Nadav Rotemaf14a3f2012-12-19 07:36:35 +0000125---------------------------
Nadav Rotemc4efbb82012-12-19 07:22:24 +0000126
127This example uses the "accumulate" function of the standard c++ library. This
128loop uses C++ iterators, which are pointers, and not integer indices.
129The Loop Vectorizer detects pointer induction variables and can vectorize
130this loop. This feature is important because many C++ programs use iterators.
131
132.. code-block:: c++
133
134 int baz(int *A, int n) {
135 return std::accumulate(A, A + n, 0);
136 }
137
138Reverse Iterators
139--------------------------
140
141The Loop Vectorizer can vectorize loops that count backwards.
142
143.. code-block:: c++
144
145 int foo(int *A, int *B, int n) {
146 for (int i = n; i > 0; --i)
147 A[i] +=1;
148 }
149
150Scatter / Gather
Nadav Rotemaf14a3f2012-12-19 07:36:35 +0000151----------------
Nadav Rotemc4efbb82012-12-19 07:22:24 +0000152
Nadav Rotemaf14a3f2012-12-19 07:36:35 +0000153The Loop Vectorizer can vectorize code that becomes scatter/gather
154memory accesses.
Nadav Rotemc4efbb82012-12-19 07:22:24 +0000155
156.. code-block:: c++
157
158 int foo(int *A, int *B, int n, int k) {
159 for (int i = 0; i < n; ++i)
160 A[i*7] += B[i*k];
161 }
162
Nadav Rotemaf14a3f2012-12-19 07:36:35 +0000163Vectorization of Mixed Types
Nadav Rotemc4efbb82012-12-19 07:22:24 +0000164--------------------------
165
166The Loop Vectorizer can vectorize programs with mixed types. The Vectorizer
167cost model can estimate the cost of the type conversion and decide if
168vectorization is profitable.
169
170.. code-block:: c++
171
172 int foo(int *A, char *B, int n, int k) {
173 for (int i = 0; i < n; ++i)
174 A[i] += 4 * B[i];
175 }
176
177Vectorization of function calls
178--------------------------
179
180The Loop Vectorize can vectorize intrinsic math functions.
181See the table below for a list of these functions.
182
183+-----+-----+---------+
184| pow | exp | exp2 |
185+-----+-----+---------+
186| sin | cos | sqrt |
187+-----+-----+---------+
188| log |log2 | log10 |
189+-----+-----+---------+
190|fabs |floor| ceil |
191+-----+-----+---------+
192|fma |trunc|nearbyint|
193+-----+-----+---------+
194
Nadav Rotem15bdbbe2012-12-19 08:28:24 +0000195Performance
196^^^^^^^^^^^
197
198This section shows the the execution time of Clang on a simple benchmark:
Nadav Rotem90c8b4b2012-12-19 08:43:05 +0000199`gcc-loops <http://llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/UnitTests/Vectorizer/>`_.
Nadav Rotem15bdbbe2012-12-19 08:28:24 +0000200This benchmarks is a collection of loops from the GCC autovectorization
Nadav Rotem8f4a6cc2012-12-19 18:02:36 +0000201`page <http://gcc.gnu.org/projects/tree-ssa/vectorization.html>`_ by Dorit Nuzman.
Nadav Rotem15bdbbe2012-12-19 08:28:24 +0000202
203The chart below compares GCC-4.7, ICC-13, and Clang-SVN at -O3, running on a Sandybridge.
204The Y-axis shows time in msec. Lower is better.
205
206.. image:: gcc-loops.png
207
Nadav Rotemc4efbb82012-12-19 07:22:24 +0000208The Basic Block Vectorizer
209==========================
210
211The Basic Block Vectorizer is not enabled by default, but it can be enabled
212through clang using the command line flag:
213
214.. code-block:: console
215
216 $ clang -fslp-vectorize file.c
217
218The goal of basic-block vectorization (a.k.a. superword-level parallelism) is
219to combine similar independent instructions within simple control-flow regions
220into vector instructions. Memory accesses, arithemetic operations, comparison
221operations and some math functions can all be vectorized using this technique
222(subject to the capabilities of the target architecture).
223
224For example, the following function performs very similar operations on its
225inputs (a1, b1) and (a2, b2). The basic-block vectorizer may combine these
226into vector operations.
227
228.. code-block:: c++
229
230 int foo(int a1, int a2, int b1, int b2) {
231 int r1 = a1*(a1 + b1)/b1 + 50*b1/a1;
232 int r2 = a2*(a2 + b2)/b2 + 50*b2/a2;
233 return r1 + r2;
234 }
235
236