Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs. Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

FreeBench/mason contains code like this:

typedef struct { int a; int b; int c; } p_type;
extern int m[];
p_type m0u(p_type *p) {
  int m[]={0, 8, 1, 2, 16, 5, 13, 7, 14, 9, 3, 4, 11, 12, 15, 10, 17, 6};
  p_type pu;
  pu.a = m[p->a];
  pu.b = m[p->b];
  pu.c = m[p->c];
  return pu;
}

We currently compile this into a memcpy from a static array into 'm', then
a bunch of loads from m. It would be better to avoid the memcpy and just do
loads from the static array.

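As a sketch (not the compiler's actual output), the result we want is
equivalent to reading a private constant table directly; m_init and m0u_opt
are made-up names for illustration:

static const int m_init[] = {0, 8, 1, 2, 16, 5, 13, 7, 14, 9,
                             3, 4, 11, 12, 15, 10, 17, 6};
p_type m0u_opt(p_type *p) {
  p_type pu;
  pu.a = m_init[p->a];   /* load straight from the constant initializer */
  pu.b = m_init[p->b];
  pu.c = m_init[p->c];
  return pu;
}
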
//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math). Misc/mandel will like this. :)
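
A sketch of the expansion (only valid when errno and extra precision don't
matter):

#include <math.h>
double hypot_fast(double x, double y) { return sqrt(x*x + y*y); }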

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack. But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
  x = 1ULL << i;
}

into:
  long long tmp = 1;
  for (i = ...; ++i, tmp+=tmp)
    x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn: X*X*X*X -> t=(X*X) (t*t) to eliminate a multiply.

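A sketch of the reassociated form (three multiplies become two):

int pow4(int x) { int t = x*x; return t*t; }
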
//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) { return memcmp(j, l, 4); }
int h(int *j, int *l) { return *j - *l; }

this could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize. It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size. It works but can be overly conservative as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should add 'unaligned load/store' nodes, and produce them from code like
this:

typedef float v4sf __attribute__((vector_size(16)));
v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3]};
}

//===---------------------------------------------------------------------===//

We should constant fold vector type casts at the LLVM level, regardless of the
cast. Currently we cannot fold some casts because we don't have TargetData
information in the constant folder, so we don't know the endianness of the
target!

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns. Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2      #cond_next
LBB16_1:        #cond_true
        incl _foo
LBB16_2:        #cond_next

emit:
        movl _foo, %eax
        cmpl $1, %edi
        sbbl $-1, %eax
        movl %eax, _foo

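The kind of source that produces this pattern (a sketch; the function name and
parameter are made up, _foo is the global from the asm above):

extern int foo;
void maybe_inc(int c) {
  if (c)            /* branchy form above; the sbb form adds (c != 0) to foo */
    ++foo;
}
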
//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers. See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

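A sketch of source the first combine would catch (both calls take the same x,
so one sincos call plus two stores would do; the function name is made up):

#include <math.h>
void polar_parts(double x, double *s, double *c) {
  *s = sin(x);   /* candidates for a single sincos(x, s, c) call */
  *c = cos(x);
}
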
//===---------------------------------------------------------------------===//

Scalar Repl cannot currently promote this testcase to 'ret long cst':

        %struct.X = type { int, int }
        %struct.Y = type { %struct.X }
ulong %bar() {
        %retval = alloca %struct.Y, align 8
        %tmp12 = getelementptr %struct.Y* %retval, int 0, uint 0, uint 0
        store int 0, int* %tmp12
        %tmp15 = getelementptr %struct.Y* %retval, int 0, uint 0, uint 1
        store int 1, int* %tmp15
        %retval1 = bitcast %struct.Y* %retval to ulong*
        %retval2 = load ulong* %retval1
        ret ulong %retval2
}

it should be extended to do so.

//===---------------------------------------------------------------------===//

-scalarrepl should promote this to be a vector scalar.

        %struct..0anon = type { <4 x float> }

implementation   ; Functions:

void %test1(<4 x float> %V, float* %P) {
        %u = alloca %struct..0anon, align 16
        %tmp = getelementptr %struct..0anon* %u, int 0, uint 0
        store <4 x float> %V, <4 x float>* %tmp
        %tmp1 = bitcast %struct..0anon* %u to [4 x float]*
        %tmp2 = getelementptr [4 x float]* %tmp1, int 0, int 1
        %tmp4 = load float* %tmp2
        %tmp3 = mul float %tmp4, 2.000000e+00
        store float %tmp3, float* %P
        ret void
}

//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

void %test(uint* %P) {
        %tmp = load uint* %P
        %tmp14 = or uint %tmp, 3305111552
        %tmp15 = and uint %tmp14, 3321888767
        store uint %tmp15, uint* %P
        ret void
}

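In C terms (a sketch): 3305111552 is 0xC5000000 and 3321888767 is 0xC5FFFFFF,
so the or/and pair only changes the most significant byte.

void test(unsigned *P) {
  *P = (*P | 0xC5000000u) & 0xC5FFFFFFu;   /* same as storing 0xC5 into the MSB */
}
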
//===---------------------------------------------------------------------===//

dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.

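A sketch of the equivalence (assuming clz of zero yields the operand width, as
PPC's cntlzw does; in ISO C __builtin_clz(0) is undefined): clz(x) is 32 only
when x == 0, so t>>5 is exactly (x == 0).

int bar_equiv(int x) { return -(x == 0); }   /* 0 for x != 0, -1 for x == 0 */
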
Compile:

int bar(int x)
{
  int t = __builtin_clz(x);
  return -(t>>5);
}

to:

_bar:   addic r3,r3,-1
        subfe r3,r3,r3
        blr

//===---------------------------------------------------------------------===//

Legalize should lower cttz like this:
  cttz(x) = popcnt((x-1) & ~x)

on targets that have popcnt but not cttz. itanium, what else?

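A sketch of the identity ((x-1) & ~x masks exactly the trailing zero bits of x;
for x == 0 it is all ones, matching a trailing-zero count of 32):

unsigned cttz_via_popcnt(unsigned x) {
  return __builtin_popcount((x - 1) & ~x);
}
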
//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

  for(i=0; i<reg->size; i++)
    {
      /* Flip the target bit of each basis state */
      reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
    }

Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be just
so cool to turn it into something like:

  long long Res = ((MAX_UNSIGNED) 1 << target);
  if (target < 32) {
    for(i=0; i<reg->size; i++)
      reg->node[i].state ^= Res & 0xFFFFFFFFULL;
  } else {
    for(i=0; i<reg->size; i++)
      reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
  }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias reg->node[i],
but alas...

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine:

unsigned int swap_32(unsigned int v) {
  v = ((v & 0x00ff00ffU) << 8) | ((v & 0xff00ff00U) >> 8);
  v = ((v & 0x0000ffffU) << 16) | ((v & 0xffff0000U) >> 16);
  return v;
}

Nor is this (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}

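For reference, what both functions compute (a sketch of the byte swap that
instcombine should recognize, written out with plain shifts and masks):

unsigned int bswap32_ref(unsigned int v) {
  return (v >> 24) | ((v >> 8) & 0x0000ff00U) |
         ((v << 8) & 0x00ff0000U) | (v << 24);
}
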
//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned. Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match. See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

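A sketch of the general idea with a plain unsigned divide (the constants are
chosen arbitrarily for illustration):

int cmp_div(unsigned x) {
  return (x / 10) < 5;   /* should fold to: x < 50 */
}
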
//===---------------------------------------------------------------------===//

Instcombine misses several of these cases (see the testcase in the patch):
http://gcc.gnu.org/ml/gcc-patches/2006-10/msg01519.html

//===---------------------------------------------------------------------===//

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level. We need a "loops to memcpy"
pass.

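The shape of loop such a pass would target (a sketch; the names are made up,
not taken from the viterbi source, and the ranges are assumed not to overlap):

void copy_history(int *dst, const int *src, int n) {
  for (int i = 0; i < n; i++)
    dst[i] = src[i];
  /* => memcpy(dst, src, n * sizeof(int)); */
}
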
//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used. On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags. On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

//===---------------------------------------------------------------------===//

Promote for i32 bswap can use i64 bswap + shr. Useful on targets with 64-bit
regs and bswap, like itanium.

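A sketch of the promoted form (using the GCC/Clang builtins purely for
illustration):

unsigned bswap32_via_i64(unsigned x) {
  return (unsigned)(__builtin_bswap64((unsigned long long)x) >> 32);
}
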
//===---------------------------------------------------------------------===//