Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent.

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)
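
A minimal sketch of the intended rewrite at the C level (function names are
hypothetical; the front-end would really emit llvm.sqrt directly):

#include <math.h>

double dist(double x, double y) {
  return hypot(x, y);          /* library call; may set errno */
}

/* What the front-end could emit instead under -ffast-math: */
double dist_fast(double x, double y) {
  return sqrt(x * x + y * y);  /* maps onto llvm.sqrt, no errno */
}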

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in cases where it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack, but that is a short-term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
   x = 1ULL << i;
}

into:
   long long tmp = 1;
   for (i = ...; ++i, tmp+=tmp)
     x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn: X*X*X*X -> t=(X*X) (t*t) to eliminate a multiply.
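
A minimal C sketch of the before/after (function names are hypothetical):

unsigned pow4(unsigned x) {
  return x * x * x * x;    /* three multiplies as written */
}

unsigned pow4_reassoc(unsigned x) {
  unsigned t = x * x;      /* what reassociate should produce */
  return t * t;            /* two multiplies */
}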

//===---------------------------------------------------------------------===//

An interesting testcase for add/shift/mul reassociation:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

Reassociate should handle the example in GCC PR16157.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) { return memcmp(j, l, 4); }
int h(int *j, int *l) { return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1, 2, 4, and 8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works but can be overly conservative, as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should add 'unaligned load/store' nodes, and produce them from code like
this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2      # cond_next
LBB16_1:        # cond_true
        incl _foo
LBB16_2:        # cond_next

emit:
        movl _foo, %eax
        cmpl $1, %edi
        sbbl $-1, %eax
        movl %eax, _foo
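
For reference, a C sketch of the kind of source pattern involved (foo and x
are hypothetical names):

extern int foo;

void maybe_inc(int x) {
  if (x)
    foo++;        /* equivalent to foo += (x != 0); lowers to the cmp/sbb above */
}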

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
  double sincos(double x, double *sin, double *cos);
  float sincosf(float x, float *sin, float *cos);
  long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687
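
A rough C sketch of the expansion direction, assuming a GNU-style sincos
(needs _GNU_SOURCE); use() is a placeholder consumer:

#define _GNU_SOURCE           /* for sincos, a GNU extension */
#include <math.h>

void use(double s, double c); /* placeholder */

/* As written: one sincos call, results returned through pointers. */
void before(double x) {
  double s, c;
  sincos(x, &s, &c);
  use(s, c);
}

/* After expansion into plain sin/cos calls and stores, the address-taken
   locals disappear and SROA/mem2reg can promote s and c to SSA values. */
void after(double x) {
  double s = sin(x);
  double c = cos(x);
  use(s, c);
}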

//===---------------------------------------------------------------------===//

Scalar Repl cannot currently promote this testcase to 'ret long cst':

        %struct.X = type { i32, i32 }
        %struct.Y = type { %struct.X }

define i64 @bar() {
        %retval = alloca %struct.Y, align 8
        %tmp12 = getelementptr %struct.Y* %retval, i32 0, i32 0, i32 0
        store i32 0, i32* %tmp12
        %tmp15 = getelementptr %struct.Y* %retval, i32 0, i32 0, i32 1
        store i32 1, i32* %tmp15
        %retval.upgrd.1 = bitcast %struct.Y* %retval to i64*
        %retval.upgrd.2 = load i64* %retval.upgrd.1
        ret i64 %retval.upgrd.2
}

It should be extended to do so.

//===---------------------------------------------------------------------===//

-scalarrepl should promote this to be a vector scalar.

        %struct..0anon = type { <4 x float> }

define void @test1(<4 x float> %V, float* %P) {
        %u = alloca %struct..0anon, align 16
        %tmp = getelementptr %struct..0anon* %u, i32 0, i32 0
        store <4 x float> %V, <4 x float>* %tmp
        %tmp1 = bitcast %struct..0anon* %u to [4 x float]*
        %tmp.upgrd.1 = getelementptr [4 x float]* %tmp1, i32 0, i32 1
        %tmp.upgrd.2 = load float* %tmp.upgrd.1
        %tmp3 = mul float %tmp.upgrd.2, 2.000000e+00
        store float %tmp3, float* %P
        ret void
}

//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

void %test(uint* %P) {
        %tmp = load uint* %P
        %tmp14 = or uint %tmp, 3305111552
        %tmp15 = and uint %tmp14, 3321888767
        store uint %tmp15, uint* %P
        ret void
}

//===---------------------------------------------------------------------===//

dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.

Compile:

int bar(int x)
{
  int t = __builtin_clz(x);
  return -(t>>5);
}

to:

_bar:   addic r3,r3,-1
        subfe r3,r3,r3
        blr

//===---------------------------------------------------------------------===//

Legalize should lower ctlz (and cttz) using popcnt on targets that have popcnt
but not ctlz/cttz.  Itanium, what else?

  ctlz(x) = popcnt(~(x | x>>1 | x>>2 | x>>4 | x>>8 | x>>16))   (for i32)
  cttz(x) = popcnt((x-1) & ~x)
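
A small C sketch of these i32 lowerings, using the GCC/Clang builtin to stand
in for the popcnt node:

#include <stdint.h>

unsigned ctlz32(uint32_t x) {
  /* Smear the highest set bit downward, then count the zero bits above it. */
  x |= x >> 1;  x |= x >> 2;  x |= x >> 4;  x |= x >> 8;  x |= x >> 16;
  return __builtin_popcount(~x);             /* 32 when x == 0 */
}

unsigned cttz32(uint32_t x) {
  /* (x-1) & ~x is a mask of the bits below the lowest set bit. */
  return __builtin_popcount((x - 1) & ~x);   /* 32 when x == 0 */
}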

//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
        {
          /* Flip the target bit of each basis state */
          reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }

Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be
just so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias reg->node[i],
but alas...

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine:

unsigned int swap_32(unsigned int v) {
  v = ((v & 0x00ff00ffU) << 8)  | ((v & 0xff00ff00U) >> 8);
  v = ((v & 0x0000ffffU) << 16) | ((v & 0xffff0000U) >> 16);
  return v;
}

Nor is this (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.
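
A hypothetical example of the kind of compare-of-division construct meant here
(not taken from hash/hash2 itself):

int in_bucket(int x) {      /* assumes x is known non-negative here */
  return (x / 10) == 5;     /* sdiv feeding an icmp; could become the
                               range check (unsigned)(x - 50) < 10 */
}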

//===---------------------------------------------------------------------===//

Instcombine misses several of these cases (see the testcase in the patch):
http://gcc.gnu.org/ml/gcc-patches/2006-10/msg01519.html

//===---------------------------------------------------------------------===//

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level.  We need a "loops to memcpy"
pass.
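
A hedged sketch of the kind of copy loop meant here; the names and sizes are
made up, not taken from viterbi itself:

#include <string.h>

#define NSTATES 64

void save_history(int dst[NSTATES], const int src[NSTATES]) {
  /* As written in the source: an element-by-element copy loop... */
  for (int i = 0; i < NSTATES; ++i)
    dst[i] = src[i];
  /* ...which a "loops to memcpy" pass should turn into:
       memcpy(dst, src, NSTATES * sizeof(int));                  */
}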

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means the function
uses one more register when effective_addr2 is declared U64 than when it is
declared U32.

//===---------------------------------------------------------------------===//

Promote for i32 bswap can use i64 bswap + shr.  Useful on targets with 64-bit
regs and bswap, like Itanium.
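
A minimal C sketch of the identity this relies on (using the GCC/Clang builtin
for the i64 bswap):

#include <stdint.h>

uint32_t bswap32_via_64(uint32_t x) {
  /* Zero-extend to 64 bits, byte-swap, and shift the result back down:
     bswap32(x) == bswap64((uint64_t)x) >> 32. */
  return (uint32_t)(__builtin_bswap64((uint64_t)x) >> 32);
}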

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two identical IV's (after promotion) on PPC/ARM:

LBB1_1: @bb.preheader
        mov r3, #0
        mov r2, r3
        mov r1, r3
LBB1_2: @bb
        ldr r12, LCPI1_0
        ldr r12, [r12]
        strh r2, [r12]
        ldr r12, LCPI1_1
        ldr r12, [r12]
        strh r3, [r12]
        add r1, r1, #1    <- [0,+,1]
        add r3, r3, #4
        add r2, r2, #1    <- [0,+,1]
        cmp r1, r0
        bne LBB1_2 @bb

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-upgrade < %s | llvm-as | opt -tailcallelim | llvm-dis | not grep call

int %t4(int %a) {
entry:
        %tmp.1 = and int %a, 1
        %tmp.2 = cast int %tmp.1 to bool
        br bool %tmp.2, label %then.0, label %else.0

then.0:
        %tmp.5 = add int %a, -1
        %tmp.3 = call int %t4( int %tmp.5 )
        br label %return

else.0:
        %tmp.7 = setne int %a, 0
        br bool %tmp.7, label %then.1, label %return

then.1:
        %tmp.11 = add int %a, -2
        %tmp.9 = call int %t4( int %tmp.11 )
        br label %return

return:
        %result.0 = phi int [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                            [ %tmp.9, %then.1 ]
        ret int %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination is not transforming this function, because it is
returning n, which fails the isDynamicConstant check in the accumulator
recursion checks.

long long fib(const long long n) {
  switch(n) {
    case 0:
    case 1:
      return n;
    default:
      return fib(n-1) + fib(n-2);
  }
}

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-upgrade < %s | llvm-as | opt -argpromotion | llvm-dis | grep x.val

implementation   ; Functions:

internal int %foo(int* %x) {
entry:
        %tmp = load int* %x
        %tmp.foo = call int %foo(int* %x)
        ret int %tmp.foo
}

int %bar(int* %x) {
entry:
        %tmp3 = call int %foo( int* %x )                ; <int> [#uses=1]
        ret int %tmp3
}

//===---------------------------------------------------------------------===//

"basicaa" should know how to look through "or" instructions that act like add
instructions.  For example in this code, the x*4+1 is turned into x*4 | 1, and
basicaa can't analyze the array subscript, leading to duplicated loads in the
generated code:

void test(int X, int Y, int a[]) {
  int i;
  for (i=2; i<1000; i+=4) {
    a[i+0] = a[i-1+0]*a[i-2+0];
    a[i+1] = a[i-1+1]*a[i-2+1];
    a[i+2] = a[i-1+2]*a[i-2+2];
    a[i+3] = a[i-1+3]*a[i-2+3];
  }
}

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

We compile this to:
_foo:
        subl $28, %esp
        call "L1$pb"
"L1$pb":
        popl %eax
        cmpl $0, 32(%esp)
        je LBB1_2       # cond_true
LBB1_1: # return
        # ...
        addl $28, %esp
        ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

Whole-function isel would also handle these cases.

//===---------------------------------------------------------------------===//