Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)  This
isn't safe in general, even on darwin.  See the libm implementation of hypot
for examples (which special case when x/y are exactly zero to get signed zeros
etc right).
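
A sketch of the intended expansion (hypothetical function names; the fast
form is only acceptable under -ffast-math, since x*x+y*y overflows for large
inputs where hypot does not):

#include <math.h>

double dist(double x, double y) {
  return hypot(x, y);        /* libm call; handles overflow, errno, signed zeros */
}

double dist_fast(double x, double y) {
  return sqrt(x*x + y*y);    /* maps onto llvm.sqrt; loses those guarantees */
}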

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack.  But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
   x = 1ULL << i;

into:
 long long tmp = 1;
 for (i = ...; ++i, tmp+=tmp)
   x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

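A source-level sketch of the pattern (hypothetical example; the shrunk form
loads only the byte holding the sign bit, and which byte that is depends on
endianness):

int is_negative(int *P) {
  return *P < 0;    /* only the sign bit of the 4-byte load is used */
}
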
//===---------------------------------------------------------------------===//

Reassociate should turn: X*X*X*X -> t=(X*X) (t*t) to eliminate a multiply.

//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

Reassociate should handle the example in GCC PR16157.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) { return memcmp(j, l, 4); }
int h(int *j, int *l) { return *j - *l; }

this could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works but can be overly conservative as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

typedef float v4sf __attribute__((vector_size(16)));  /* assumed definition of v4sf */

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2      #cond_next
LBB16_1:        #cond_true
        incl _foo
LBB16_2:        #cond_next

emit:
        movl _foo, %eax
        cmpl $1, %edi
        sbbl $-1, %eax
        movl %eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      void sincos(double x, double *sin, double *cos);
      void sincosf(float x, float *sin, float *cos);
      void sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs.  We could even make an intrinsic for this
if anyone cared enough about sincos.

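For example (sincos is a GNU libm extension; a minimal sketch of the combine):

#define _GNU_SOURCE
#include <math.h>

void polar(double t, double *s, double *c) {
  *s = sin(t);
  *c = cos(t);    /* sin+cos of the same operand -> sincos(t, s, c) */
}
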
//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

define void @test(i32* %P) {
  %tmp = load i32* %P
  %tmp14 = or i32 %tmp, 3305111552
  %tmp15 = and i32 %tmp14, 3321888767
  store i32 %tmp15, i32* %P
  ret void
}

//===---------------------------------------------------------------------===//

dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.

Compile:

int bar(int x)
{
  int t = __builtin_clz(x);
  return -(t>>5);
}

to:

_bar:   addic r3,r3,-1
        subfe r3,r3,r3
        blr

//===---------------------------------------------------------------------===//
199
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000200quantum_sigma_x in 462.libquantum contains the following loop:
201
202 for(i=0; i<reg->size; i++)
203 {
204 /* Flip the target bit of each basis state */
205 reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
206 }
207
208Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be just
209so cool to turn it into something like:
210
Chris Lattnerb33a42a2006-09-18 04:54:35 +0000211 long long Res = ((MAX_UNSIGNED) 1 << target);
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000212 if (target < 32) {
213 for(i=0; i<reg->size; i++)
Chris Lattnerb33a42a2006-09-18 04:54:35 +0000214 reg->node[i].state ^= Res & 0xFFFFFFFFULL;
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000215 } else {
216 for(i=0; i<reg->size; i++)
Chris Lattnerb33a42a2006-09-18 04:54:35 +0000217 reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000218 }
219
220... which would only do one 32-bit XOR per loop iteration instead of two.
221
222It would also be nice to recognize the reg->size doesn't alias reg->node[i], but
Chris Lattnerfaa6adf2009-09-21 06:04:07 +0000223alas.
224
225//===---------------------------------------------------------------------===//

This should be optimized to one 'and' and one 'or', from PR4216:

define i32 @test_bitfield(i32 %bf.prev.low) nounwind ssp {
entry:
  %bf.prev.lo.cleared10 = or i32 %bf.prev.low, 32962      ; <i32> [#uses=1]
  %0 = and i32 %bf.prev.low, -65536                       ; <i32> [#uses=1]
  %1 = and i32 %bf.prev.lo.cleared10, 40186               ; <i32> [#uses=1]
  %2 = or i32 %1, %0                                      ; <i32> [#uses=1]
  ret i32 %2
}

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
  unsigned t;
  t = v ^ ((v << 16) | (v >> 16));
  t &= ~0xff0000;
  v = (v << 24) | (v >> 8);
  return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

typedef unsigned long long BITBOARD;
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

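The transform amounts to rewriting the divide-and-compare as a range check on
X.  A hypothetical example of the idea for unsigned operands:

int in_forties(unsigned x) {
  /* (x / 10) == 4  holds exactly when  40 <= x <= 49 */
  return x / 10 == 4;    /* could become:  x - 40 < 10  (unsigned compare) */
}
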
//===---------------------------------------------------------------------===//

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level.  We need a "loops to memcpy"
pass.

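That is, recognizing patterns like this (a hypothetical illustration):

void copy_history(int *dst, const int *src, int n) {
  int i;
  for (i = 0; i < n; i++)    /* -> memcpy(dst, src, n * sizeof(int)) for n > 0 */
    dst[i] = src[i];
}
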
//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two nearly identical IVs (after promotion) on PPC/ARM:

LBB1_2:
        ldr r3, LCPI1_0
        ldr r3, [r3]
        strh r2, [r3]
        ldr r3, LCPI1_1
        ldr r3, [r3]
        strh r1, [r3]
        add r1, r1, #4
        add r2, r2, #1  <- [0,+,1]
        sub r0, r0, #1  <- [0,-,1]
        cmp r0, #0
        bne LBB1_2

LSR should reuse the "+" IV for the exit test.

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
  %tmp.1 = and i32 %a, 1          ; <i32> [#uses=1]
  %tmp.2 = icmp ne i32 %tmp.1, 0          ; <i1> [#uses=1]
  br i1 %tmp.2, label %then.0, label %else.0

then.0:         ; preds = %entry
  %tmp.5 = add i32 %a, -1         ; <i32> [#uses=1]
  %tmp.3 = call i32 @t4( i32 %tmp.5 )             ; <i32> [#uses=1]
  br label %return

else.0:         ; preds = %entry
  %tmp.7 = icmp ne i32 %a, 0              ; <i1> [#uses=1]
  br i1 %tmp.7, label %then.1, label %return

then.1:         ; preds = %else.0
  %tmp.11 = add i32 %a, -2                ; <i32> [#uses=1]
  %tmp.9 = call i32 @t4( i32 %tmp.11 )            ; <i32> [#uses=1]
  br label %return

return:         ; preds = %then.1, %else.0, %then.0
  %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                      [ %tmp.9, %then.1 ]
  ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination is not transforming this function, because it is
returning n, which fails the isDynamicConstant check in the accumulator
recursion checks.

long long fib(const long long n) {
  switch(n) {
    case 0:
    case 1:
      return n;
    default:
      return fib(n-1) + fib(n-2);
  }
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative.  "return foo() << 1" can be tail recursion eliminated.

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
  %tmp = load i32* %x             ; <i32> [#uses=0]
  %tmp.foo = call i32 @foo( i32* %x )             ; <i32> [#uses=1]
  ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
  %tmp3 = call i32 @foo( i32* %x )                ; <i32> [#uses=1]
  ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

"basicaa" should know how to look through "or" instructions that act like add
instructions.  For example in this code, the x*4+1 is turned into x*4 | 1, and
basicaa can't analyze the array subscript, leading to duplicated loads in the
generated code:

void test(int X, int Y, int a[]) {
  int i;
  for (i=2; i<1000; i+=4) {
    a[i+0] = a[i-1+0]*a[i-2+0];
    a[i+1] = a[i-1+1]*a[i-2+1];
    a[i+2] = a[i-1+2]*a[i-2+2];
    a[i+3] = a[i-1+3]*a[i-2+3];
  }
}

BasicAA also doesn't do this for add.  It needs to know that &A[i+1] != &A[i].

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
        subl    $28, %esp
        call    "L1$pb"
"L1$pb":
        popl    %eax
        cmpl    $0, 32(%esp)
        je      LBB1_2  # cond_true
LBB1_1: # return
        # ...
        addl    $28, %esp
        ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations.  On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar.  On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable.  For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:
        pushl   %esi
        movl    8(%esp), %eax
        movb    _m_HotKey+3, %cl
        movb    _m_HotKey+4, %dl
        movb    _m_HotKey+2, %ch
        movw    _m_HotKey, %si
        movw    %si, (%eax)
        movb    %ch, 2(%eax)
        movb    %cl, 3(%eax)
        movb    %dl, 4(%eax)
        popl    %esi
        ret     $4

GCC produces:

__Z9GetHotKeyv:
        movl    _m_HotKey, %edx
        movl    4(%esp), %eax
        movl    %edx, (%eax)
        movzwl  _m_HotKey+4, %edx
        movw    %dx, 4(%eax)
        ret     $4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

  %struct.THotKey = type { i16, i8, i8, i8 }
define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
...
  %tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
  %tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
  %tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
  %tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

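For example (assuming a target where floor is legal, e.g. SSE4.1's roundsd):

#include <math.h>

double down(double x) {
  return floor(x);    /* could select to one round-toward-minus-infinity insn */
}
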
//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,1,1,1,1,1,1,1};
  foo(input);
}

We currently compile this into a memcpy from a global array since the
initializer is fairly large and not memset'able.  This is good, but the memcpy
gets lowered to load/stores in the code generator.  This is also ok, except
that the codegen lowering for memcpy doesn't handle the case when the source
is a constant global.  This gives us atrocious code like this:

        call    "L1$pb"
"L1$pb":
        popl    %eax
        movl    _C.0.1444-"L1$pb"+32(%eax), %ecx
        movl    %ecx, 40(%esp)
        movl    _C.0.1444-"L1$pb"+20(%eax), %ecx
        movl    %ecx, 28(%esp)
        movl    _C.0.1444-"L1$pb"+36(%eax), %ecx
        movl    %ecx, 44(%esp)
        movl    _C.0.1444-"L1$pb"+44(%eax), %ecx
        movl    %ecx, 52(%esp)
        movl    _C.0.1444-"L1$pb"+40(%eax), %ecx
        movl    %ecx, 48(%esp)
        movl    _C.0.1444-"L1$pb"+12(%eax), %ecx
        movl    %ecx, 20(%esp)
        movl    _C.0.1444-"L1$pb"+4(%eax), %ecx
...

instead of:
        movl    $1, 16(%esp)
        movl    $0, 20(%esp)
        movl    $1, 24(%esp)
        movl    $0, 28(%esp)
        movl    $1, 32(%esp)
        movl    $0, 36(%esp)
        ...

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef".  Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop.  One trivial example is:

#include <stdio.h>
int main() {
    int nRet = 17;
    int nLoop;
    for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
        if ( nLoop & 1 )
            nRet += 2;
        else
            nRet -= 1;
    }
    return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size.  The resultant code would then also be suitable for
exit value computation.

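After unrolling by 2, each copy of the body has a known parity for nLoop, so
the branch folds away; since the trip count (1000) is even, there is no
remainder iteration.  A sketch of the expected result:

    for ( nLoop = 0; nLoop < 1000; nLoop += 2 ) {
        nRet -= 1;    /* nLoop even */
        nRet += 2;    /* nLoop + 1 odd */
    }
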
//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc, x86,
etc.  On X86, we miss a bunch of 'rotate by variable' cases because the rotate
matching code in dag combine doesn't look through truncates aggressively
enough.  Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y){
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x){
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x){
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f2/f3/f4 right.  On x86-32, a few of these
generate truly horrible code, instead of using shld and friends.  On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness.  PPC64 misses f, f5 and f6.  CellSPU aborts in isel.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together.  For
example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy.  This can only
be done safely if "b" isn't modified between the strlen and memcpy of course.

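A sketch of the merged pattern (note the copy must include the nul byte for
the strcpy form to be equivalent):

#include <string.h>

void f(char *a, const char *b) {
  memcpy(a, b, strlen(b) + 1);    /* -> strcpy(a, b), if b is unmodified between the calls */
}
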
//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

//===---------------------------------------------------------------------===//

We generate a horrible libcall for llvm.powi.  For example, we compile:

#include <cmath>
double f(double a) { return std::pow(a, 4); }

into:

__Z1fd:
        subl    $12, %esp
        movsd   16(%esp), %xmm0
        movsd   %xmm0, (%esp)
        movl    $4, 8(%esp)
        call    L___powidf2$stub
        addl    $12, %esp
        ret

GCC produces:

__Z1fd:
        subl    $12, %esp
        movsd   16(%esp), %xmm0
        mulsd   %xmm0, %xmm0
        mulsd   %xmm0, %xmm0
        movsd   %xmm0, (%esp)
        fldl    (%esp)
        addl    $12, %esp
        ret

//===---------------------------------------------------------------------===//

We compile this program: (from GCC PR11680)
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

Into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%    0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%    0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2).  */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper).  Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

I think this basically amounts to a dag combine to simplify comparisons against
multiply hi's into a comparison against the mullo.

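The mul-lo form works via the multiplicative inverse of the divisor mod 2^32:
for 3, the inverse is 0xAAAAAAAB (3 * 0xAAAAAAAB == 1 mod 2^32), and the
multiples of 3 are exactly the values whose product lands in [0, 0xFFFFFFFF/3].
A sketch of the target form (not what we currently generate):

int divisible_by_3(unsigned n) {
  return n * 0xAAAAAAABu <= 0x55555555u;    /* same as n % 3 == 0 */
}
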
//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

Instcombine will merge comparisons like (x >= 10) && (x < 20) by producing (x -
10) u< 10, but only when the comparisons have matching sign.

This could be converted with a similar technique.  (PR1941)

define i1 @test(i8 %x) {
  %A = icmp uge i8 %x, 5
  %B = icmp slt i8 %x, 20
  %C = and i1 %A, %B
  ret i1 %C
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6       ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0.  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
  if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
    f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK).  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 15241:
unsigned int
foo (unsigned int a, unsigned int b)
{
  if (a <= 7 && b <= 7)
    baz ();
}
Should combine to "(a|b) <= 7".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 3756:
int
pn (int n)
{
    return (n >= 0 ? 1 : -1);
}
Should combine to (n >> 31) | 1.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

From GCC Bug 28685:
int test(int a, int b)
{
  int lt = a < b;
  int eq = a == b;

  return (lt || eq);
}
Should combine to "a <= b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

void a(int variable)
{
    if (variable == 4 || variable == 6)
        bar();
}
This should optimize to "if ((variable | 2) == 6)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return
i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing.  Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
  return (abs(x)) >= 0;
}
This should optimize to x == INT_MIN.  (With -fwrapv.)  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
  a = (a << 10) | (a >> 22);
  if (a == 123)
    bar ();
}
void
minus_cst (unsigned int a)
{
  unsigned int tem;

  tem = 20 - a;
  if (tem == 5)
    bar ();
}
void
mask_gt (unsigned int a)
{
  /* This is equivalent to a > 15.  */
  if ((a & ~7) > 8)
    bar ();
}
void
rshift_gt (unsigned int a)
{
  /* This is equivalent to a > 23.  */
  if ((a >> 2) > 5)
    bar ();
}
All should simplify to a single comparison.  All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned char* b) {return *b > 99;}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return ((a | 1) & 3) | (a & -4);}
Should combine to "a | 1".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)".  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a".  Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane.  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c).  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -1 : -9;}
Should combine to (x | -9) ^ 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -9 : -1;}
Should combine to x | -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

        %tmp = icmp eq i32 %decl_context, 4
        %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
        %tmp1 = icmp eq i32 %decl_context_addr.0, 1
        %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp || decl_context == 1)

This allows recursive simplifications, tmp1 is used all over the place in
the function, e.g. by:

   %tmp23 = icmp eq i32 %decl_context_addr.1, 0            ; <i1> [#uses=1]
   %tmp24 = xor i1 %tmp1, true             ; <i1> [#uses=1]
   %or.cond8 = and i1 %tmp23, %tmp24               ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

Store sinking: This code:

void f (int n, int *cond, int *res) {
  int i;
  *res = 0;
  for (i = 0; i < n; i++)
    if (*cond)
      *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out.  This gives us this code:

bb:             ; preds = %bb2, %entry
  %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
  %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
  %1 = load i32* %cond, align 4
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %bb2, label %bb1

bb1:            ; preds = %bb
  %3 = xor i32 %.rle, 234
  store i32 %3, i32* %res, align 4
  br label %bb2

bb2:            ; preds = %bb, %bb1
  %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
  %indvar.next = add i32 %i.05, 1
  %exitcond = icmp eq i32 %indvar.next, %n
  br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partially dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.

//===---------------------------------------------------------------------===//

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store.  We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[PHI TRANSLATE GEPs]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic.  This could be handled by GVN with some crazy
symbolic phi translation.  The code we get looks like (g is on the stack):

bb2:            ; preds = %bb1
..
  %9 = getelementptr %struct.f* %g, i32 0, i32 0
  store i32 %8, i32* %9, align 4
  br label %bb3

bb3:            ; preds = %bb1, %bb2, %bb
  %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
  %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
  %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
  %11 = load i32* %10, align 4

%11 is fully redundant, and in BB2 it should have the value %8.

GCC PR33344 is a similar case.

//===---------------------------------------------------------------------===//

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite.  There are many pre testcases as ssa-pre-*.c

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite.  For example, predcom-1.c is:

 for (i = 2; i < 1000; i++)
    fib[i] = (fib[i-1] + fib[i - 2]) & 0xffff;

which compiles into:

bb1:            ; preds = %bb1, %bb1.thread
  %indvar = phi i32 [ 0, %bb1.thread ], [ %0, %bb1 ]
  %i.0.reg2mem.0 = add i32 %indvar, 2
  %0 = add i32 %indvar, 1         ; <i32> [#uses=3]
  %1 = getelementptr [1000 x i32]* @fib, i32 0, i32 %0
  %2 = load i32* %1, align 4              ; <i32> [#uses=1]
  %3 = getelementptr [1000 x i32]* @fib, i32 0, i32 %indvar
  %4 = load i32* %3, align 4              ; <i32> [#uses=1]
  %5 = add i32 %4, %2             ; <i32> [#uses=1]
  %6 = and i32 %5, 65535          ; <i32> [#uses=1]
  %7 = getelementptr [1000 x i32]* @fib, i32 0, i32 %i.0.reg2mem.0
  store i32 %6, i32* %7, align 4
  %exitcond = icmp eq i32 %0, 998         ; <i1> [#uses=1]
  br i1 %exitcond, label %return, label %bb1

This is basically:
  LOAD fib[i+1]
  LOAD fib[i]
  STORE fib[i+2]

instead of handling this as a loop or other xform, all we'd need to do is teach
load PRE to phi translate the %0 add (i+1) into the predecessor as (i'+1+1) =
(i'+2) (where i' is the previous iteration of i).  This would find the store
which feeds it.

predcom-2.c is apparently the same as predcom-1.c
predcom-3.c is very similar but needs loads feeding each other instead of
store->load.
predcom-4.c seems the same as the rest.

//===---------------------------------------------------------------------===//

Other simple load PRE cases:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35287 [LPRE crit edge splitting]

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34677 (licm does this, LPRE crit edge)
  llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as | opt -mem2reg -simplifycfg -gvn | llvm-dis

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16799 [BITCAST PHI TRANS]

//===---------------------------------------------------------------------===//

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store.  This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

struct X { int i; };
int foo (int x) {
  struct X a;
  struct X b;
  struct X *p;
  a.i = 1;
  b.i = 2;
  if (x)
    p = &a;
  else
    p = &b;
  return p->i;
}

//===---------------------------------------------------------------------===//

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
      opt -mem2reg -gvn -instcombine | llvm-dis
we miss it because we need 1) GEP PHI TRAN, 2) CRIT EDGE 3) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, loading from null is undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "") -> strlen(x)
strcspn("", x) -> 0
strspn("", x) -> 0
strspn(x, "") -> strlen(x)
strspn(x, "a") -> strchr(x, 'a')-x

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character.  See PR3253 for some notes on
codegen.

456.hmmer apparently uses strcspn and strspn a lot.  471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
  else if (strchr ("<>", *intel_parser.op_string)

Those should be turned into a switch.

//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

        %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)    ; uses = 1
        %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
        call void @llvm.memcpy.i32(i8* %endptr,
          i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
        %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple reasons.  First, in this:

        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)

The strlen could be replaced with: %strlen = sub %3072, %3073, because the
strcpy call returns a pointer to the end of the string.  Based on that, the
endptr GEP just becomes equal to 3073, which eliminates a strlen call and GEP.

Second, the memcpy+strlen strlen can be replaced with:

        %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

Because the destination was just copied into the specified memory buffer.  This,
in turn, can be constant folded to "4".

In other code, it contains:

        %endptr6978 = bitcast i8* %endptr69 to i32*
        store i32 7107374, i32* %endptr6978, align 1
        %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded.  Whatever is producing this should probably
be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit
  %682 = getelementptr i8** %argv, i32 6          ; <i8**> [#uses=2]
  %683 = load i8** %682, align 4          ; <i8*> [#uses=4]
  %684 = load i8* %683, align 1           ; <i8> [#uses=1]
  %685 = icmp eq i8 %684, 0               ; <i1> [#uses=1]
  br i1 %685, label %bb10, label %bb9

bb9:            ; preds = %bb8
  %686 = call i32 @strlen(i8* %683) nounwind readonly
  %687 = icmp ugt i32 %686, 254           ; <i1> [#uses=1]
  br i1 %687, label %bb10, label %bb11

bb10:           ; preds = %bb9, %bb8
  %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

//===---------------------------------------------------------------------===//

I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
which looks like:
  %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0

bb62:           ; preds = %bb55, %bb53
  %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
  %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
  %172 = add i32 %171, -1         ; <i32> [#uses=1]
  %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172

...  no stores ...
  br i1 %or.cond, label %bb65, label %bb72

bb65:           ; preds = %bb62
  store i8 0, i8* %173, align 1
  br label %bb72

bb72:           ; preds = %bb65, %bb62
  %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
  %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1

Note that on the bb62->bb72 path, the %177 strlen call is partially
redundant with the %171 call.  At worst, we could shove the %177 strlen call
up into the bb65 block, moving it out of the bb62->bb72 path.  However, note
that bb65 stores to the string, zeroing out the last byte.  This means that on
that path the value of %177 is actually just %171-1.  A sub is cheaper than a
strlen!

This pattern repeats several times, basically doing:

  A = strlen(P);
  P[A-1] = 0;
  B = strlen(P);
  where it is "obvious" that B = A-1.

//===---------------------------------------------------------------------===//

186.crafty contains this interesting pattern:

%77 = call i8* @strstr(i8* getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0),
                       i8* %30)
%phitmp648 = icmp eq i8* %77, getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0)
br i1 %phitmp648, label %bb70, label %bb76

bb70:           ; preds = %OptionMatch.exit91, %bb69
  %78 = call i32 @strlen(i8* %30) nounwind readonly align 1    ; <i32> [#uses=1]

This is basically:
  cststr = "abcdef";
  if (strstr(cststr, P) == cststr) {
    x = strlen(P);
    ...

The strstr call would be significantly cheaper written as:

cststr = "abcdef";
if (memcmp(P, cststr, strlen(P)) == 0)
  x = strlen(P);

This is memcmp+strlen instead of strstr.  This also makes the strlen fully
redundant.

//===---------------------------------------------------------------------===//

186.crafty also contains this code:

%1906 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1907 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1906
%1908 = call i8* @strcpy(i8* %1907, i8* %1905) nounwind align 1
%1909 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1910 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1909

The last strlen is computable as 1908-@pgn_event, which means 1910=1908.

//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
        i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
        i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call @printf(i8* ...   @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

  memcpy(globalarray, "string");
  printf(...,  globalarray);

Anyway, by knowing that printf just reads the memory and forward substituting
the string directly into the printf, this eliminates reads from globalarray.
Since this pattern occurs frequently in crafty (due to the "DisplayTime" and
other similar functions) there are many stores to "out".  Once all the printfs
stop using "out", all that is left is the memcpy's into it.  This should allow
globalopt to remove the "stored only" global.

//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
  %tmp0 = load i8* %p
  %tmp1 = ashr i8 %tmp0, 5
  %tmp2 = sext i8 %tmp1 to i32
  ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load with a shift.
For example, on x86 this currently gets this:

        movb    (%eax), %al
        sarb    $5, %al
        movsbl  %al, %eax

while it could get this:

        movsbl  (%eax), %eax
        sarl    $5, %eax

//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; }     // --> return false
int test2(int x) { return 2-x == x; }    // --> return x == 1 ?

Always foldable for odd constants, what is the rule for even?  (Modulo 2^32,
C-x == x means 2*x == C: there is no solution when C is odd, and exactly two
solutions, x = C/2 and x = C/2 + 2^31, when C is even -- so test2 is not
simply x == 1.)

//===---------------------------------------------------------------------===//

PR 3381: GEP to field of size 0 inside a struct could be turned into GEP
for next field in struct (which is at same address).

For example: store of float into { {{}}, float } could be turned into a store
to the float directly.

//===---------------------------------------------------------------------===//

#include <math.h>
double foo(double a) { return sin(a); }

This compiles into this on x86-64 Linux:
foo:
        subq    $8, %rsp
        call    sin
        addq    $8, %rsp
        ret
vs:

foo:
        jmp     sin

//===---------------------------------------------------------------------===//

The arg promotion pass should make use of nocapture to make its alias analysis
stuff much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int int_char(char m) {if(m>7) return 0; return m;}

//===---------------------------------------------------------------------===//

Instcombine should replace the load with a constant in:

  static const char x[4] = {'a', 'b', 'c', 'd'};

  unsigned int y(void) {
    return *(unsigned int *)x;
  }

It currently only does this transformation when the size of the constant
is the same as the size of the integer (so, try x[5]) and the last byte
is a null (making it a C string).  There's no need for these restrictions.

//===---------------------------------------------------------------------===//

InstCombine's "turn load from constant into constant" optimization should be
more aggressive in the presence of bitcasts.  For example, because of unions,
this code:

union vec2d {
    double e[2];
    double v __attribute__((vector_size(16)));
};
typedef union vec2d vec2d;

static vec2d a={{1,2}}, b={{3,4}};

vec2d foo () {
    return (vec2d){ .v = a.v + b.v * (vec2d){{5,5}}.v };
}

Compiles into:

@a = internal constant %0 { [2 x double]
  [double 1.000000e+00, double 2.000000e+00] }, align 16
@b = internal constant %0 { [2 x double]
  [double 3.000000e+00, double 4.000000e+00] }, align 16
...
define void @foo(%struct.vec2d* noalias nocapture sret %agg.result) nounwind {
entry:
  %0 = load <2 x double>* getelementptr (%struct.vec2d*
    bitcast (%0* @a to %struct.vec2d*), i32 0, i32 0), align 16
  %1 = load <2 x double>* getelementptr (%struct.vec2d*
    bitcast (%0* @b to %struct.vec2d*), i32 0, i32 0), align 16

Instcombine should be able to optimize away the loads (and thus the globals).

See also PR4973

//===---------------------------------------------------------------------===//
1646//===---------------------------------------------------------------------===//