Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent.

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)  This
isn't safe in general, even on darwin.  See the libm implementation of hypot
for examples (which special case when x/y are exactly zero to get signed zeros
etc right).

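For reference, the naive expansion (a hedged sketch; hypot() is specified to
avoid the spurious overflow/underflow this version incurs for very large or
very tiny inputs, which is exactly why -ffast-math is needed):

#include <math.h>
double naive_hypot(double x, double y) {
  /* x*x or y*y can overflow to +inf even when the true result fits */
  return sqrt(x*x + y*y);
}
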
//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack.  But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
  x = 1ULL << i;

into:

  long long tmp = 1;
  for (i = ...; ++i, tmp+=tmp)
    x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

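In source terms this is a sign test of a loaded word, where only the byte
holding the sign bit actually needs to be loaded (which byte that is depends
on endianness).  A hedged C sketch of the pattern:

int is_negative(int *P) {
  return *P < 0;   /* an i8 load of the sign-carrying byte of *P suffices */
}
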
//===---------------------------------------------------------------------===//

Reassociate should turn: X*X*X*X -> t=(X*X) (t*t) to eliminate a multiply.

//===---------------------------------------------------------------------===//

An interesting testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

Reassociate should handle the example in GCC PR16157.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) { return memcmp(j, l, 4); }
int h(int *j, int *l) { return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1, 2, 4, and 8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works but can be overly conservative as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3]};
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2 #cond_next
LBB16_1: #cond_true
        incl _foo
LBB16_2: #cond_next

emit:

        movl _foo, %eax
        cmpl $1, %edi
        sbbl $-1, %eax
        movl %eax, _foo

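The C idiom being matched is presumably along these lines (names hypothetical);
the sbb sequence above computes foo += (x != 0) without a branch:

int foo;
void maybe_increment(int x) {
  if (x)
    foo++;
}
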
//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs.  We could even make an intrinsic for this
if anyone cared enough about sincos.

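A minimal sketch of the source pattern to combine (using the prototypes listed
above; sincos and friends are common libm extensions rather than ISO C):

  double a = sin(x);
  double b = cos(x);

becomes:

  double s, c;
  sincos(x, &s, &c);   /* one call computes both results */
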
//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

define void @test(i32* %P) {
  %tmp = load i32* %P
  %tmp14 = or i32 %tmp, 3305111552
  %tmp15 = and i32 %tmp14, 3321888767
  store i32 %tmp15, i32* %P
  ret void
}

//===---------------------------------------------------------------------===//

dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.

Compile:

int bar(int x)
{
  int t = __builtin_clz(x);
  return -(t>>5);
}

to:

_bar:   addic r3,r3,-1
        subfe r3,r3,r3
        blr

//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
        {
          /* Flip the target bit of each basis state */
          reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }

Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be
just so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias reg->node[i],
but alas.

//===---------------------------------------------------------------------===//

This should be optimized to one 'and' and one 'or', from PR4216:

define i32 @test_bitfield(i32 %bf.prev.low) nounwind ssp {
entry:
  %bf.prev.lo.cleared10 = or i32 %bf.prev.low, 32962      ; <i32> [#uses=1]
  %0 = and i32 %bf.prev.low, -65536                       ; <i32> [#uses=1]
  %1 = and i32 %bf.prev.lo.cleared10, 40186               ; <i32> [#uses=1]
  %2 = or i32 %1, %0                                      ; <i32> [#uses=1]
  ret i32 %2
}

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
  unsigned t;
  t = v ^ ((v << 16) | (v >> 16));
  t &= ~0xff0000;
  v = (v << 24) | (v >> 8);
  return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

BITBOARD = unsigned long long
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

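For comparison, a memcpy-based version already gets the desired single 16-bit
load on a little-endian target (a sketch, assuming <string.h>); the shift-or
forms above should be recognized as the same thing, plus a byte swap on the
opposite-endian target:

unsigned short read_16_le_memcpy(const unsigned char *adr) {
  unsigned short v;
  memcpy(&v, adr, sizeof v);   /* one (possibly unaligned) 16-bit load */
  return v;
}
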
//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

//===---------------------------------------------------------------------===//

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level.  We need a "loops to memcpy"
pass.

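The loops in question are plain element-by-element copies, along these
hypothetical lines:

  for (i = 0; i < n; i++)
    history2[i] = history[i];

which such a pass would rewrite as

  memcpy(history2, history, n * sizeof(history[0]));

provided the two regions are known not to overlap and the loop body has no
other side effects.
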
//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
  U64 effective_addr2;
  U32 temp = *inst;
  int r1 = (temp >> 20) & 0xf;
  int b2 = (temp >> 16) & 0xf;
  effective_addr2 = temp & 0xfff;
  if (b2) effective_addr2 += regs[b2];
  b2 = (temp >> 12) & 0xf;
  if (b2) effective_addr2 += regs[b2];
  effective_addr2 &= regs[4];
  if ((effective_addr2 & 3) == 0)
    return 1;
  return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

PHI Slicing could be extended to do this.

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two near-identical IVs (after promotion) on PPC/ARM:

LBB1_2:
        ldr r3, LCPI1_0
        ldr r3, [r3]
        strh r2, [r3]
        ldr r3, LCPI1_1
        ldr r3, [r3]
        strh r1, [r3]
        add r1, r1, #4
        add r2, r2, #1   <- [0,+,1]
        sub r0, r0, #1   <- [0,-,1]
        cmp r0, #0
        bne LBB1_2

LSR should reuse the "+" IV for the exit test.

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
  %tmp.1 = and i32 %a, 1                          ; <i32> [#uses=1]
  %tmp.2 = icmp ne i32 %tmp.1, 0                  ; <i1> [#uses=1]
  br i1 %tmp.2, label %then.0, label %else.0

then.0:         ; preds = %entry
  %tmp.5 = add i32 %a, -1                         ; <i32> [#uses=1]
  %tmp.3 = call i32 @t4( i32 %tmp.5 )             ; <i32> [#uses=1]
  br label %return

else.0:         ; preds = %entry
  %tmp.7 = icmp ne i32 %a, 0                      ; <i1> [#uses=1]
  br i1 %tmp.7, label %then.1, label %return

then.1:         ; preds = %else.0
  %tmp.11 = add i32 %a, -2                        ; <i32> [#uses=1]
  %tmp.9 = call i32 @t4( i32 %tmp.11 )            ; <i32> [#uses=1]
  br label %return

return:         ; preds = %then.1, %else.0, %then.0
  %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                      [ %tmp.9, %then.1 ]
  ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative.  "return foo() << 1" can be tail recursion eliminated.

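Treating the "2 *" as an associative accumulator, TRE could produce an
iterative form along these lines (a sketch; names hypothetical):

int pow2m1_iter(int n) {
  int mul_acc = 1;   /* accumulates the "2 *" factors */
  int add_acc = 0;   /* accumulates the "+ 1" terms, pre-scaled */
  while (n != 0) {
    add_acc += mul_acc;
    mul_acc *= 2;
    n -= 1;
  }
  return add_acc;    /* == 2^n - 1, same as the recursion */
}
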
//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
  %tmp = load i32* %x                             ; <i32> [#uses=0]
  %tmp.foo = call i32 @foo( i32* %x )             ; <i32> [#uses=1]
  ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
  %tmp3 = call i32 @foo( i32* %x )                ; <i32> [#uses=1]
  ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

"basicaa" should know how to look through "or" instructions that act like add
instructions.  For example in this code, the x*4+1 is turned into x*4 | 1, and
basicaa can't analyze the array subscript, leading to duplicated loads in the
generated code:

void test(int X, int Y, int a[]) {
  int i;
  for (i=2; i<1000; i+=4) {
    a[i+0] = a[i-1+0]*a[i-2+0];
    a[i+1] = a[i-1+1]*a[i-2+1];
    a[i+2] = a[i-1+2]*a[i-2+2];
    a[i+3] = a[i-1+3]*a[i-2+3];
  }
}

BasicAA also doesn't do this for add.  It needs to know that &A[i+1] != &A[i].

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
        subl $28, %esp
        call "L1$pb"
"L1$pb":
        popl %eax
        cmpl $0, 32(%esp)
        je LBB1_2  # cond_true
LBB1_1: # return
        # ...
        addl $28, %esp
        ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations.  On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar.  On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable.  For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:
        pushl %esi
        movl 8(%esp), %eax
        movb _m_HotKey+3, %cl
        movb _m_HotKey+4, %dl
        movb _m_HotKey+2, %ch
        movw _m_HotKey, %si
        movw %si, (%eax)
        movb %ch, 2(%eax)
        movb %cl, 3(%eax)
        movb %dl, 4(%eax)
        popl %esi
        ret $4

GCC produces:

__Z9GetHotKeyv:
        movl _m_HotKey, %edx
        movl 4(%esp), %eax
        movl %edx, (%eax)
        movzwl _m_HotKey+4, %edx
        movw %dx, 4(%eax)
        ret $4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

  %struct.THotKey = type { i16, i8, i8, i8 }
define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
...
  %tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
  %tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
  %tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
  %tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,1,1,1,1,1,1,1};
  foo(input);
}

We currently compile this into a memcpy from a global array since the
initializer is fairly large and not memset'able.  This is good, but the memcpy
gets lowered to load/stores in the code generator.  This is also ok, except
that the codegen lowering for memcpy doesn't handle the case when the source
is a constant global.  This gives us atrocious code like this:

        call "L1$pb"
"L1$pb":
        popl %eax
        movl _C.0.1444-"L1$pb"+32(%eax), %ecx
        movl %ecx, 40(%esp)
        movl _C.0.1444-"L1$pb"+20(%eax), %ecx
        movl %ecx, 28(%esp)
        movl _C.0.1444-"L1$pb"+36(%eax), %ecx
        movl %ecx, 44(%esp)
        movl _C.0.1444-"L1$pb"+44(%eax), %ecx
        movl %ecx, 52(%esp)
        movl _C.0.1444-"L1$pb"+40(%eax), %ecx
        movl %ecx, 48(%esp)
        movl _C.0.1444-"L1$pb"+12(%eax), %ecx
        movl %ecx, 20(%esp)
        movl _C.0.1444-"L1$pb"+4(%eax), %ecx
...

instead of:
        movl $1, 16(%esp)
        movl $0, 20(%esp)
        movl $1, 24(%esp)
        movl $0, 28(%esp)
        movl $1, 32(%esp)
        movl $0, 36(%esp)
        ...

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef".  Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop.  One trivial example is:

#include <stdio.h>
int main() {
  int nRet = 17;
  int nLoop;
  for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
    if ( nLoop & 1 )
      nRet += 2;
    else
      nRet -= 1;
  }
  return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size.  The resultant code would then also be suitable for
exit value computation.

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc, x86,
etc.  On X86, we miss a bunch of 'rotate by variable' cases because the rotate
matching code in dag combine doesn't look through truncates aggressively
enough.  Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y){
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x){
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x){
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f2/f3/f4 right.  On x86-32, a few of these
generate truly horrible code, instead of using shld and friends.  On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness.  PPC64 misses f, f5 and f6.  CellSPU aborts in isel.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together.  For
example, it is useful to merge memcpy(a,b,strlen(b)+1) -> strcpy(a,b).  This
can only be done safely if "b" isn't modified between the strlen and memcpy of
course.

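A hedged sketch of the pattern in source form (valid only under the
no-modification condition above; note the +1 so the nul terminator is
included):

  size_t len = strlen(b);
  memcpy(a, b, len + 1);   /* copies the terminating nul too */

which merges to:

  strcpy(a, b);            /* len is still available if needed elsewhere */
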
//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

//===---------------------------------------------------------------------===//

We generate a horrible libcall for llvm.powi.  For example, we compile:

#include <cmath>
double f(double a) { return std::pow(a, 4); }

into:

__Z1fd:
        subl $12, %esp
        movsd 16(%esp), %xmm0
        movsd %xmm0, (%esp)
        movl $4, 8(%esp)
        call L___powidf2$stub
        addl $12, %esp
        ret

GCC produces:

__Z1fd:
        subl $12, %esp
        movsd 16(%esp), %xmm0
        mulsd %xmm0, %xmm0
        mulsd %xmm0, %xmm0
        movsd %xmm0, (%esp)
        fldl (%esp)
        addl $12, %esp
        ret

//===---------------------------------------------------------------------===//

We compile this program (from GCC PR11680):
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%    0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%    0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2). */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper).  Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

I think this basically amounts to a dag combine to simplify comparisons against
multiply hi's into a comparison against the mullo.

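Concretely, for n % 3 the mul-lo form multiplies by the modular inverse of 3
(0xAAAAAAAB, since 3 * 0xAAAAAAAB == 1 mod 2^32); multiples of 3 are exactly
the values that land in the bottom third of the unsigned range:

int is_multiple_of_3(unsigned n) {
  /* n % 3 == 0  iff  n * inverse(3) <= floor((2^32 - 1) / 3) */
  return n * 0xAAAAAAABu <= 0x55555555u;
}
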
//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

Instcombine will merge comparisons like (x >= 10) && (x < 20) by producing
(x - 10) u< 10, but only when the comparisons have matching sign.

This could be converted with a similar technique (PR1941):

define i1 @test(i8 %x) {
  %A = icmp uge i8 %x, 5
  %B = icmp slt i8 %x, 20
  %C = and i1 %A, %B
  ret i1 %C
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6    ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0.  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
  if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
    f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 15241:
unsigned int
foo (unsigned int a, unsigned int b)
{
  if (a <= 7 && b <= 7)
    baz ();
}
Should combine to "(a|b) <= 7".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 3756:
int
pn (int n)
{
  return (n >= 0 ? 1 : -1);
}
Should combine to (n >> 31) | 1.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

From GCC Bug 28685:
int test(int a, int b)
{
  int lt = a < b;
  int eq = a == b;

  return (lt || eq);
}
Should combine to "a <= b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

void a(int variable)
{
  if (variable == 4 || variable == 6)
    bar();
}
This should optimize to "if ((variable | 2) == 6)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing.  Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
  return (abs(x)) >= 0;
}
This should optimize to x == INT_MIN.  (With -fwrapv.)  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
  a = (a << 10) | (a >> 22);
  if (a == 123)
    bar ();
}
void
minus_cst (unsigned int a)
{
  unsigned int tem;

  tem = 20 - a;
  if (tem == 5)
    bar ();
}
void
mask_gt (unsigned int a)
{
  /* This is equivalent to a > 15. */
  if ((a & ~7) > 8)
    bar ();
}
void
rshift_gt (unsigned int a)
{
  /* This is equivalent to a > 23. */
  if ((a >> 2) > 5)
    bar ();
}
All should simplify to a single comparison.  All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned char* b) {return *b > 99;}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return ((a | 1) & 3) | (a & -4);}
Should combine to "a | 1".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)".  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a".  Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane.  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c).  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -1 : -9;}
Should combine to (x | -9) ^ 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -9 : -1;}
Should combine to x | -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

  %tmp = icmp eq i32 %decl_context, 4
  %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
  %tmp1 = icmp eq i32 %decl_context_addr.0, 1
  %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp || decl_context == 1)

This allows recursive simplifications; tmp1 is used all over the place in
the function, e.g. by:

  %tmp23 = icmp eq i32 %decl_context_addr.1, 0    ; <i1> [#uses=1]
  %tmp24 = xor i1 %tmp1, true                     ; <i1> [#uses=1]
  %or.cond8 = and i1 %tmp23, %tmp24               ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

Store sinking: This code:

void f (int n, int *cond, int *res) {
  int i;
  *res = 0;
  for (i = 0; i < n; i++)
    if (*cond)
      *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out.  This gives us this code:

bb:             ; preds = %bb2, %entry
  %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
  %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
  %1 = load i32* %cond, align 4
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %bb2, label %bb1

bb1:            ; preds = %bb
  %3 = xor i32 %.rle, 234
  store i32 %3, i32* %res, align 4
  br label %bb2

bb2:            ; preds = %bb, %bb1
  %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
  %indvar.next = add i32 %i.05, 1
  %exitcond = icmp eq i32 %indvar.next, %n
  br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partial dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.

//===---------------------------------------------------------------------===//

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store.  We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[PHI TRANSLATE GEPs]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic.  This could be handled by GVN with some crazy
symbolic phi translation.  The code we get looks like (g is on the stack):

bb2:            ; preds = %bb1
..
  %9 = getelementptr %struct.f* %g, i32 0, i32 0
  store i32 %8, i32* %9, align 4
  br label %bb3

bb3:            ; preds = %bb1, %bb2, %bb
  %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
  %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
  %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
  %11 = load i32* %10, align 4

%11 is fully redundant, and in BB2 it should have the value %8.

GCC PR33344 is a similar case.

//===---------------------------------------------------------------------===//

[PHI TRANSLATE INDEXED GEPs]  PR5313

Load redundancy elimination for simple loop.  This loop:

void append_text(const char* text, unsigned char * const io) {
  while (*text)
    *io = *text++;
}

Compiles to have a fully redundant load in the loop (%2):

define void @append_text(i8* nocapture %text, i8* nocapture %io) nounwind {
entry:
  %0 = load i8* %text, align 1                    ; <i8> [#uses=1]
  %1 = icmp eq i8 %0, 0                           ; <i1> [#uses=1]
  br i1 %1, label %return, label %bb

bb:             ; preds = %bb, %entry
  %indvar = phi i32 [ 0, %entry ], [ %tmp, %bb ]  ; <i32> [#uses=2]
  %text_addr.04 = getelementptr i8* %text, i32 %indvar  ; <i8*> [#uses=1]
  %2 = load i8* %text_addr.04, align 1            ; <i8> [#uses=1]
  store i8 %2, i8* %io, align 1
  %tmp = add i32 %indvar, 1                       ; <i32> [#uses=2]
  %scevgep = getelementptr i8* %text, i32 %tmp    ; <i8*> [#uses=1]
  %3 = load i8* %scevgep, align 1                 ; <i8> [#uses=1]
  %4 = icmp eq i8 %3, 0                           ; <i1> [#uses=1]
  br i1 %4, label %return, label %bb

return:         ; preds = %bb, %entry
  ret void
}

//===---------------------------------------------------------------------===//

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite.  There are many pre testcases as ssa-pre-*.c

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite.  For example, predcom-1.c is:

  for (i = 2; i < 1000; i++)
    fib[i] = (fib[i-1] + fib[i - 2]) & 0xffff;

which compiles into:

bb1:            ; preds = %bb1, %bb1.thread
  %indvar = phi i32 [ 0, %bb1.thread ], [ %0, %bb1 ]
  %i.0.reg2mem.0 = add i32 %indvar, 2
  %0 = add i32 %indvar, 1                         ; <i32> [#uses=3]
  %1 = getelementptr [1000 x i32]* @fib, i32 0, i32 %0
  %2 = load i32* %1, align 4                      ; <i32> [#uses=1]
  %3 = getelementptr [1000 x i32]* @fib, i32 0, i32 %indvar
  %4 = load i32* %3, align 4                      ; <i32> [#uses=1]
  %5 = add i32 %4, %2                             ; <i32> [#uses=1]
  %6 = and i32 %5, 65535                          ; <i32> [#uses=1]
  %7 = getelementptr [1000 x i32]* @fib, i32 0, i32 %i.0.reg2mem.0
  store i32 %6, i32* %7, align 4
  %exitcond = icmp eq i32 %0, 998                 ; <i1> [#uses=1]
  br i1 %exitcond, label %return, label %bb1

This is basically:
  LOAD fib[i+1]
  LOAD fib[i]
  STORE fib[i+2]

instead of handling this as a loop or other xform, all we'd need to do is teach
load PRE to phi translate the %0 add (i+1) into the predecessor as (i'+1+1) =
(i'+2) (where i' is the previous iteration of i).  This would find the store
which feeds it.

predcom-2.c is apparently the same as predcom-1.c
predcom-3.c is very similar but needs loads feeding each other instead of
store->load.
predcom-4.c seems the same as the rest.

//===---------------------------------------------------------------------===//

Other simple load PRE cases:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35287 [LPRE crit edge splitting]

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34677 (licm does this, LPRE crit edge)
  llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as | opt -mem2reg -simplifycfg -gvn | llvm-dis

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16799 [BITCAST PHI TRANS]

//===---------------------------------------------------------------------===//

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store.  This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

struct X { int i; };
int foo (int x) {
  struct X a;
  struct X b;
  struct X *p;
  a.i = 1;
  b.i = 2;
  if (x)
    p = &a;
  else
    p = &b;
  return p->i;
}

//===---------------------------------------------------------------------===//

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
      opt -mem2reg -gvn -instcombine | llvm-dis
we miss it because we need 1) GEP PHI TRAN, 2) CRIT EDGE 3) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, loading from null is undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "") -> strlen(x)
strcspn("", x) -> 0
strspn("", x) -> 0
strspn(x, "") -> strlen(x)
strspn(x, "a") -> strchr(x, 'a')-x

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character.  See PR3253 for some notes on
codegen.

456.hmmer apparently uses strcspn and strspn a lot.  471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
  else if (strchr ("<>", *intel_parser.op_string)

Those should be turned into a switch.

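A hedged sketch of that switch form (note that strchr also matches the
terminating nul, so the '\0' case has to be preserved to keep the semantics):

  switch (*intel_parser.op_string) {
  case '\0': case '<': case '>':
    /* matched */
    ...
    break;
  default:
    /* not matched */
    ...
  }
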
//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

  %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
  %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
  %strlen = call i32 @strlen(i8* %3072)           ; uses = 1
  %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
  call void @llvm.memcpy.i32(i8* %endptr,
    i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
  %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple reasons.  First, in this:

  %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
  %strlen = call i32 @strlen(i8* %3072)

The strlen could be replaced with: %strlen = sub %3072, %3073, because the
strcpy call returns a pointer to the end of the string.  Based on that, the
endptr GEP just becomes equal to %3073, which eliminates a strlen call and GEP.

Second, the memcpy+strlen strlen can be replaced with:

  %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

Because the destination was just copied into the specified memory buffer.
This, in turn, can be constant folded to "4".

In other code, it contains:

  %endptr6978 = bitcast i8* %endptr69 to i32*
  store i32 7107374, i32* %endptr6978, align 1
  %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded.  Whatever is producing this should
probably be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit
  %682 = getelementptr i8** %argv, i32 6          ; <i8**> [#uses=2]
  %683 = load i8** %682, align 4                  ; <i8*> [#uses=4]
  %684 = load i8* %683, align 1                   ; <i8> [#uses=1]
  %685 = icmp eq i8 %684, 0                       ; <i1> [#uses=1]
  br i1 %685, label %bb10, label %bb9

bb9:            ; preds = %bb8
  %686 = call i32 @strlen(i8* %683) nounwind readonly
  %687 = icmp ugt i32 %686, 254                   ; <i1> [#uses=1]
  br i1 %687, label %bb10, label %bb11

bb10:           ; preds = %bb9, %bb8
  %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

//===---------------------------------------------------------------------===//

I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
which looks like:
  %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0

bb62:           ; preds = %bb55, %bb53
  %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
  %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
  %172 = add i32 %171, -1                         ; <i32> [#uses=1]
  %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172

... no stores ...
  br i1 %or.cond, label %bb65, label %bb72

bb65:           ; preds = %bb62
  store i8 0, i8* %173, align 1
  br label %bb72

bb72:           ; preds = %bb65, %bb62
  %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
  %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1

Note that on the bb62->bb72 path, the %177 strlen call is partially
redundant with the %171 call.  At worst, we could shove the %177 strlen call
up into the bb65 block, moving it out of the bb62->bb72 path.  However, note
that bb65 stores to the string, zeroing out the last byte.  This means that on
that path the value of %177 is actually just %171-1.  A sub is cheaper than a
strlen!

This pattern repeats several times, basically doing:

  A = strlen(P);
  P[A-1] = 0;
  B = strlen(P);
  where it is "obvious" that B = A-1.

//===---------------------------------------------------------------------===//
1487
1488186.crafty contains this interesting pattern:
1489
1490%77 = call i8* @strstr(i8* getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0),
1491 i8* %30)
1492%phitmp648 = icmp eq i8* %77, getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0)
1493br i1 %phitmp648, label %bb70, label %bb76
1494
1495bb70: ; preds = %OptionMatch.exit91, %bb69
1496 %78 = call i32 @strlen(i8* %30) nounwind readonly align 1 ; <i32> [#uses=1]
1497
This is basically:
  cststr = "abcdef";
  if (strstr(cststr, P) == cststr) {
    x = strlen(P);
    ...

The strstr call would be significantly cheaper written as:

cststr = "abcdef";
if (memcmp(P, cststr, strlen(P)) == 0)
  x = strlen(P);

This is memcmp+strlen instead of strstr: strstr returning its first argument
means P is a prefix of cststr, which the length-bounded memcmp checks directly.
This also makes the strlen fully redundant.

//===---------------------------------------------------------------------===//

186.crafty also contains this code:

%1906 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1907 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1906
%1908 = call i8* @strcpy(i8* %1907, i8* %1905) nounwind align 1
%1909 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1910 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1909

The last strlen is computable as %1908-@pgn_event (given the end-of-string
return convention for strcpy noted earlier), which means %1910 = %1908.
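
A small C sketch of the same fold, using stpcpy (a POSIX function that really
does return the end of the copied string) in place of that strcpy convention:

#include <string.h>

char pgn_event[32];

void append(const char *s) {
  /* end points at the terminating nul of the concatenated string */
  char *end = stpcpy(pgn_event + strlen(pgn_event), s);
  /* a following strlen(pgn_event) is just pointer arithmetic */
  size_t len = (size_t)(end - pgn_event);
  (void)len;
}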

//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
        i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
        i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call @printf(i8* ... @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

  memcpy(globalarray, "string");
  printf(..., globalarray);

By knowing that printf just reads the memory, the string can be forward
substituted directly into the printf call, eliminating the reads from
globalarray. Since this pattern occurs frequently in crafty (due to the
"DisplayTime" and other similar functions) there are many stores to "out".
Once all the printfs stop using "out", all that is left is the memcpys into
it. This should allow globalopt to remove the "stored only" global.
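
A hedged C illustration (names invented; out stands in for @out.4543):

#include <stdio.h>
#include <string.h>

char out[10];

void before(void) {
  memcpy(out, "string", 7);    /* store the constant into the global */
  printf("%s", out);           /* reads the global back */
}

void after(void) {
  memcpy(out, "string", 7);    /* dead once nothing reads out */
  printf("%s", "string");      /* constant forwarded into the call */
}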

//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
  %tmp0 = load i8* %p
  %tmp1 = ashr i8 %tmp0, 5
  %tmp2 = sext i8 %tmp1 to i32
  ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load with a shift. For example, on
x86 this currently generates:

  movb (%eax), %al
  sarb $5, %al
  movsbl %al, %eax

while it could generate:

  movsbl (%eax), %eax
  sarl $5, %eax

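For reference, a rough C equivalent of the IR above:

int foo(signed char *p) { return *p >> 5; }
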
//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; }     // --> return false
int test2(int x) { return 2-x == x; }    // --> return x == 1 ?

Always foldable for odd constants; what is the rule for even ones?
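
One possible answer, treating the arithmetic as wrapping mod 2^n: C-x == x
means C == 2*x, and 2*x is always even, so the compare is always false for
odd C. For even C, x == C/2 is a solution, but under wrapping so is
x == C/2 + 2^(n-1); folding to the single compare x == C/2 is only valid
when overflow is undefined (e.g. nsw arithmetic).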

//===---------------------------------------------------------------------===//

PR 3381: a GEP to a field of size 0 inside a struct could be turned into a
GEP for the next field in the struct (which is at the same address).

For example, a store of float into { {{}}, float } could be turned into a
store to the float directly.

//===---------------------------------------------------------------------===//

#include <math.h>
double foo(double a) { return sin(a); }

This compiles into this on x86-64 Linux:

foo:
  subq $8, %rsp
  call sin
  addq $8, %rsp
  ret

vs. the tail-call form:

foo:
  jmp sin

//===---------------------------------------------------------------------===//

The argument promotion pass should make use of nocapture to make its alias
analysis much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int int_char(char m) {if(m>7) return 0; return m;}
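
In source terms both are just conditional expressions, which map directly
onto a select:

char char_int(int m) { return m > 7 ? 0 : m; }
int int_char(char m) { return m > 7 ? 0 : m; }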

//===---------------------------------------------------------------------===//

int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }

Generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %a, 128                            ; <i32> [#uses=1]
  %1 = icmp eq i32 %0, 0                          ; <i1> [#uses=1]
  %2 = or i32 %b, 128                             ; <i32> [#uses=1]
  %3 = and i32 %b, -129                           ; <i32> [#uses=1]
  %b_addr.0 = select i1 %1, i32 %3, i32 %2        ; <i32> [#uses=1]
  ret i32 %b_addr.0
}

However, it's functionally equivalent to:

  b = (b & ~0x80) | (a & 0x80);

Which generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %b, -129                           ; <i32> [#uses=1]
  %1 = and i32 %a, 128                            ; <i32> [#uses=1]
  %2 = or i32 %0, %1                              ; <i32> [#uses=1]
  ret i32 %2
}

This can be generalized for other forms:

  b = (b & ~0x80) | (a & 0x40) << 1;
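
That is, the branchy source form "if (a & 0x40) b |= 0x80; else b &= ~0x80;"
should reduce to the shifted-mask expression above.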

//===---------------------------------------------------------------------===//

These two functions produce different code. They shouldn't:

#include <stdint.h>

uint8_t p1(uint8_t b, uint8_t a) {
  b = (b & ~0xc0) | (a & 0xc0);
  return (b);
}

uint8_t p2(uint8_t b, uint8_t a) {
  b = (b & ~0x40) | (a & 0x40);
  b = (b & ~0x80) | (a & 0x80);
  return (b);
}

define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63                              ; <i8> [#uses=1]
  %1 = and i8 %a, -64                             ; <i8> [#uses=1]
  %2 = or i8 %1, %0                               ; <i8> [#uses=1]
  ret i8 %2
}

define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63                              ; <i8> [#uses=1]
  %.masked = and i8 %a, 64                        ; <i8> [#uses=1]
  %1 = and i8 %a, -128                            ; <i8> [#uses=1]
  %2 = or i8 %1, %0                               ; <i8> [#uses=1]
  %3 = or i8 %2, %.masked                         ; <i8> [#uses=1]
  ret i8 %3
}
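
Merging the two ors in p2 (the masks 64 and -128 are disjoint, and together
cover -64) would reduce it to exactly p1's form.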

//===---------------------------------------------------------------------===//

IPSCCP does not currently propagate argument-dependent constants through
functions where it does not know all of the callers. This includes functions
with normal external linkage, as well as templates, C99 inline functions, etc.
Specifically, it does nothing to:

define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
entry:
  %0 = add nsw i32 %y, %z
  %1 = mul i32 %0, %x
  %2 = mul i32 %y, %z
  %3 = add nsw i32 %1, %2
  ret i32 %3
}

define i32 @test2() nounwind {
entry:
  %0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
  ret i32 %0
}

It would be interesting to extend IPSCCP to handle simple cases like this,
where all of the arguments to a call are constant. Because IPSCCP runs before
inlining, trivial templates and inline functions are not yet inlined. The
results for a function + set of constant arguments should be memoized in a
map.
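
With the constants propagated, @test2 folds to ret i32 14, since
(2+4)*1 + 2*4 = 6 + 8 = 14.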

//===---------------------------------------------------------------------===//

The libcall constant folding stuff should be moved out of SimplifyLibcalls
into libanalysis' constant folding logic. This would allow IPSCCP to handle
simple things like this:

static int foo(const char *X) { return strlen(X); }
int bar() { return foo("abcd"); }
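
Folding the strlen of the constant string to 4 and propagating the result
through foo would reduce bar to "return 4;".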

//===---------------------------------------------------------------------===//