Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)  This
isn't safe in general, even on darwin.  See the libm implementation of hypot
for examples (which special case when x/y are exactly zero to get signed zeros
etc right).

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack. But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
   x = 1ULL << i;
}

into:
 long long tmp = 1;
 for (i = ...; ++i, tmp+=tmp)
   x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn: X*X*X*X -> t=(X*X) (t*t) to eliminate a multiply.

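In C terms, a hand-reassociated sketch of the same transform:

int pow4(int x) { return x*x*x*x; }               /* 3 multiplies */
int pow4_fast(int x) { int t = x*x; return t*t; } /* 2 multiplies */
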
//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

Reassociate should handle the example in GCC PR16157.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j,int *l) {  return memcmp(j,l,4);  }
int h(int *j, int *l) {  return *j - *l; }

this could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works but can be overly conservative as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should add 'unaligned load/store' nodes, and produce them from code like
this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2      #cond_next
LBB16_1:        #cond_true
        incl _foo
LBB16_2:        #cond_next

emit:
        movl    _foo, %eax
        cmpl    $1, %edi
        sbbl    $-1, %eax
        movl    %eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs.  We could even make an intrinsic for this
if anyone cared enough about sincos.

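A source-level sketch of the combine (sincos here is the GNU-style extension
declared above; its availability is target/libm dependent):

#include <math.h>
void both(double x, double *a, double *b) {
  *a = sin(x);
  *b = cos(x);
  /* -> sincos(x, a, b);  one libcall instead of two */
}
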
//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

void %test(uint* %P) {
        %tmp = load uint* %P
        %tmp14 = or uint %tmp, 3305111552
        %tmp15 = and uint %tmp14, 3321888767
        store uint %tmp15, uint* %P
        ret void
}

//===---------------------------------------------------------------------===//

dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.

Compile:

int bar(int x)
{
  int t = __builtin_clz(x);
  return -(t>>5);
}

to:

_bar:   addic r3,r3,-1
        subfe r3,r3,r3
        blr

//===---------------------------------------------------------------------===//

Legalize should lower cttz like this:
  cttz(x) = popcnt((x-1) & ~x)

on targets that have popcnt but not cttz.  Itanium, what else?

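A C sketch of the identity ((x-1) & ~x sets exactly the trailing-zero bit
positions; __builtin_popcount stands in for the POPCNT node):

unsigned cttz32(unsigned x) {   /* assumes x != 0 */
  return __builtin_popcount((x - 1) & ~x);
}
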
//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
        {
          /* Flip the target bit of each basis state */
          reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }

Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be just
so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias reg->node[i],
but alas...

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

typedef unsigned long long BITBOARD;
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

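A concrete instance of the missing fold (unsigned operands, so the udiv case):

int f(unsigned x) { return x / 10 > 5; }   /* should fold to: return x > 59; */
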
//===---------------------------------------------------------------------===//

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level.  We need a "loops to memcpy"
pass.

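The minimal shape of a loop such a pass would target (names hypothetical; the
rewrite shown in the comment is the intent, not an existing pass):

void copy_history(int *dst, const int *src, int n) {
  for (int i = 0; i < n; i++)      /* unit stride, no overlap */
    dst[i] = src[i];
  /* -> memcpy(dst, src, n * sizeof(int)); */
}
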
//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

//===---------------------------------------------------------------------===//

Promote for i32 bswap can use i64 bswap + shr.  Useful on targets with 64-bit
regs and bswap, like itanium.

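In C, assuming a compiler that provides __builtin_bswap64 as a stand-in for
the i64 BSWAP node:

unsigned bswap32_via_64(unsigned x) {
  return (unsigned)(__builtin_bswap64(x) >> 32);
}
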
//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two identical IV's (after promotion) on PPC/ARM:

LBB1_1: @bb.preheader
        mov r3, #0
        mov r2, r3
        mov r1, r3
LBB1_2: @bb
        ldr r12, LCPI1_0
        ldr r12, [r12]
        strh r2, [r12]
        ldr r12, LCPI1_1
        ldr r12, [r12]
        strh r3, [r12]
        add r1, r1, #1  <- [0,+,1]
        add r3, r3, #4
        add r2, r2, #1  <- [0,+,1]
        cmp r1, r0
        bne LBB1_2      @bb

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
        %tmp.1 = and i32 %a, 1          ; <i32> [#uses=1]
        %tmp.2 = icmp ne i32 %tmp.1, 0          ; <i1> [#uses=1]
        br i1 %tmp.2, label %then.0, label %else.0

then.0:         ; preds = %entry
        %tmp.5 = add i32 %a, -1         ; <i32> [#uses=1]
        %tmp.3 = call i32 @t4( i32 %tmp.5 )             ; <i32> [#uses=1]
        br label %return

else.0:         ; preds = %entry
        %tmp.7 = icmp ne i32 %a, 0              ; <i1> [#uses=1]
        br i1 %tmp.7, label %then.1, label %return

then.1:         ; preds = %else.0
        %tmp.11 = add i32 %a, -2                ; <i32> [#uses=1]
        %tmp.9 = call i32 @t4( i32 %tmp.11 )            ; <i32> [#uses=1]
        br label %return

return:         ; preds = %then.1, %else.0, %then.0
        %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                            [ %tmp.9, %then.1 ]
        ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination is not transforming this function, because it is
returning n, which fails the isDynamicConstant check in the accumulator
recursion checks.

long long fib(const long long n) {
  switch(n) {
  case 0:
  case 1:
    return n;
  default:
    return fib(n-1) + fib(n-2);
  }
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
 if (n == 0)
   return 0;
 return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative.  "return foo() << 1" can be tail recursion eliminated.

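For pow2m1 the accumulator form TRE should produce looks roughly like this
hand-written sketch (assuming n >= 0, as the recursion does):

int pow2m1_loop(int n) {
  int acc = 0;
  for (; n != 0; --n)
    acc = 2 * acc + 1;   /* builds 2^n - 1 */
  return acc;
}
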
//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
        %tmp = load i32* %x             ; <i32> [#uses=0]
        %tmp.foo = call i32 @foo( i32* %x )             ; <i32> [#uses=1]
        ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
        %tmp3 = call i32 @foo( i32* %x )                ; <i32> [#uses=1]
        ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

"basicaa" should know how to look through "or" instructions that act like add
instructions.  For example in this code, the x*4+1 is turned into x*4 | 1, and
basicaa can't analyze the array subscript, leading to duplicated loads in the
generated code:

void test(int X, int Y, int a[]) {
  int i;
  for (i=2; i<1000; i+=4) {
    a[i+0] = a[i-1+0]*a[i-2+0];
    a[i+1] = a[i-1+1]*a[i-2+1];
    a[i+2] = a[i-1+2]*a[i-2+2];
    a[i+3] = a[i-1+3]*a[i-2+3];
  }
}

BasicAA also doesn't do this for add.  It needs to know that &A[i+1] != &A[i].

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
        subl    $28, %esp
        call    "L1$pb"
"L1$pb":
        popl    %eax
        cmpl    $0, 32(%esp)
        je      LBB1_2  # cond_true
LBB1_1: # return
        # ...
        addl    $28, %esp
        ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

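One possible shape of the lowered code (cases and values invented for
illustration; a real implementation would search for a cheap collision-free
hash -- here x % 7 happens to be one for these keys):

int dispatch(unsigned x) {
  /* switch (x) { case 5: return 10; case 1000: return 20;
                  case 123456: return 30; default: return -1; } */
  static const unsigned key[7] = { 0, 0, 0, 0, 123456, 5, 1000 };
  static const int      val[7] = { -1, -1, -1, -1, 30, 10, 20 };
  unsigned h = x % 7;                /* collision-free over the case keys */
  return key[h] == x ? val[h] : -1;  /* verify key, else default */
}
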
//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations.  On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar.  On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable.  For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:
        pushl   %esi
        movl    8(%esp), %eax
        movb    _m_HotKey+3, %cl
        movb    _m_HotKey+4, %dl
        movb    _m_HotKey+2, %ch
        movw    _m_HotKey, %si
        movw    %si, (%eax)
        movb    %ch, 2(%eax)
        movb    %cl, 3(%eax)
        movb    %dl, 4(%eax)
        popl    %esi
        ret     $4

GCC produces:

__Z9GetHotKeyv:
        movl    _m_HotKey, %edx
        movl    4(%esp), %eax
        movl    %edx, (%eax)
        movzwl  _m_HotKey+4, %edx
        movw    %dx, 4(%eax)
        ret     $4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

        %struct.THotKey = type { i16, i8, i8, i8 }
define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
...
        %tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
        %tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
        %tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
        %tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

This GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34043
contains a testcase that compiles down to:

        %struct.XMM128 = type { <4 x float> }
..
        %src = alloca %struct.XMM128
..
        %tmp6263 = bitcast %struct.XMM128* %src to <2 x i64>*
        %tmp65 = getelementptr %struct.XMM128* %src, i32 0, i32 0
        store <2 x i64> %tmp5899, <2 x i64>* %tmp6263, align 16
        %tmp66 = load <4 x float>* %tmp65, align 16
        %tmp71 = add <4 x float> %tmp66, %tmp66

If the mid-level optimizer turned the bitcast of pointer + store of tmp5899
into a bitcast of the vector value and a store to the pointer, then the
store->load could be easily removed.

//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,1,1,1,1,1,1,1};
  foo(input);
}

We currently compile this into a memcpy from a global array since the
initializer is fairly large and not memset'able.  This is good, but the memcpy
gets lowered to load/stores in the code generator.  This is also ok, except
that the codegen lowering for memcpy doesn't handle the case when the source
is a constant global.  This gives us atrocious code like this:

        call    "L1$pb"
"L1$pb":
        popl    %eax
        movl    _C.0.1444-"L1$pb"+32(%eax), %ecx
        movl    %ecx, 40(%esp)
        movl    _C.0.1444-"L1$pb"+20(%eax), %ecx
        movl    %ecx, 28(%esp)
        movl    _C.0.1444-"L1$pb"+36(%eax), %ecx
        movl    %ecx, 44(%esp)
        movl    _C.0.1444-"L1$pb"+44(%eax), %ecx
        movl    %ecx, 52(%esp)
        movl    _C.0.1444-"L1$pb"+40(%eax), %ecx
        movl    %ecx, 48(%esp)
        movl    _C.0.1444-"L1$pb"+12(%eax), %ecx
        movl    %ecx, 20(%esp)
        movl    _C.0.1444-"L1$pb"+4(%eax), %ecx
...

instead of:
        movl    $1, 16(%esp)
        movl    $0, 20(%esp)
        movl    $1, 24(%esp)
        movl    $0, 28(%esp)
        movl    $1, 32(%esp)
        movl    $0, 36(%esp)
        ...

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef". Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop.  One trivial example is:

#include <stdio.h>
int main() {
    int nRet = 17;
    int nLoop;
    for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
        if ( nLoop & 1 )
            nRet += 2;
        else
            nRet -= 1;
    }
    return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size.  The resultant code would then also be suitable for
exit value computation.

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc, x86,
etc.  On X86, we miss a bunch of 'rotate by variable' cases because the rotate
matching code in dag combine doesn't look through truncates aggressively
enough.  Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y){
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x){
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x){
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f2/f3/f4 right.  On x86-32, a few of these
generate truly horrible code, instead of using shld and friends.  On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness.  PPC64 misses f, f5 and f6.  CellSPU aborts in isel.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together.  For
example, it is useful to merge memcpy(a,b,strlen(b)+1) -> strcpy(a,b).  This
can only be done safely if "b" isn't modified between the strlen and memcpy of
course.

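A source-level sketch of the merged form (valid only when nothing writes b
in between):

#include <string.h>
void f(char *a, const char *b) {
  memcpy(a, b, strlen(b) + 1);   /* copies the terminating NUL too */
  /* -> strcpy(a, b);  one pass over the string instead of two */
}
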
//===---------------------------------------------------------------------===//

We should be able to evaluate this loop:

int test(int x_offs) {
        while (x_offs > 4)
           x_offs -= 4;
        return x_offs;
}

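The closed form SCEV should find, hand-derived (the loop only runs while
x_offs > 4, and leaves the value in [1,4] with the same residue mod 4):

int test_closed(int x_offs) {
  return x_offs > 4 ? (x_offs - 1) % 4 + 1 : x_offs;
}
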
//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
 return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

//===---------------------------------------------------------------------===//

We generate a horrible libcall for llvm.powi.  For example, we compile:

#include <cmath>
double f(double a) { return std::pow(a, 4); }

into:

__Z1fd:
        subl    $12, %esp
        movsd   16(%esp), %xmm0
        movsd   %xmm0, (%esp)
        movl    $4, 8(%esp)
        call    L___powidf2$stub
        addl    $12, %esp
        ret

GCC produces:

__Z1fd:
        subl    $12, %esp
        movsd   16(%esp), %xmm0
        mulsd   %xmm0, %xmm0
        mulsd   %xmm0, %xmm0
        movsd   %xmm0, (%esp)
        fldl    (%esp)
        addl    $12, %esp
        ret

//===---------------------------------------------------------------------===//

We compile this program: (from GCC PR11680)
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

Into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%    0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%    0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2).  */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper).  Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

I think this basically amounts to a dag combine to simplify comparisons against
multiply hi's into a comparison against the mullo.

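The mullo form for the n % 3 == 0 case above (the Granlund/Montgomery trick:
0xAAAAAAAB is the inverse of 3 mod 2^32, since 3 * 0xAAAAAAAB == 2^33 + 1,
and 0x55555555 is (2^32 - 1) / 3):

int divisible_by_3(unsigned n) {
  return n * 0xAAAAAAABu <= 0x55555555u;   /* multiply-low + compare only */
}
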
//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

Instcombine will merge comparisons like (x >= 10) && (x < 20) by producing (x -
10) u< 10, but only when the comparisons have matching sign.

This could be converted with a similar technique. (PR1941)

define i1 @test(i8 %x) {
  %A = icmp uge i8 %x, 5
  %B = icmp slt i8 %x, 20
  %C = and i1 %A, %B
  ret i1 %C
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6       ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0.  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK    (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
   if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
       f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 15241:
unsigned int
foo (unsigned int a, unsigned int b)
{
 if (a <= 7 && b <= 7)
   baz ();
}
Should combine to "(a|b) <= 7".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 3756:
int
pn (int n)
{
    return (n >= 0 ? 1 : -1);
}
Should combine to (n >> 31) | 1.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

From GCC Bug 28685:
int test(int a, int b)
{
  int lt = a < b;
  int eq = a == b;

  return (lt || eq);
}
Should combine to "a <= b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

void a(int variable)
{
    if (variable == 4 || variable == 6)
        bar();
}
This should optimize to "if ((variable | 2) == 6)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing.  Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
 return (abs(x)) >= 0;
}
This should optimize to x != INT_MIN. (With -fwrapv.)  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
 a = (a << 10) | (a >> 22);
 if (a == 123)
   bar ();
}
void
minus_cst (unsigned int a)
{
 unsigned int tem;

 tem = 20 - a;
 if (tem == 5)
   bar ();
}
void
mask_gt (unsigned int a)
{
 /* This is equivalent to a > 15.  */
 if ((a & ~7) > 8)
   bar ();
}
void
rshift_gt (unsigned int a)
{
 /* This is equivalent to a > 23.  */
 if ((a >> 2) > 5)
   bar ();
}
All should simplify to a single comparison.  All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned char* b) {return *b > 99;}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return ((a | 1) & 3) | (a & -4);}
Should combine to "a | 1".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)".  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a". Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane.  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c).  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -1 : -9;}
Should combine to (x | -9) ^ 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -9 : -1;}
Should combine to x | -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

We would like to do the following transform in the instcombiner:

  -X/C -> X/-C

However, this isn't valid if (-X) overflows: for X = INT_MIN, -X wraps back to
INT_MIN, so -X/C and X/-C produce results of opposite sign.  We can implement
this when we have the concept of a "C signed subtraction" operator which is
undefined on overflow.

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

        %tmp = icmp eq i32 %decl_context, 4
        %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
        %tmp1 = icmp eq i32 %decl_context_addr.0, 1
        %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp || decl_context == 1)

This allows recursive simplifications, tmp1 is used all over the place in
the function, e.g. by:

        %tmp23 = icmp eq i32 %decl_context_addr.1, 0            ; <i1> [#uses=1]
        %tmp24 = xor i1 %tmp1, true             ; <i1> [#uses=1]
        %or.cond8 = and i1 %tmp23, %tmp24               ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

Store sinking: This code:

void f (int n, int *cond, int *res) {
    int i;
    *res = 0;
    for (i = 0; i < n; i++)
        if (*cond)
            *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out.  This gives us this code:

bb:             ; preds = %bb2, %entry
        %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
        %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
        %1 = load i32* %cond, align 4
        %2 = icmp eq i32 %1, 0
        br i1 %2, label %bb2, label %bb1

bb1:            ; preds = %bb
        %3 = xor i32 %.rle, 234
        store i32 %3, i32* %res, align 4
        br label %bb2

bb2:            ; preds = %bb, %bb1
        %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
        %indvar.next = add i32 %i.05, 1
        %exitcond = icmp eq i32 %indvar.next, %n
        br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partial dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.

//===---------------------------------------------------------------------===//

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
}
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store.  We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[PHI TRANSLATE GEPs]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic. This could be handled by GVN with some crazy
symbolic phi translation.  The code we get looks like (g is on the stack):

bb2:            ; preds = %bb1
..
        %9 = getelementptr %struct.f* %g, i32 0, i32 0
        store i32 %8, i32* %9, align 4
        br label %bb3

bb3:            ; preds = %bb1, %bb2, %bb
        %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
        %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
        %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
        %11 = load i32* %10, align 4

%11 is fully redundant, and in BB2 it should have the value %8.

GCC PR33344 is a similar case.

//===---------------------------------------------------------------------===//

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite, as well as many PRE testcases named ssa-pre-*.c.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite.  For example, predcom-1.c is:

 for (i = 2; i < 1000; i++)
    fib[i] = (fib[i-1] + fib[i - 2]) & 0xffff;

which compiles into:

bb1:            ; preds = %bb1, %bb1.thread
        %indvar = phi i32 [ 0, %bb1.thread ], [ %0, %bb1 ]
        %i.0.reg2mem.0 = add i32 %indvar, 2
        %0 = add i32 %indvar, 1         ; <i32> [#uses=3]
        %1 = getelementptr [1000 x i32]* @fib, i32 0, i32 %0
        %2 = load i32* %1, align 4              ; <i32> [#uses=1]
        %3 = getelementptr [1000 x i32]* @fib, i32 0, i32 %indvar
        %4 = load i32* %3, align 4              ; <i32> [#uses=1]
        %5 = add i32 %4, %2             ; <i32> [#uses=1]
        %6 = and i32 %5, 65535          ; <i32> [#uses=1]
        %7 = getelementptr [1000 x i32]* @fib, i32 0, i32 %i.0.reg2mem.0
        store i32 %6, i32* %7, align 4
        %exitcond = icmp eq i32 %0, 998         ; <i1> [#uses=1]
        br i1 %exitcond, label %return, label %bb1

This is basically:
  LOAD fib[i+1]
  LOAD fib[i]
  STORE fib[i+2]

instead of handling this as a loop or other xform, all we'd need to do is teach
load PRE to phi translate the %0 add (i+1) into the predecessor as (i'+1+1) =
(i'+2) (where i' is the previous iteration of i).  This would find the store
which feeds it.

predcom-2.c is apparently the same as predcom-1.c
predcom-3.c is very similar but needs loads feeding each other instead of
store->load.
predcom-4.c seems the same as the rest.

//===---------------------------------------------------------------------===//

Other simple load PRE cases:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35287 [LPRE crit edge splitting]

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34677 (licm does this, LPRE crit edge)
  llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as | opt -mem2reg -simplifycfg -gvn | llvm-dis

//===---------------------------------------------------------------------===//

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

//===---------------------------------------------------------------------===//

When GVN/PRE finds a store of float* to a must aliases pointer when expecting
an int*, it should turn it into a bitcast.  This is a nice generalization of
the SROA hack that would apply to other cases, e.g.:

int foo(int C, int *P, float X) {
  if (C) {
    bar();
    *P = 42;
  } else
    *(float*)P = X;

   return *P;
}

One example (that requires crazy phi translation) is:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16799 [BITCAST PHI TRANS]

//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store.  This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
      opt -mem2reg -gvn -instcombine | llvm-dis
we miss it because we need 1) GEP PHI TRAN, 2) CRIT EDGE 3) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, loading from null is undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "") -> strlen(x)
strcspn("", x) -> 0
strspn("", x) -> 0
strspn(x, "") -> strlen(x)
strspn(x, "a") -> strchr(x, 'a')-x

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character.  See PR3253 for some notes on
codegen.

456.hmmer apparently uses strcspn and strspn a lot.  471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
  else if (strchr ("<>", *intel_parser.op_string)

Those should be turned into a switch.

//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

        %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)    ; uses = 1
        %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
        call void @llvm.memcpy.i32(i8* %endptr,
          i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
        %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple reasons.  First, in this:

        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)

The strlen could be replaced with: %strlen = sub %3072, %3073, because the
strcpy call returns a pointer to the end of the string.  Based on that, the
endptr GEP just becomes equal to 3073, which eliminates a strlen call and GEP.

Second, the memcpy+strlen strlen can be replaced with:

        %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

Because the destination was just copied into the specified memory buffer.  This,
in turn, can be constant folded to "4".

In other code, it contains:

        %endptr6978 = bitcast i8* %endptr69 to i32*
        store i32 7107374, i32* %endptr6978, align 1
        %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded.  Whatever is producing this should probably
be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit
        %682 = getelementptr i8** %argv, i32 6          ; <i8**> [#uses=2]
        %683 = load i8** %682, align 4          ; <i8*> [#uses=4]
        %684 = load i8* %683, align 1           ; <i8> [#uses=1]
        %685 = icmp eq i8 %684, 0               ; <i1> [#uses=1]
        br i1 %685, label %bb10, label %bb9

bb9:            ; preds = %bb8
        %686 = call i32 @strlen(i8* %683) nounwind readonly
        %687 = icmp ugt i32 %686, 254           ; <i1> [#uses=1]
        br i1 %687, label %bb10, label %bb11

bb10:           ; preds = %bb9, %bb8
        %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

//===---------------------------------------------------------------------===//

I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
which looks like:
        %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0


bb62:           ; preds = %bb55, %bb53
        %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
        %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
        %172 = add i32 %171, -1         ; <i32> [#uses=1]
        %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172

...  no stores ...
       br i1 %or.cond, label %bb65, label %bb72

bb65:           ; preds = %bb62
        store i8 0, i8* %173, align 1
        br label %bb72

bb72:           ; preds = %bb65, %bb62
        %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
        %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1

Note that on the bb62->bb72 path, that the %177 strlen call is partially
redundant with the %171 call.  At worst, we could shove the %177 strlen call
up into the bb65 block moving it out of the bb62->bb72 path.   However, note
that bb65 stores to the string, zeroing out the last byte.  This means that on
that path the value of %177 is actually just %171-1.  A sub is cheaper than a
strlen!

This pattern repeats several times, basically doing:

  A = strlen(P);
  P[A-1] = 0;
  B = strlen(P);
  where it is "obvious" that B = A-1.

//===---------------------------------------------------------------------===//

186.crafty contains this interesting pattern:

%77 = call i8* @strstr(i8* getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0),
                       i8* %30)
%phitmp648 = icmp eq i8* %77, getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0)
br i1 %phitmp648, label %bb70, label %bb76

bb70:           ; preds = %OptionMatch.exit91, %bb69
  %78 = call i32 @strlen(i8* %30) nounwind readonly align 1     ; <i32> [#uses=1]

This is basically:
  cststr = "abcdef";
  if (strstr(cststr, P) == cststr) {
     x = strlen(P);
     ...

The strstr call would be significantly cheaper written as:

cststr = "abcdef";
if (memcmp(P, str, strlen(P)))
  x = strlen(P);

This is memcmp+strlen instead of strstr.  This also makes the strlen fully
redundant.

//===---------------------------------------------------------------------===//

186.crafty also contains this code:

%1906 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1907 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1906
%1908 = call i8* @strcpy(i8* %1907, i8* %1905) nounwind align 1
%1909 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1910 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1909

The last strlen is computable as 1908-@pgn_event, which means 1910=1908.

//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
        i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
        i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call @printf(i8* ...   @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

  memcpy(globalarray, "string");
  printf(...,  globalarray);

Anyway, by knowing that printf just reads the memory and forward substituting
the string directly into the printf, this eliminates reads from globalarray.
Since this pattern occurs frequently in crafty (due to the "DisplayTime" and
other similar functions) there are many stores to "out".  Once all the printfs
stop using "out", all that is left is the memcpy's into it.  This should allow
globalopt to remove the "stored only" global.

//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
  %tmp0 = load i8* %p
  %tmp1 = ashr i8 %tmp0, 5
  %tmp2 = sext i8 %tmp1 to i32
  ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load with a shift.
For example, on x86 this currently gets this:

        movb    (%eax), %al
        sarb    $5, %al
        movsbl  %al, %eax

while it could get this:

        movsbl  (%eax), %eax
        sarl    $5, %eax

//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; }     // --> return false
int test2(int x) { return 2-x == x; }    // --> return x == 1 ?

Always foldable for odd constants, what is the rule for even?

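The rule, reasoning in wrapping two's-complement (what the IR icmp sees, even
though signed overflow is undefined in C itself): C-x == x is 2*x == C
(mod 2^32); the left side is always even, so odd C folds to false, and even C
has exactly two solutions, x == C/2 and x == C/2 + 2^31.  A brute-force check
of that claim at i8 width:

#include <stdio.h>
int main(void) {
  for (int C = 0; C < 256; C++)
    for (int x = 0; x < 256; x++)
      if (((C - x) & 255) == x &&
          ((C & 1) || (x != C / 2 && x != C / 2 + 128)))
        printf("counterexample: C=%d x=%d\n", C, x);  /* never fires */
  return 0;
}
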
//===---------------------------------------------------------------------===//

PR 3381: GEP to field of size 0 inside a struct could be turned into GEP
for next field in struct (which is at same address).

For example: store of float into { {{}}, float } could be turned into a store
to the float directly.

//===---------------------------------------------------------------------===//

#include <math.h>
double foo(double a) {    return sin(a); }

This compiles into this on x86-64 Linux:
foo:
        subq    $8, %rsp
        call    sin
        addq    $8, %rsp
        ret
vs:

foo:
        jmp sin

//===---------------------------------------------------------------------===//

The arg promotion pass should make use of nocapture to make its alias analysis
stuff much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int int_char(char m) {if(m>7) return 0; return m;}

//===---------------------------------------------------------------------===//

Instcombine should replace the load with a constant in:

  static const char x[4] = {'a', 'b', 'c', 'd'};

  unsigned int y(void) {
    return *(unsigned int *)x;
  }

It currently only does this transformation when the size of the constant
is the same as the size of the integer (so, try x[5]) and the last byte
is a null (making it a C string). There's no need for these restrictions.

//===---------------------------------------------------------------------===//

InstCombine's "turn load from constant into constant" optimization should be
more aggressive in the presence of bitcasts.  For example, because of unions,
this code:

union vec2d {
    double e[2];
    double v __attribute__((vector_size(16)));
};
typedef union vec2d vec2d;

static vec2d a={{1,2}}, b={{3,4}};

vec2d foo () {
    return (vec2d){ .v = a.v + b.v * (vec2d){{5,5}}.v };
}

Compiles into:

@a = internal constant %0 { [2 x double]
           [double 1.000000e+00, double 2.000000e+00] }, align 16
@b = internal constant %0 { [2 x double]
           [double 3.000000e+00, double 4.000000e+00] }, align 16
...
define void @foo(%struct.vec2d* noalias nocapture sret %agg.result) nounwind {
entry:
        %0 = load <2 x double>* getelementptr (%struct.vec2d*
           bitcast (%0* @a to %struct.vec2d*), i32 0, i32 0), align 16
        %1 = load <2 x double>* getelementptr (%struct.vec2d*
           bitcast (%0* @b to %struct.vec2d*), i32 0, i32 0), align 16

Instcombine should be able to optimize away the loads (and thus the globals).

//===---------------------------------------------------------------------===//