Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent.

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)  This
isn't safe in general, even on darwin.  See the libm implementation of hypot
for examples (which special case when x/y are exactly zero to get signed zeros
etc right).

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack.  But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
  x = 1ULL << i;
}

into:

  long long tmp = 1;
  for (i = ...; ++i, tmp += tmp)
    x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0), where Phi is the
byte of P that holds the sign bit.

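A sketch of the same shrink in C (illustrative only: testing the sign of a
32-bit load only needs the byte holding the sign bit, which is byte 3 on a
little-endian target and byte 0 on a big-endian one):

int is_negative(int *P) {
  return *P < 0;                    /* today: a 4-byte load */
}
int is_negative_shrunk(int *P) {
  /* little-endian variant: load only the most significant byte */
  return ((signed char *)P)[3] < 0;
}
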
//===---------------------------------------------------------------------===//

Reassociate should turn: X*X*X*X -> t=(X*X) (t*t) to eliminate a multiply.

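In C, the reassociated form (a sketch; two multiplies instead of three):

int pow4(int x) {
  int t = x * x;   /* t = X*X */
  return t * t;    /* (X*X)*(X*X) == X*X*X*X */
}
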
//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

Reassociate should handle the example in GCC PR16157.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) { return memcmp(j, l, 4); }
int h(int *j, int *l) { return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works but can be overly conservative as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3]};
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2      # cond_next
LBB16_1:        # cond_true
        incl _foo
LBB16_2:        # cond_next

emit:
        movl _foo, %eax
        cmpl $1, %edi
        sbbl $-1, %eax
        movl %eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs.  We could even make an intrinsic for this
if anyone cared enough about sincos.

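A sketch of the combine at the source level (sincos here is the GNU libc
extension, so its availability is an assumption):

#define _GNU_SOURCE
#include <math.h>

void before(double x, double *a, double *b) {
  *a = sin(x);       /* two libcalls over the same argument */
  *b = cos(x);
}
void after(double x, double *a, double *b) {
  sincos(x, a, b);   /* one call produces both results */
}
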
//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

define void @test(i32* %P) {
  %tmp = load i32* %P
  %tmp14 = or i32 %tmp, 3305111552
  %tmp15 = and i32 %tmp14, 3321888767
  store i32 %tmp15, i32* %P
  ret void
}

//===---------------------------------------------------------------------===//

dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.

Compile:

int bar(int x)
{
  int t = __builtin_clz(x);
  return -(t>>5);
}

to:

_bar:   addic r3,r3,-1
        subfe r3,r3,r3
        blr

//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
        {
          /* Flip the target bit of each basis state */
          reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }

Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be
just so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias
reg->node[i], but alas.

//===---------------------------------------------------------------------===//

This should be optimized to one 'and' and one 'or', from PR4216:

define i32 @test_bitfield(i32 %bf.prev.low) nounwind ssp {
entry:
  %bf.prev.lo.cleared10 = or i32 %bf.prev.low, 32962      ; <i32> [#uses=1]
  %0 = and i32 %bf.prev.low, -65536                       ; <i32> [#uses=1]
  %1 = and i32 %bf.prev.lo.cleared10, 40186               ; <i32> [#uses=1]
  %2 = or i32 %1, %0                                      ; <i32> [#uses=1]
  ret i32 %2
}

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

BITBOARD = unsigned long long
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

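For intuition, the unsigned flavor of the transform in C (a sketch; the
compare of a quotient becomes a subtract-and-compare range check):

int quot(unsigned x)  { return x / 10 == 5; }   /* needs a divide */
int range(unsigned x) { return x - 50 < 10; }   /* same result: 50 <= x <= 59 */
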
//===---------------------------------------------------------------------===//

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level.  We need a "loops to memcpy"
pass.

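The loop shape such a pass would need to recognize, as a sketch:

void copy_history(int *dst, const int *src, int n) {
  int i;
  for (i = 0; i < n; i++)   /* unit stride, full-width, non-overlapping copy */
    dst[i] = src[i];
  /* => memcpy(dst, src, n * sizeof(int)) when dst/src provably don't alias */
}
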
//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

PHI Slicing could be extended to do this.

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two near-identical IVs (after promotion) on PPC/ARM:

LBB1_2:
        ldr r3, LCPI1_0
        ldr r3, [r3]
        strh r2, [r3]
        ldr r3, LCPI1_1
        ldr r3, [r3]
        strh r1, [r3]
        add r1, r1, #4
        add r2, r2, #1   <- [0,+,1]
        sub r0, r0, #1   <- [0,-,1]
        cmp r0, #0
        bne LBB1_2

LSR should reuse the "+" IV for the exit test.

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
        %tmp.1 = and i32 %a, 1          ; <i32> [#uses=1]
        %tmp.2 = icmp ne i32 %tmp.1, 0  ; <i1> [#uses=1]
        br i1 %tmp.2, label %then.0, label %else.0

then.0:         ; preds = %entry
        %tmp.5 = add i32 %a, -1         ; <i32> [#uses=1]
        %tmp.3 = call i32 @t4( i32 %tmp.5 )     ; <i32> [#uses=1]
        br label %return

else.0:         ; preds = %entry
        %tmp.7 = icmp ne i32 %a, 0      ; <i1> [#uses=1]
        br i1 %tmp.7, label %then.1, label %return

then.1:         ; preds = %else.0
        %tmp.11 = add i32 %a, -2        ; <i32> [#uses=1]
        %tmp.9 = call i32 @t4( i32 %tmp.11 )    ; <i32> [#uses=1]
        br label %return

return:         ; preds = %then.1, %else.0, %then.0
        %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                            [ %tmp.9, %then.1 ]
        ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative.  "return foo() << 1" can be tail recursion eliminated.

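For pow2m1, the accumulator form such a transform could produce (a sketch;
it assumes the usual disregard for signed overflow):

int pow2m1_iter(int n) {
  int acc = 0;
  while (n-- > 0)
    acc = 2 * acc + 1;   /* the multiply-accumulate, folded into a loop */
  return acc;            /* 2^n - 1 for the original n >= 0 */
}
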
//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
        %tmp = load i32* %x             ; <i32> [#uses=0]
        %tmp.foo = call i32 @foo( i32* %x )     ; <i32> [#uses=1]
        ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
        %tmp3 = call i32 @foo( i32* %x )        ; <i32> [#uses=1]
        ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

"basicaa" should know how to look through "or" instructions that act like add
instructions.  For example in this code, the x*4+1 is turned into x*4 | 1, and
basicaa can't analyze the array subscript, leading to duplicated loads in the
generated code:

void test(int X, int Y, int a[]) {
  int i;
  for (i=2; i<1000; i+=4) {
    a[i+0] = a[i-1+0]*a[i-2+0];
    a[i+1] = a[i-1+1]*a[i-2+1];
    a[i+2] = a[i-1+2]*a[i-2+2];
    a[i+3] = a[i-1+3]*a[i-2+3];
  }
}

BasicAA also doesn't do this for add.  It needs to know that &A[i+1] != &A[i].

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
        subl    $28, %esp
        call    "L1$pb"
"L1$pb":
        popl    %eax
        cmpl    $0, 32(%esp)
        je      LBB1_2  # cond_true
LBB1_1: # return
        # ...
        addl    $28, %esp
        ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

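A hand-built sketch of the idea in C (the keys and the trivially perfect
hash x % 7 are illustrative; a real lowering would search for a suitable
hash function):

/* switch (x) { case 10: ...; case 100: ...; case 1000: ...; case 10000: ...; } */
static const int key[7]    = { -1, -1, 100, 10, 10000, -1, 1000 };
static const int result[7] = {  0,  0,   2,  1,     4,  0,    3 };

int lookup(int x) {
  unsigned h = (unsigned)x % 7;          /* collision-free for these keys */
  return key[h] == x ? result[h] : 0;    /* 0 = the default case */
}
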
//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations.  On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar.  On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable.  For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:
        pushl   %esi
        movl    8(%esp), %eax
        movb    _m_HotKey+3, %cl
        movb    _m_HotKey+4, %dl
        movb    _m_HotKey+2, %ch
        movw    _m_HotKey, %si
        movw    %si, (%eax)
        movb    %ch, 2(%eax)
        movb    %cl, 3(%eax)
        movb    %dl, 4(%eax)
        popl    %esi
        ret     $4

GCC produces:

__Z9GetHotKeyv:
        movl    _m_HotKey, %edx
        movl    4(%esp), %eax
        movl    %edx, (%eax)
        movzwl  _m_HotKey+4, %edx
        movw    %dx, 4(%eax)
        ret     $4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

        %struct.THotKey = type { i16, i8, i8, i8 }
define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
...
        %tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
        %tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
        %tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
        %tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,1,1,1,1,1,1,1};
  foo(input);
}

We currently compile this into a memcpy from a global array since the
initializer is fairly large and not memset'able.  This is good, but the memcpy
gets lowered to load/stores in the code generator.  This is also ok, except
that the codegen lowering for memcpy doesn't handle the case when the source
is a constant global.  This gives us atrocious code like this:

        call    "L1$pb"
"L1$pb":
        popl    %eax
        movl    _C.0.1444-"L1$pb"+32(%eax), %ecx
        movl    %ecx, 40(%esp)
        movl    _C.0.1444-"L1$pb"+20(%eax), %ecx
        movl    %ecx, 28(%esp)
        movl    _C.0.1444-"L1$pb"+36(%eax), %ecx
        movl    %ecx, 44(%esp)
        movl    _C.0.1444-"L1$pb"+44(%eax), %ecx
        movl    %ecx, 52(%esp)
        movl    _C.0.1444-"L1$pb"+40(%eax), %ecx
        movl    %ecx, 48(%esp)
        movl    _C.0.1444-"L1$pb"+12(%eax), %ecx
        movl    %ecx, 20(%esp)
        movl    _C.0.1444-"L1$pb"+4(%eax), %ecx
...

instead of:
        movl    $1, 16(%esp)
        movl    $0, 20(%esp)
        movl    $1, 24(%esp)
        movl    $0, 28(%esp)
        movl    $1, 32(%esp)
        movl    $0, 36(%esp)
        ...

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef".  Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop.  One trivial example is:

#include <stdio.h>
int main() {
    int nRet = 17;
    int nLoop;
    for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
        if ( nLoop & 1 )
            nRet += 2;
        else
            nRet -= 1;
    }
    return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size.  The resultant code would then also be suitable for
exit value computation.

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc, x86,
etc.  On X86, we miss a bunch of 'rotate by variable' cases because the rotate
matching code in dag combine doesn't look through truncates aggressively
enough.  Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y){
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x){
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x){
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f2/f3/f4 right.  On x86-32, a few of these
generate truly horrible code, instead of using shld and friends.  On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness.  PPC64 misses f, f5 and f6.  CellSPU aborts in isel.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together.  For
example, it is useful to merge memcpy(a,b,strlen(b)+1) -> strcpy(a,b) (the +1
copies the nul terminator).  This can only be done safely if "b" isn't
modified between the strlen and memcpy of course.

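As a before/after sketch:

#include <string.h>
void before(char *a, const char *b) {
  memcpy(a, b, strlen(b) + 1);   /* walks b twice: measure, then copy */
}
void after(char *a, const char *b) {
  strcpy(a, b);                  /* one call; copies up to and including the nul */
}
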
//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

//===---------------------------------------------------------------------===//

We generate a horrible libcall for llvm.powi.  For example, we compile:

#include <cmath>
double f(double a) { return std::pow(a, 4); }

into:

__Z1fd:
        subl    $12, %esp
        movsd   16(%esp), %xmm0
        movsd   %xmm0, (%esp)
        movl    $4, 8(%esp)
        call    L___powidf2$stub
        addl    $12, %esp
        ret

GCC produces:

__Z1fd:
        subl    $12, %esp
        movsd   16(%esp), %xmm0
        mulsd   %xmm0, %xmm0
        mulsd   %xmm0, %xmm0
        movsd   %xmm0, (%esp)
        fldl    (%esp)
        addl    $12, %esp
        ret

//===---------------------------------------------------------------------===//

We compile this program: (from GCC PR11680)
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

Into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%    0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%    0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2). */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mullo (cheaper).  Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

I think this basically amounts to a dag combine to simplify comparisons
against multiply hi's into a comparison against the mullo.

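Concretely, the mullo form for n % 3 uses the multiplicative inverse of 3
modulo 2^32 (a sketch of the standard trick; cf. Hacker's Delight):

/* 3 * 0xAAAAAAAB == 1 (mod 2^32), so for n == 3k the product
   n * 0xAAAAAAAB recovers k, which can be at most 0x55555555. */
int divisible_by_3(unsigned n) {
  return n * 0xAAAAAAABu <= 0x55555555u;
}
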
//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

Instcombine will merge comparisons like (x >= 10) && (x < 20) by producing (x -
10) u< 10, but only when the comparisons have matching sign.

This could be converted with a similar technique. (PR1941)

define i1 @test(i8 %x) {
  %A = icmp uge i8 %x, 5
  %B = icmp slt i8 %x, 20
  %C = and i1 %A, %B
  ret i1 %C
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6       ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0.  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK    (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
   if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
      f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 15241:
unsigned int
foo (unsigned int a, unsigned int b)
{
 if (a <= 7 && b <= 7)
   baz ();
}
Should combine to "(a|b) <= 7".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 3756:
int
pn (int n)
{
    return (n >= 0 ? 1 : -1);
}
Should combine to (n >> 31) | 1.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

From GCC Bug 28685:
int test(int a, int b)
{
  int lt = a < b;
  int eq = a == b;

  return (lt || eq);
}
Should combine to "a <= b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

void a(int variable)
{
    if (variable == 4 || variable == 6)
        bar();
}
This should optimize to "if ((variable | 2) == 6)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing.  Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
 return (abs(x)) >= 0;
}
This should optimize to x != INT_MIN. (With -fwrapv.)  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
 a = (a << 10) | (a >> 22);
 if (a == 123)
   bar ();
}
void
minus_cst (unsigned int a)
{
 unsigned int tem;

 tem = 20 - a;
 if (tem == 5)
   bar ();
}
void
mask_gt (unsigned int a)
{
 /* This is equivalent to a > 15.  */
 if ((a & ~7) > 8)
   bar ();
}
void
rshift_gt (unsigned int a)
{
 /* This is equivalent to a > 23.  */
 if ((a >> 2) > 5)
   bar ();
}
All should simplify to a single comparison.  All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned char* b) {return *b > 99;}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return ((a | 1) & 3) | (a & -4);}
Should combine to "a | 1".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)".  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a".  Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane.  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c).  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -1 : -9;}
Should combine to (x | -9) ^ 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -9 : -1;}
Should combine to x | -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

        %tmp = icmp eq i32 %decl_context, 4
        %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
        %tmp1 = icmp eq i32 %decl_context_addr.0, 1
        %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp || decl_context == 1)

This allows recursive simplifications, tmp1 is used all over the place in
the function, e.g. by:

        %tmp23 = icmp eq i32 %decl_context_addr.1, 0    ; <i1> [#uses=1]
        %tmp24 = xor i1 %tmp1, true                     ; <i1> [#uses=1]
        %or.cond8 = and i1 %tmp23, %tmp24               ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

Store sinking: This code:

void f (int n, int *cond, int *res) {
    int i;
    *res = 0;
    for (i = 0; i < n; i++)
        if (*cond)
            *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out.  This gives us this code:

bb:             ; preds = %bb2, %entry
        %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
        %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
        %1 = load i32* %cond, align 4
        %2 = icmp eq i32 %1, 0
        br i1 %2, label %bb2, label %bb1

bb1:            ; preds = %bb
        %3 = xor i32 %.rle, 234
        store i32 %3, i32* %res, align 4
        br label %bb2

bb2:            ; preds = %bb, %bb1
        %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
        %indvar.next = add i32 %i.05, 1
        %exitcond = icmp eq i32 %indvar.next, %n
        br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partial dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.

//===---------------------------------------------------------------------===//

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store.  We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[PHI TRANSLATE GEPs]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic.  This could be handled by GVN with some crazy
symbolic phi translation.  The code we get looks like (g is on the stack):

bb2:            ; preds = %bb1
..
        %9 = getelementptr %struct.f* %g, i32 0, i32 0
        store i32 %8, i32* %9, align 4
        br label %bb3

bb3:            ; preds = %bb1, %bb2, %bb
        %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
        %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
        %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
        %11 = load i32* %10, align 4

%11 is fully redundant, and in BB2 it should have the value %8.

GCC PR33344 is a similar case.

//===---------------------------------------------------------------------===//

[PHI TRANSLATE INDEXED GEPs]  PR5313

Load redundancy elimination for simple loop.  This loop:

void append_text(const char* text, unsigned char * const io) {
  while (*text)
    *io = *text++;
}

Compiles to have a fully redundant load in the loop (%2):

define void @append_text(i8* nocapture %text, i8* nocapture %io) nounwind {
entry:
  %0 = load i8* %text, align 1                    ; <i8> [#uses=1]
  %1 = icmp eq i8 %0, 0                           ; <i1> [#uses=1]
  br i1 %1, label %return, label %bb

bb:                                               ; preds = %bb, %entry
  %indvar = phi i32 [ 0, %entry ], [ %tmp, %bb ]  ; <i32> [#uses=2]
  %text_addr.04 = getelementptr i8* %text, i32 %indvar ; <i8*> [#uses=1]
  %2 = load i8* %text_addr.04, align 1            ; <i8> [#uses=1]
  store i8 %2, i8* %io, align 1
  %tmp = add i32 %indvar, 1                       ; <i32> [#uses=2]
  %scevgep = getelementptr i8* %text, i32 %tmp    ; <i8*> [#uses=1]
  %3 = load i8* %scevgep, align 1                 ; <i8> [#uses=1]
  %4 = icmp eq i8 %3, 0                           ; <i1> [#uses=1]
  br i1 %4, label %return, label %bb

return:                                           ; preds = %bb, %entry
  ret void
}

//===---------------------------------------------------------------------===//

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite, as well as many PRE testcases named ssa-pre-*.c.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite.  For example, predcom-1.c is:

 for (i = 2; i < 1000; i++)
    fib[i] = (fib[i-1] + fib[i - 2]) & 0xffff;

which compiles into:

bb1:            ; preds = %bb1, %bb1.thread
        %indvar = phi i32 [ 0, %bb1.thread ], [ %0, %bb1 ]
        %i.0.reg2mem.0 = add i32 %indvar, 2
        %0 = add i32 %indvar, 1         ; <i32> [#uses=3]
        %1 = getelementptr [1000 x i32]* @fib, i32 0, i32 %0
        %2 = load i32* %1, align 4      ; <i32> [#uses=1]
        %3 = getelementptr [1000 x i32]* @fib, i32 0, i32 %indvar
        %4 = load i32* %3, align 4      ; <i32> [#uses=1]
        %5 = add i32 %4, %2             ; <i32> [#uses=1]
        %6 = and i32 %5, 65535          ; <i32> [#uses=1]
        %7 = getelementptr [1000 x i32]* @fib, i32 0, i32 %i.0.reg2mem.0
        store i32 %6, i32* %7, align 4
        %exitcond = icmp eq i32 %0, 998 ; <i1> [#uses=1]
        br i1 %exitcond, label %return, label %bb1

This is basically:
  LOAD fib[i+1]
  LOAD fib[i]
  STORE fib[i+2]

Instead of handling this as a loop or other xform, all we'd need to do is teach
load PRE to phi translate the %0 add (i+1) into the predecessor as (i'+1+1) =
(i'+2) (where i' is the previous iteration of i).  This would find the store
which feeds it.

predcom-2.c is apparently the same as predcom-1.c
predcom-3.c is very similar but needs loads feeding each other instead of
store->load.
predcom-4.c seems the same as the rest.

//===---------------------------------------------------------------------===//

Other simple load PRE cases:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35287 [LPRE crit edge splitting]

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34677 (licm does this, LPRE crit edge)
  llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as | opt -mem2reg -simplifycfg -gvn | llvm-dis

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16799 [BITCAST PHI TRANS]

//===---------------------------------------------------------------------===//

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store.  This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

struct X { int i; };
int foo (int x) {
  struct X a;
  struct X b;
  struct X *p;
  a.i = 1;
  b.i = 2;
  if (x)
    p = &a;
  else
    p = &b;
  return p->i;
}

//===---------------------------------------------------------------------===//

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
      opt -mem2reg -gvn -instcombine | llvm-dis
we miss it because we need 1) GEP PHI TRAN, 2) CRIT EDGE 3) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, loading from null is undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "") -> strlen(x)
strcspn("", x) -> 0
strspn("", x) -> 0
strspn(x, "") -> strlen(x)
strspn(x, "a") -> strchr(x, 'a')-x

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character.  See PR3253 for some notes on
codegen.

456.hmmer apparently uses strcspn and strspn a lot.  471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//

"gas" uses this idiom:
     else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
     else if (strchr ("<>", *intel_parser.op_string)

Those should be turned into a switch.

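A sketch of the switch form for the second test.  Note that strchr(s, 0)
finds the terminator, so a faithful conversion keeps a case for '\0':

int is_lt_gt_or_nul(char c) {
  switch (c) {
  case '<': case '>': case '\0':
    return 1;
  default:
    return 0;
  }
}
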
//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

        %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)    ; uses = 1
        %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
        call void @llvm.memcpy.i32(i8* %endptr,
          i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
        %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple reasons.  First, in this:

        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)

The strlen could be replaced with "%strlen = sub %3073, %3072" if the strcpy
were first converted to stpcpy, which returns a pointer to the end of the
string.  Based on that, the endptr GEP just becomes equal to 3073, which
eliminates a strlen call and GEP.

Second, the memcpy+strlen strlen can be replaced with:

        %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

Because the destination was just copied into the specified memory buffer.  This,
in turn, can be constant folded to "4".

In other code, it contains:

        %endptr6978 = bitcast i8* %endptr69 to i32*
        store i32 7107374, i32* %endptr6978, align 1
        %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded.  Whatever is producing this should
probably be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit
        %682 = getelementptr i8** %argv, i32 6          ; <i8**> [#uses=2]
        %683 = load i8** %682, align 4                  ; <i8*> [#uses=4]
        %684 = load i8* %683, align 1                   ; <i8> [#uses=1]
        %685 = icmp eq i8 %684, 0                       ; <i1> [#uses=1]
        br i1 %685, label %bb10, label %bb9

bb9:            ; preds = %bb8
        %686 = call i32 @strlen(i8* %683) nounwind readonly
        %687 = icmp ugt i32 %686, 254                   ; <i1> [#uses=1]
        br i1 %687, label %bb10, label %bb11

bb10:           ; preds = %bb9, %bb8
        %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

//===---------------------------------------------------------------------===//

I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
which looks like:
        %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0

bb62:           ; preds = %bb55, %bb53
        %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
        %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
        %172 = add i32 %171, -1         ; <i32> [#uses=1]
        %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172

...  no stores ...
        br i1 %or.cond, label %bb65, label %bb72

bb65:           ; preds = %bb62
        store i8 0, i8* %173, align 1
        br label %bb72

bb72:           ; preds = %bb65, %bb62
        %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
        %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1

Note that on the bb62->bb72 path, the %177 strlen call is partially
redundant with the %171 call.  At worst, we could shove the %177 strlen call
up into the bb65 block moving it out of the bb62->bb72 path.  However, note
that bb65 stores to the string, zeroing out the last byte.  This means that on
that path the value of %177 is actually just %171-1.  A sub is cheaper than a
strlen!

This pattern repeats several times, basically doing:

  A = strlen(P);
  P[A-1] = 0;
  B = strlen(P);
  where it is "obvious" that B = A-1.

//===---------------------------------------------------------------------===//

186.crafty contains this interesting pattern:

%77 = call i8* @strstr(i8* getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0),
                       i8* %30)
%phitmp648 = icmp eq i8* %77, getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0)
br i1 %phitmp648, label %bb70, label %bb76

bb70:           ; preds = %OptionMatch.exit91, %bb69
  %78 = call i32 @strlen(i8* %30) nounwind readonly align 1 ; <i32> [#uses=1]

This is basically:
  cststr = "abcdef";
  if (strstr(cststr, P) == cststr) {
    x = strlen(P);
    ...

The strstr call would be significantly cheaper written as:

cststr = "abcdef";
if (memcmp(P, cststr, strlen(P)) == 0)
  x = strlen(P);

This is memcmp+strlen instead of strstr.  This also makes the strlen fully
redundant.

//===---------------------------------------------------------------------===//

186.crafty also contains this code:

%1906 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1907 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1906
%1908 = call i8* @strcpy(i8* %1907, i8* %1905) nounwind align 1
%1909 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1910 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1909

The last strlen is computable as 1908-@pgn_event (again, given an
end-pointer-returning stpcpy), which means 1910=1908.

//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
        i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
       i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call @printf(i8* ...   @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

  memcpy(globalarray, "string");
  printf(...,  globalarray);

Anyway, by knowing that printf just reads the memory and forward substituting
the string directly into the printf, this eliminates reads from globalarray.
Since this pattern occurs frequently in crafty (due to the "DisplayTime" and
other similar functions) there are many stores to "out".  Once all the printfs
stop using "out", all that is left is the memcpy's into it.  This should allow
globalopt to remove the "stored only" global.

//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
  %tmp0 = load i8* %p
  %tmp1 = ashr i8 %tmp0, 5
  %tmp2 = sext i8 %tmp1 to i32
  ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load with a shift.
For example, on x86 this currently gets this:

        movb    (%eax), %al
        sarb    $5, %al
        movsbl  %al, %eax

while it could get this:

        movsbl  (%eax), %eax
        sarl    $5, %eax

//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; }     // --> return false
int test2(int x) { return 2-x == x; }    // --> return x == 1 ?

Always foldable to false for odd constants: C-x == x requires 2*x == C, which
has no solution when C is odd.  For even C there are two solutions modulo 2^32
(C/2 and C/2 + 2^31), so the even case folds to a compare of x<<1 against C
rather than to a single equality.

//===---------------------------------------------------------------------===//

PR 3381: GEP to field of size 0 inside a struct could be turned into GEP
for next field in struct (which is at same address).

For example: store of float into { {{}}, float } could be turned into a store
to the float directly.

//===---------------------------------------------------------------------===//

#include <math.h>
double foo(double a) { return sin(a); }

This compiles into this on x86-64 Linux:
foo:
        subq    $8, %rsp
        call    sin
        addq    $8, %rsp
        ret
vs:

foo:
        jmp sin

//===---------------------------------------------------------------------===//

The arg promotion pass should make use of nocapture to make its alias analysis
stuff much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int int_char(char m) {if(m>7) return 0; return m;}

//===---------------------------------------------------------------------===//

int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }

Generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %a, 128                            ; <i32> [#uses=1]
  %1 = icmp eq i32 %0, 0                          ; <i1> [#uses=1]
  %2 = or i32 %b, 128                             ; <i32> [#uses=1]
  %3 = and i32 %b, -129                           ; <i32> [#uses=1]
  %b_addr.0 = select i1 %1, i32 %3, i32 %2        ; <i32> [#uses=1]
  ret i32 %b_addr.0
}

However, it's functionally equivalent to:

         b = (b & ~0x80) | (a & 0x80);

Which generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %b, -129                           ; <i32> [#uses=1]
  %1 = and i32 %a, 128                            ; <i32> [#uses=1]
  %2 = or i32 %0, %1                              ; <i32> [#uses=1]
  ret i32 %2
}

This can be generalized for other forms:

     b = (b & ~0x80) | (a & 0x40) << 1;

//===---------------------------------------------------------------------===//

These two functions produce different code.  They shouldn't:

#include <stdint.h>

uint8_t p1(uint8_t b, uint8_t a) {
  b = (b & ~0xc0) | (a & 0xc0);
  return (b);
}

uint8_t p2(uint8_t b, uint8_t a) {
  b = (b & ~0x40) | (a & 0x40);
  b = (b & ~0x80) | (a & 0x80);
  return (b);
}

define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63                              ; <i8> [#uses=1]
  %1 = and i8 %a, -64                             ; <i8> [#uses=1]
  %2 = or i8 %1, %0                               ; <i8> [#uses=1]
  ret i8 %2
}

define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63                              ; <i8> [#uses=1]
  %.masked = and i8 %a, 64                        ; <i8> [#uses=1]
  %1 = and i8 %a, -128                            ; <i8> [#uses=1]
  %2 = or i8 %1, %0                               ; <i8> [#uses=1]
  %3 = or i8 %2, %.masked                         ; <i8> [#uses=1]
  ret i8 %3
}

//===---------------------------------------------------------------------===//

IPSCCP does not currently propagate argument dependent constants through
functions where it cannot see all of the callers.  This includes functions
with normal external linkage as well as templates, C99 inline functions etc.
Specifically, it does nothing to:

define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
entry:
  %0 = add nsw i32 %y, %z
  %1 = mul i32 %0, %x
  %2 = mul i32 %y, %z
  %3 = add nsw i32 %1, %2
  ret i32 %3
}

define i32 @test2() nounwind {
entry:
  %0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
  ret i32 %0
}

It would be interesting to extend IPSCCP to be able to handle simple cases
like this, where all of the arguments to a call are constant.  Because IPSCCP
runs before inlining, trivial templates and inline functions are not yet
inlined.  The results for a function + set of constant arguments should be
memoized in a map.

//===---------------------------------------------------------------------===//

The libcall constant folding stuff should be moved out of SimplifyLibcalls into
libanalysis' constantfolding logic.  This would allow IPSCCP to be able to
handle simple things like this:

static int foo(const char *X) { return strlen(X); }
int bar() { return foo("abcd"); }

//===---------------------------------------------------------------------===//
1721//===---------------------------------------------------------------------===//