Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling conventions (e.g. x86) which have different
   clobber sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent.

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)  This
isn't safe in general, even on darwin.  See the libm implementation of hypot
for examples (which special case when x/y are exactly zero to get signed zeros
etc right).

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is that the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack.  But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
  x = 1ULL << i;

into:

long long tmp = 1;
for (i = ...; ++i, tmp += tmp)
  x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn X*X*X*X -> t=(X*X); t*t to eliminate a multiply.

//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

Reassociate should handle the example in GCC PR16157.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) { return memcmp(j, l, 4); }
int h(int *j, int *l) { return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1, 2, 4, and 8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

and teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works, but can be overly conservative, as the alignment
of specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3]};
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2      # cond_next
LBB16_1:        # cond_true
        incl _foo
LBB16_2:        # cond_next

emit:
        movl _foo, %eax
        cmpl $1, %edi
        sbbl $-1, %eax
        movl %eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs.  We could even make an intrinsic for this
if anyone cared enough about sincos.

//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

define void @test(i32* %P) {
  %tmp = load i32* %P
  %tmp14 = or i32 %tmp, 3305111552
  %tmp15 = and i32 %tmp14, 3321888767
  store i32 %tmp15, i32* %P
  ret void
}

//===---------------------------------------------------------------------===//

dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.

Compile:

int bar(int x)
{
  int t = __builtin_clz(x);
  return -(t>>5);
}

to:

_bar:   addic r3,r3,-1
        subfe r3,r3,r3
        blr

//===---------------------------------------------------------------------===//

Legalize should lower cttz like this:
  cttz(x) = popcnt((x-1) & ~x)

on targets that have popcnt but not cttz.  itanium, what else?

//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
        {
          /* Flip the target bit of each basis state */
          reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }

Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be
just so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias
reg->node[i], but alas...

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

typedef unsigned long long BITBOARD;
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X / C1 ), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

//===---------------------------------------------------------------------===//

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level.  We need a "loops to memcpy"
pass.

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

//===---------------------------------------------------------------------===//

Promote for i32 bswap can use i64 bswap + shr.  Useful on targets with 64-bit
regs and bswap, like itanium.

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two identical IV's (after promotion) on PPC/ARM:

LBB1_1: @bb.preheader
        mov r3, #0
        mov r2, r3
        mov r1, r3
LBB1_2: @bb
        ldr r12, LCPI1_0
        ldr r12, [r12]
        strh r2, [r12]
        ldr r12, LCPI1_1
        ldr r12, [r12]
        strh r3, [r12]
        add r1, r1, #1  <- [0,+,1]
        add r3, r3, #4
        add r2, r2, #1  <- [0,+,1]
        cmp r1, r0
        bne LBB1_2      @bb


//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
        %tmp.1 = and i32 %a, 1          ; <i32> [#uses=1]
        %tmp.2 = icmp ne i32 %tmp.1, 0          ; <i1> [#uses=1]
        br i1 %tmp.2, label %then.0, label %else.0

then.0:         ; preds = %entry
        %tmp.5 = add i32 %a, -1         ; <i32> [#uses=1]
        %tmp.3 = call i32 @t4( i32 %tmp.5 )             ; <i32> [#uses=1]
        br label %return

else.0:         ; preds = %entry
        %tmp.7 = icmp ne i32 %a, 0              ; <i1> [#uses=1]
        br i1 %tmp.7, label %then.1, label %return

then.1:         ; preds = %else.0
        %tmp.11 = add i32 %a, -2                ; <i32> [#uses=1]
        %tmp.9 = call i32 @t4( i32 %tmp.11 )            ; <i32> [#uses=1]
        br label %return

return:         ; preds = %then.1, %else.0, %then.0
        %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                            [ %tmp.9, %then.1 ]
        ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination is not transforming this function, because it is
returning n, which fails the isDynamicConstant check in the accumulator
recursion checks.

long long fib(const long long n) {
  switch(n) {
    case 0:
    case 1:
      return n;
    default:
      return fib(n-1) + fib(n-2);
  }
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative.  "return foo() << 1" can be tail recursion eliminated.

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
        %tmp = load i32* %x             ; <i32> [#uses=0]
        %tmp.foo = call i32 @foo( i32* %x )             ; <i32> [#uses=1]
        ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
        %tmp3 = call i32 @foo( i32* %x )                ; <i32> [#uses=1]
        ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

"basicaa" should know how to look through "or" instructions that act like add
instructions.  For example in this code, the x*4+1 is turned into x*4 | 1, and
basicaa can't analyze the array subscript, leading to duplicated loads in the
generated code:

void test(int X, int Y, int a[]) {
  int i;
  for (i = 2; i < 1000; i += 4) {
    a[i+0] = a[i-1+0]*a[i-2+0];
    a[i+1] = a[i-1+1]*a[i-2+1];
    a[i+2] = a[i-1+2]*a[i-2+2];
    a[i+3] = a[i-1+3]*a[i-2+3];
  }
}

BasicAA also doesn't do this for add.  It needs to know that &A[i+1] != &A[i].

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
        subl $28, %esp
        call "L1$pb"
"L1$pb":
        popl %eax
        cmpl $0, 32(%esp)
        je LBB1_2       # cond_true
LBB1_1: # return
        # ...
        addl $28, %esp
        ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations.  On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar.  On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable.  For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:
        pushl %esi
        movl 8(%esp), %eax
        movb _m_HotKey+3, %cl
        movb _m_HotKey+4, %dl
        movb _m_HotKey+2, %ch
        movw _m_HotKey, %si
        movw %si, (%eax)
        movb %ch, 2(%eax)
        movb %cl, 3(%eax)
        movb %dl, 4(%eax)
        popl %esi
        ret $4

GCC produces:

__Z9GetHotKeyv:
        movl _m_HotKey, %edx
        movl 4(%esp), %eax
        movl %edx, (%eax)
        movzwl _m_HotKey+4, %edx
        movw %dx, 4(%eax)
        ret $4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

        %struct.THotKey = type { i16, i8, i8, i8 }
define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
...
        %tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
        %tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
        %tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
        %tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

This GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34043
contains a testcase that compiles down to:

        %struct.XMM128 = type { <4 x float> }
..
        %src = alloca %struct.XMM128
..
        %tmp6263 = bitcast %struct.XMM128* %src to <2 x i64>*
        %tmp65 = getelementptr %struct.XMM128* %src, i32 0, i32 0
        store <2 x i64> %tmp5899, <2 x i64>* %tmp6263, align 16
        %tmp66 = load <4 x float>* %tmp65, align 16
        %tmp71 = add <4 x float> %tmp66, %tmp66

If the mid-level optimizer turned the bitcast of pointer + store of tmp5899
into a bitcast of the vector value and a store to the pointer, then the
store->load could be easily removed.

//===---------------------------------------------------------------------===//

Consider:

int test() {
   long long input[8] = {1,1,1,1,1,1,1,1};
   foo(input);
}

We currently compile this into a memcpy from a global array since the
initializer is fairly large and not memset'able.  This is good, but the memcpy
gets lowered to load/stores in the code generator.  This is also ok, except
that the codegen lowering for memcpy doesn't handle the case when the source
is a constant global.  This gives us atrocious code like this:

        call "L1$pb"
"L1$pb":
        popl %eax
        movl _C.0.1444-"L1$pb"+32(%eax), %ecx
        movl %ecx, 40(%esp)
        movl _C.0.1444-"L1$pb"+20(%eax), %ecx
        movl %ecx, 28(%esp)
        movl _C.0.1444-"L1$pb"+36(%eax), %ecx
        movl %ecx, 44(%esp)
        movl _C.0.1444-"L1$pb"+44(%eax), %ecx
        movl %ecx, 52(%esp)
        movl _C.0.1444-"L1$pb"+40(%eax), %ecx
        movl %ecx, 48(%esp)
        movl _C.0.1444-"L1$pb"+12(%eax), %ecx
        movl %ecx, 20(%esp)
        movl _C.0.1444-"L1$pb"+4(%eax), %ecx
...

instead of:
        movl $1, 16(%esp)
        movl $0, 20(%esp)
        movl $1, 24(%esp)
        movl $0, 28(%esp)
        movl $1, 32(%esp)
        movl $0, 36(%esp)
        ...

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef".  Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop.  One trivial example is:

#include <stdio.h>
int main() {
    int nRet = 17;
    int nLoop;
    for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
        if ( nLoop & 1 )
            nRet += 2;
        else
            nRet -= 1;
    }
    return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size.  The resultant code would then also be suitable for
exit value computation.

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc,
x86, etc.  On X86, we miss a bunch of 'rotate by variable' cases because the
rotate matching code in dag combine doesn't look through truncates
aggressively enough.  Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y){
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x){
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x){
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f2/f3/f4 right.  On x86-32, a few of these
generate truly horrible code, instead of using shld and friends.  On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness.  PPC64 misses f, f5 and f6.  CellSPU aborts in isel.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together.  For
example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy.  This can only
be done safely if "b" isn't modified between the strlen and memcpy of course.

//===---------------------------------------------------------------------===//

We should be able to evaluate this loop:

int test(int x_offs) {
  while (x_offs > 4)
    x_offs -= 4;
  return x_offs;
}

//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

//===---------------------------------------------------------------------===//

We generate a horrible libcall for llvm.powi.  For example, we compile:

#include <cmath>
double f(double a) { return std::pow(a, 4); }

into:

__Z1fd:
        subl $12, %esp
        movsd 16(%esp), %xmm0
        movsd %xmm0, (%esp)
        movl $4, 8(%esp)
        call L___powidf2$stub
        addl $12, %esp
        ret

GCC produces:

__Z1fd:
        subl $12, %esp
        movsd 16(%esp), %xmm0
        mulsd %xmm0, %xmm0
        mulsd %xmm0, %xmm0
        movsd %xmm0, (%esp)
        fldl (%esp)
        addl $12, %esp
        ret

//===---------------------------------------------------------------------===//

We compile this program: (from GCC PR11680)
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%    0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%    0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2).  */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mullo (cheaper).  Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

I think this basically amounts to a dag combine to simplify comparisons
against multiply hi's into a comparison against the mullo.

//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

Instcombine will merge comparisons like (x >= 10) && (x < 20) by producing
(x - 10) u< 10, but only when the comparisons have matching sign.

This could be converted with a similar technique (PR1941):

define i1 @test(i8 %x) {
  %A = icmp uge i8 %x, 5
  %B = icmp slt i8 %x, 20
  %C = and i1 %A, %B
  ret i1 %C
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6    ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0.  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
  if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
    f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 15241:
unsigned int
foo (unsigned int a, unsigned int b)
{
  if (a <= 7 && b <= 7)
    baz ();
}
Should combine to "(a|b) <= 7".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

935//===---------------------------------------------------------------------===//
936
937From GCC Bug 3756:
938int
939pn (int n)
940{
941 return (n >= 0 ? 1 : -1);
942}
943Should combine to (n >> 31) | 1. Currently not optimized with "clang
944-emit-llvm-bc | opt -std-compile-opts | llc".
945
946//===---------------------------------------------------------------------===//
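The branchless form relies on arithmetic right shift of a negative value,
which is implementation-defined in C but is what the IR-level fold produces;
this sketch assumes a 32-bit int with arithmetic shifts:

```c
#include <assert.h>

static int pn(int n) { return n >= 0 ? 1 : -1; }

/* (n >> 31) | 1: the shift smears the sign bit across all 32 bits,
   giving 0 or -1, and or'ing in 1 yields 1 or -1. */
static int pn_branchless(int n) { return (n >> 31) | 1; }
```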

From GCC Bug 28685:
int test(int a, int b)
{
  int lt = a < b;
  int eq = a == b;

  return (lt || eq);
}
Should combine to "a <= b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

void a(int variable)
{
 if (variable == 4 || variable == 6)
   bar();
}
This should optimize to "if ((variable | 2) == 6)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//
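The two values 4 and 6 differ only in bit 1, so or'ing that bit in reduces
both to 6 and the pair of equality tests becomes one (a sketch with
hypothetical names):

```c
#include <assert.h>

static int is_4_or_6(int v) { return v == 4 || v == 6; }

/* 4 and 6 differ only in bit 1; force that bit, then compare once. */
static int is_4_or_6_merged(int v) { return (v | 2) == 6; }
```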

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing.  Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
 return (abs(x)) >= 0;
}
This should optimize to x != INT_MIN.  (With -fwrapv.)  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
 a = (a << 10) | (a >> 22);
 if (a == 123)
   bar ();
}
void
minus_cst (unsigned int a)
{
 unsigned int tem;

 tem = 20 - a;
 if (tem == 5)
   bar ();
}
void
mask_gt (unsigned int a)
{
 /* This is equivalent to a > 15.  */
 if ((a & ~7) > 8)
   bar ();
}
void
rshift_gt (unsigned int a)
{
 /* This is equivalent to a > 23.  */
 if ((a >> 2) > 5)
   bar ();
}
All should simplify to a single comparison.  All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//
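For mask_gt and rshift_gt the single-comparison forms are the ones stated in
the comments; they can be checked at the boundaries (a sketch, hypothetical
names):

```c
#include <assert.h>

/* (a & ~7) > 8 clears the low three bits, so it is true exactly
   when a >= 16, i.e. a > 15. */
static int mask_gt(unsigned a)  { return (a & ~7u) > 8; }
static int mask_gt2(unsigned a) { return a > 15; }

/* (a >> 2) > 5 is true exactly when a >= 24, i.e. a > 23. */
static int rshift_gt(unsigned a)  { return (a >> 2) > 5; }
static int rshift_gt2(unsigned a) { return a > 23; }
```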

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned char* b) {return *b > 99;}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return ((a | 1) & 3) | (a & -4);}
Should combine to "a | 1".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)".  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a".  Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane.  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c).  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -1 : -9;}
Should combine to (x | -9) ^ 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -9 : -1;}
Should combine to x | -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//
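The two forms are exactly equal: 0x88888888 is 0x11111111 << 3, so bit 31 of
a*0x88888888 is bit 28 of a*0x11111111, and the final mask disappears (a
sketch, hypothetical names):

```c
#include <assert.h>

static unsigned nibble_bit(unsigned a)     { return a * 0x11111111u >> 28 & 1; }

/* 0x88888888 == 0x11111111 << 3, so shifting the product right by 31
   reads the same bit and saves the trailing 'and'. */
static unsigned nibble_bit_opt(unsigned a) { return a * 0x88888888u >> 31; }
```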

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//
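The combine is valid because the result is truncated to 32 bits: writing x as
2^32*H + L, the 40*2^31*H term vanishes mod 2^32, and 40*(L>>1) == 20*(L & -2).
A quick check (sketch):

```c
#include <assert.h>

static unsigned f(unsigned long long x)     { return 40 * (x >> 1); }

/* 40*(x>>1) mod 2^32 depends only on the low 32 bits of x, and
   40*(L>>1) == 20*(L & ~1) for unsigned L. */
static unsigned f_opt(unsigned long long x) { return 20 * (((unsigned)x) & ~1u); }
```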

We would like to do the following transform in the instcombiner:

  -X/C -> X/-C

However, this isn't valid if (-X) overflows.  We can implement this when we
have the concept of a "C signed subtraction" operator which is undefined on
overflow.

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

        %tmp = icmp eq i32 %decl_context, 4
        %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
        %tmp1 = icmp eq i32 %decl_context_addr.0, 1
        %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp || decl_context == 1)

This would allow recursive simplifications; tmp1 is used all over the place in
the function, e.g. by:

        %tmp23 = icmp eq i32 %decl_context_addr.1, 0            ; <i1> [#uses=1]
        %tmp24 = xor i1 %tmp1, true             ; <i1> [#uses=1]
        %or.cond8 = and i1 %tmp23, %tmp24               ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

Store sinking: This code:

void f (int n, int *cond, int *res) {
 int i;
 *res = 0;
 for (i = 0; i < n; i++)
   if (*cond)
     *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out.  This gives us this code:

bb:             ; preds = %bb2, %entry
        %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
        %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
        %1 = load i32* %cond, align 4
        %2 = icmp eq i32 %1, 0
        br i1 %2, label %bb2, label %bb1

bb1:            ; preds = %bb
        %3 = xor i32 %.rle, 234
        store i32 %3, i32* %res, align 4
        br label %bb2

bb2:            ; preds = %bb, %bb1
        %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
        %indvar.next = add i32 %i.05, 1
        %exitcond = icmp eq i32 %indvar.next, %n
        br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partially dead store case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.

//===---------------------------------------------------------------------===//

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store.  We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[PHI TRANSLATE GEPs]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic.  This could be handled by GVN with some crazy
symbolic phi translation.  The code we get looks like (g is on the stack):

bb2:            ; preds = %bb1
..
        %9 = getelementptr %struct.f* %g, i32 0, i32 0
        store i32 %8, i32* %9, align 4
        br label %bb3

bb3:            ; preds = %bb1, %bb2, %bb
        %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
        %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
        %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
        %11 = load i32* %10, align 4

%11 is fully redundant, and in BB2 it should have the value %8.

GCC PR33344 is a similar case.

//===---------------------------------------------------------------------===//

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite.  There are also many PRE testcases named ssa-pre-*.c.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite.  For example, predcom-1.c is:

 for (i = 2; i < 1000; i++)
    fib[i] = (fib[i-1] + fib[i - 2]) & 0xffff;

which compiles into:

bb1:            ; preds = %bb1, %bb1.thread
        %indvar = phi i32 [ 0, %bb1.thread ], [ %0, %bb1 ]
        %i.0.reg2mem.0 = add i32 %indvar, 2
        %0 = add i32 %indvar, 1         ; <i32> [#uses=3]
        %1 = getelementptr [1000 x i32]* @fib, i32 0, i32 %0
        %2 = load i32* %1, align 4              ; <i32> [#uses=1]
        %3 = getelementptr [1000 x i32]* @fib, i32 0, i32 %indvar
        %4 = load i32* %3, align 4              ; <i32> [#uses=1]
        %5 = add i32 %4, %2             ; <i32> [#uses=1]
        %6 = and i32 %5, 65535          ; <i32> [#uses=1]
        %7 = getelementptr [1000 x i32]* @fib, i32 0, i32 %i.0.reg2mem.0
        store i32 %6, i32* %7, align 4
        %exitcond = icmp eq i32 %0, 998         ; <i1> [#uses=1]
        br i1 %exitcond, label %return, label %bb1

This is basically:
  LOAD fib[i+1]
  LOAD fib[i]
  STORE fib[i+2]

instead of handling this as a loop or other xform, all we'd need to do is teach
load PRE to phi translate the %0 add (i+1) into the predecessor as (i'+1+1) =
(i'+2) (where i' is the previous iteration of i).  This would find the store
which feeds it.

predcom-2.c is apparently the same as predcom-1.c
predcom-3.c is very similar but needs loads feeding each other instead of
store->load.
predcom-4.c seems the same as the rest.


//===---------------------------------------------------------------------===//

Other simple load PRE cases:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35287 [LPRE crit edge splitting]

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34677 (licm does this, LPRE crit edge)
  llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as | opt -mem2reg -simplifycfg -gvn | llvm-dis

//===---------------------------------------------------------------------===//

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

//===---------------------------------------------------------------------===//

When GVN/PRE finds a store of float* to a must-aliased pointer when expecting
an int*, it should turn it into a bitcast.  This is a nice generalization of
the SROA hack that would apply to other cases, e.g.:

int foo(int C, int *P, float X) {
  if (C) {
    bar();
    *P = 42;
  } else
    *(float*)P = X;

   return *P;
}


One example (that requires crazy phi translation) is:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16799 [BITCAST PHI TRANS]

//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store.  This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
      opt -mem2reg -gvn -instcombine | llvm-dis
we miss it because we need 1) GEP PHI TRAN, 2) CRIT EDGE 3) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, loading from null is undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "") -> strlen(x)
strcspn("", x) -> 0
strspn("", x) -> 0
strspn(x, "") -> 0
strcspn(x, "a") -> strchr(x, 'a')-x, when 'a' is known to occur in x

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character.  See PR3253 for some notes on
codegen.

456.hmmer apparently uses strcspn and strspn a lot.  471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//
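The simple folds in the first group can be checked against the C library
directly (a sketch; the helper name is hypothetical):

```c
#include <assert.h>
#include <string.h>

/* strcspn(x, "a"), when 'a' is known to occur in x, is just the
   distance to the first 'a'. */
static size_t cspn_via_strchr(const char *x, char a) {
  return (size_t)(strchr(x, a) - x);
}
```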

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
  else if (strchr ("<>", *intel_parser.op_string))

Those should be turned into a switch.

//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

        %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)    ; uses = 1
        %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
        call void @llvm.memcpy.i32(i8* %endptr,
               i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
        %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple reasons.  First, in this:

        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)

The strlen could be replaced with: %strlen = sub %3073, %3072, if the strcpy
call were first converted to stpcpy, which returns a pointer to the end of the
string.  Based on that, the endptr GEP just becomes equal to 3073, which
eliminates a strlen call and GEP.

Second, the memcpy+strlen strlen can be replaced with:

        %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

Because the destination was just copied into the specified memory buffer.  This,
in turn, can be constant folded to "4".

In other code, it contains:

        %endptr6978 = bitcast i8* %endptr69 to i32*
        store i32 7107374, i32* %endptr6978, align 1
        %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded.  Whatever is producing this should probably
be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit
        %682 = getelementptr i8** %argv, i32 6          ; <i8**> [#uses=2]
        %683 = load i8** %682, align 4          ; <i8*> [#uses=4]
        %684 = load i8* %683, align 1           ; <i8> [#uses=1]
        %685 = icmp eq i8 %684, 0               ; <i1> [#uses=1]
        br i1 %685, label %bb10, label %bb9

bb9:            ; preds = %bb8
        %686 = call i32 @strlen(i8* %683) nounwind readonly
        %687 = icmp ugt i32 %686, 254           ; <i1> [#uses=1]
        br i1 %687, label %bb10, label %bb11

bb10:           ; preds = %bb9, %bb8
        %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

//===---------------------------------------------------------------------===//

I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
which looks like:
        %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0


bb62:           ; preds = %bb55, %bb53
        %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
        %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
        %172 = add i32 %171, -1         ; <i32> [#uses=1]
        %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172

...  no stores ...
       br i1 %or.cond, label %bb65, label %bb72

bb65:           ; preds = %bb62
        store i8 0, i8* %173, align 1
        br label %bb72

bb72:           ; preds = %bb65, %bb62
        %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
        %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1

Note that on the bb62->bb72 path, the %177 strlen call is partially redundant
with the %171 call.  At worst, we could shove the %177 strlen call up into the
bb65 block, moving it out of the bb62->bb72 path.  However, note that bb65
stores to the string, zeroing out the last byte.  This means that on that path
the value of %177 is actually just %171-1.  A sub is cheaper than a strlen!

This pattern repeats several times, basically doing:

  A = strlen(P);
  P[A-1] = 0;
  B = strlen(P);
  where it is "obvious" that B = A-1.

//===---------------------------------------------------------------------===//
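The claim that B = A-1 holds whenever the string is non-empty, since storing a
zero over the last character shortens it by exactly one byte (a sketch; the
helper is hypothetical):

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if chopping the last byte of s shrinks strlen by exactly 1. */
static int chop_shrinks_by_one(const char *s) {
  char buf[64];
  strcpy(buf, s);                 /* assumes strlen(s) < 64 and s non-empty */
  size_t a = strlen(buf);         /* A = strlen(P)  */
  buf[a - 1] = 0;                 /* P[A-1] = 0     */
  return strlen(buf) == a - 1;    /* B == A-1       */
}
```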

186.crafty contains this interesting pattern:

%77 = call i8* @strstr(i8* getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0),
                       i8* %30)
%phitmp648 = icmp eq i8* %77, getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0)
br i1 %phitmp648, label %bb70, label %bb76

bb70:           ; preds = %OptionMatch.exit91, %bb69
  %78 = call i32 @strlen(i8* %30) nounwind readonly align 1  ; <i32> [#uses=1]

This is basically:
  cststr = "abcdef";
  if (strstr(cststr, P) == cststr) {
     x = strlen(P);
     ...

The strstr call would be significantly cheaper written as:

cststr = "abcdef";
if (!memcmp(P, cststr, strlen(P)))
  x = strlen(P);

This is memcmp+strlen instead of strstr.  This also makes the strlen fully
redundant.

//===---------------------------------------------------------------------===//
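The underlying equivalence is that strstr(cststr, P) == cststr exactly when P
is a prefix of cststr, which a bounded memcmp can answer (a sketch; the memcmp
form also needs strlen(P) <= strlen(cststr) so it never reads past the
constant):

```c
#include <assert.h>
#include <string.h>

static int prefix_via_strstr(const char *cststr, const char *p) {
  return strstr(cststr, p) == cststr;
}

static int prefix_via_memcmp(const char *cststr, const char *p) {
  size_t n = strlen(p);
  /* Guard so memcmp never reads past the constant string. */
  return n <= strlen(cststr) && memcmp(cststr, p, n) == 0;
}
```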

186.crafty also contains this code:

%1906 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1907 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1906
%1908 = call i8* @strcpy(i8* %1907, i8* %1905) nounwind align 1
%1909 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1910 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1909

With the strcpy converted to stpcpy, the last strlen is computable as
1908-@pgn_event, which means 1910=1908.

//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
        i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
       i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call @printf(i8* ... @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

  memcpy(globalarray, "string");
  printf(...,  globalarray);

Anyway, by knowing that printf just reads the memory and forward substituting
the string directly into the printf, this eliminates reads from globalarray.
Since this pattern occurs frequently in crafty (due to the "DisplayTime" and
other similar functions) there are many stores to "out".  Once all the printfs
stop using "out", all that is left is the memcpy's into it.  This should allow
globalopt to remove the "stored only" global.

//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
  %tmp0 = load i8* %p
  %tmp1 = ashr i8 %tmp0, 5
  %tmp2 = sext i8 %tmp1 to i32
  ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load with a shift.
For example, on x86 this currently gets this:

	movb	(%eax), %al
	sarb	$5, %al
	movsbl	%al, %eax

while it could get this:

	movsbl	(%eax), %eax
	sarl	$5, %eax

//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; }     // --> return false
int test2(int x) { return 2-x == x; }    // --> return x == 1 ?

Always foldable for odd constants.  For an even constant C, C-x == x means
2*x == C (mod 2^32), which holds for x == C/2 and also for x == C/2 + 2^31,
so the fold needs to account for both solutions.

//===---------------------------------------------------------------------===//
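A sketch of the rule, done in unsigned arithmetic to mirror the wrapping
subtraction at the IR level: C - x == x means 2*x == C (mod 2^32), which is
unsatisfiable for odd C and has exactly two solutions, C/2 and C/2 + 2^31,
for even C.

```c
#include <assert.h>

/* 1 - x == x has no solution mod 2^32: 2*x is always even. */
static int test1(unsigned x) { return 1u - x == x; }

/* 2 - x == x holds for x == 1 and for x == 1 + 2^31. */
static int test2(unsigned x) { return 2u - x == x; }
```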

PR 3381: GEP to field of size 0 inside a struct could be turned into GEP
for next field in struct (which is at same address).

For example: store of float into { {{}}, float } could be turned into a store to
the float directly.

//===---------------------------------------------------------------------===//

#include <math.h>
double foo(double a) { return sin(a); }

This compiles into this on x86-64 Linux:
foo:
	subq	$8, %rsp
	call	sin
	addq	$8, %rsp
	ret
vs:

foo:
        jmp sin

//===---------------------------------------------------------------------===//

The arg promotion pass should make use of nocapture to make its alias analysis
stuff much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int  int_char(char m) {if(m>7) return 0; return m;}

//===---------------------------------------------------------------------===//

Instcombine should replace the load with a constant in:

  static const char x[4] = {'a', 'b', 'c', 'd'};

  unsigned int y(void) {
    return *(unsigned int *)x;
  }

It currently only does this transformation when the size of the constant
is the same as the size of the integer (so, try x[5]) and the last byte
is a null (making it a C string).  There's no need for these restrictions.

//===---------------------------------------------------------------------===//
1654
Chris Lattnerfd62ccc2009-05-11 17:36:33 +00001655InstCombine's "turn load from constant into constant" optimization should be
1656more aggressive in the presence of bitcasts. For example, because of unions,
1657this code:
1658
1659union vec2d {
1660 double e[2];
1661 double v __attribute__((vector_size(16)));
1662};
1663typedef union vec2d vec2d;
1664
1665static vec2d a={{1,2}}, b={{3,4}};
1666
1667vec2d foo () {
1668 return (vec2d){ .v = a.v + b.v * (vec2d){{5,5}}.v };
1669}
1670
1671Compiles into:
1672
1673@a = internal constant %0 { [2 x double]
1674 [double 1.000000e+00, double 2.000000e+00] }, align 16
1675@b = internal constant %0 { [2 x double]
1676 [double 3.000000e+00, double 4.000000e+00] }, align 16
1677...
1678define void @foo(%struct.vec2d* noalias nocapture sret %agg.result) nounwind {
1679entry:
1680 %0 = load <2 x double>* getelementptr (%struct.vec2d*
1681 bitcast (%0* @a to %struct.vec2d*), i32 0, i32 0), align 16
1682 %1 = load <2 x double>* getelementptr (%struct.vec2d*
1683 bitcast (%0* @b to %struct.vec2d*), i32 0, i32 0), align 16
1684
1685
1686Instcombine should be able to optimize away the loads (and thus the globals).
1687
1688
1689//===---------------------------------------------------------------------===//