Target Independent Opportunities:

//===---------------------------------------------------------------------===//

We should recognize idioms for add-with-carry and turn them into the
appropriate intrinsics. This example:

unsigned add32carry(unsigned sum, unsigned x) {
  unsigned z = sum + x;
  if (sum + x < x)
    z++;
  return z;
}

With "clang t.c -S -o - -O3 -fomit-frame-pointer -m64 -mkernel", this
compiles to:

_add32carry:                            ## @add32carry
    addl    %esi, %edi
    cmpl    %esi, %edi
    sbbl    %eax, %eax
    andl    $1, %eax
    addl    %edi, %eax
    ret

but gcc produces:

_add32carry:
    leal    (%rsi,%rdi), %eax
    cmpl    %esi, %eax
    adcl    $0, %eax
    ret

//===---------------------------------------------------------------------===//

Dead argument elimination should be enhanced to handle cases when an argument
is dead to an externally visible function. Though the argument can't be
removed from the externally visible function, the caller doesn't need to pass
it in. For example, in this testcase:

  void foo(int X) __attribute__((noinline));
  void foo(int X) { sideeffect(); }
  void bar(int A) { foo(A+1); }

We compile bar to:

define void @bar(i32 %A) nounwind ssp {
  %0 = add nsw i32 %A, 1                          ; <i32> [#uses=1]
  tail call void @foo(i32 %0) nounwind noinline ssp
  ret void
}

The add is dead; we could pass in 'i32 undef' instead. This occurs for C++
templates etc., which usually have linkonce_odr/weak_odr linkage, not
internal linkage.

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call'
instructions so that the .td files don't list all the call-clobbered
registers as implicit defs. Instead, these should be added by the code
generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call
   instructions for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different
   clobber sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber
   sets of calls.

//===---------------------------------------------------------------------===//

We should recognize various "overflow detection" idioms and translate them
into llvm.uadd.with.overflow and similar intrinsics. Here is a multiply
idiom:

unsigned int mul(unsigned int a, unsigned int b) {
  if ((unsigned long long)a*b > 0xffffffff)
    exit(0);
  return a*b;
}

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math). Misc/mandel will like this. :) This
isn't safe in general, even on darwin. See the libm implementation of hypot
for examples (which special case when x/y are exactly zero to get signed
zeros etc. right).
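
A sketch of the expanded form, assuming -ffast-math (so the precision, errno,
and signed-zero concerns above are waived):

#include <math.h>

/* Naive expansion: can overflow/underflow where a real hypot would not. */
double hypot_fast(double x, double y) {
  return sqrt(x*x + y*y);
}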

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to:

fn1:
    movl Y, %eax
    shll $3, %eax
    orl X, %eax
    movl %eax, X
    ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack. But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
  x = 1ULL << i;

into:

long long tmp = 1;
for (i = ...; ++i, tmp += tmp)
  x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

First, the intrinsic needs to be extended to support integers, and second the
code generator needs to be enhanced to lower these to multiplication trees.
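
For reference, a balanced tree evaluates X**8 in three multiplies instead of
seven; a hand-written sketch of the expansion:

int factorial_balanced(int X) {
  int X2 = X*X;     /* X^2: 1 multiply */
  int X4 = X2*X2;   /* X^4: 2 multiplies */
  return X4*X4;     /* X^8: 3 multiplies total, vs. 7 for the chain */
}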

//===---------------------------------------------------------------------===//

Interesting testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

This is blocked on not handling X*X*X -> powi(X, 3) (see note above). The
issue is that we end up getting t = 2*X, s = t*t and don't turn this into
4*X*X, which is the same number of multiplies and is canonical, because the
2*X has multiple uses. Here's a simple example:

define i32 @test15(i32 %X1) {
  %B = mul i32 %X1, 47   ; X1*47
  %C = mul i32 %B, %B
  ret i32 %C
}

//===---------------------------------------------------------------------===//

Reassociate should handle the example in GCC PR16157:

extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;
void f () {  /* this can be optimized to four additions... */
  b4 = a4 + a3 + a2 + a1 + a0;
  b3 = a3 + a2 + a1 + a0;
  b2 = a2 + a1 + a0;
  b1 = a1 + a0;
}

This requires reassociating to forms of expressions that are already
available, something that reassoc doesn't think about yet.
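
The four-addition form reuses each sum as it is built (assignment order
reversed, which is fine since the b's are independent):

void f () {
  b1 = a1 + a0;
  b2 = a2 + b1;
  b3 = a3 + b2;
  b4 = a4 + b3;
}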

//===---------------------------------------------------------------------===//

This function: (derived from GCC PR19988)
double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x + -0.1234 * y));
}

compiles to:
_foo:
    movapd  %xmm1, %xmm2
    mulsd   LCPI1_1(%rip), %xmm1
    mulsd   LCPI1_0(%rip), %xmm2
    addsd   %xmm0, %xmm1
    addsd   %xmm0, %xmm2
    movapd  %xmm1, %xmm0
    mulsd   %xmm2, %xmm0
    ret

Reassociate should be able to turn it into:

double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x - 0.1234 * y));
}

Which allows the multiply by constant to be CSE'd, producing:

_foo:
    mulsd   LCPI1_0(%rip), %xmm1
    movapd  %xmm1, %xmm2
    addsd   %xmm0, %xmm2
    subsd   %xmm1, %xmm0
    mulsd   %xmm2, %xmm0
    ret

This doesn't need -ffast-math support at all. This is particularly bad because
the llvm-gcc frontend is canonicalizing the latter into the former, but clang
doesn't have this problem.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) { return memcmp(j, l, 4); }
int h(int *j, int *l) { return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1, 2, 4, and 8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize. It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is
equal to the type size. It works but can be overly conservative, as the
alignment of specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

v4sf example(float *P) {
  return (v4sf){ P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns. Instead
of:

    movl 136(%esp), %eax
    cmpl $0, %eax
    je LBB16_2 #cond_next
LBB16_1: #cond_true
    incl _foo
LBB16_2: #cond_next

emit:
    movl _foo, %eax
    cmpl $1, %edi
    sbbl $-1, %eax
    movl %eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
    double sincos(double x, double *sin, double *cos);
    float sincosf(float x, float *sin, float *cos);
    long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers. See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs. We could even make an intrinsic for this
if anyone cared enough about sincos.
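
For reference, the merged form at the source level, using the GNU sincos
extension declared above:

#define _GNU_SOURCE
#include <math.h>

void polar_to_xy(double r, double theta, double *x, double *y) {
  double s, c;
  sincos(theta, &s, &c);  /* one call instead of separate sin() and cos() */
  *x = r * c;
  *y = r * s;
}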

//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
        {
          /* Flip the target bit of each basis state */
          reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }

Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be
just so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias
reg->node[i], but this requires TBAA.

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
  unsigned t;
  t = v ^ ((v << 16) | (v >> 16));
  t &= ~0xff0000;
  v = (v << 24) | (v >> 8);
  return v ^ (t >> 8);
}

Neither is this (very standard idiom):

int f(int n)
{
  return (((n) << 24) | (((n) & 0xff00) << 8)
        | (((n) >> 8) & 0xff00) | ((n) >> 24));
}

//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}

unsigned countbits_fast(unsigned v) {
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

typedef unsigned long long BITBOARD;
int PopCnt(register BITBOARD a) {
  register int c = 0;
  while (a) {
    c++;
    a &= a - 1;
  }
  return c;
}

unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

This is a form of idiom recognition for loops, the same thing that could be
useful for recognizing memset/memcpy. This sort of thing should be added to
the loop idiom pass.

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned. Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match. See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.
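
A concrete instance of the unsigned case (a sketch, not the general rule):
an equality compare of a udiv can become a subtract-and-compare range check:

int before(unsigned x) { return x / 10 == 4; }
int after (unsigned x) { return x - 40 < 10; }  /* i.e. 40 <= x && x <= 49 */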

//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level. We need a "loops to memcpy"
pass.

//===---------------------------------------------------------------------===//

[LOOP OPTIMIZATION]

SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization
opportunities in its double_array_divs_variable function: it needs loop
interchange, memory promotion (which LICM already does), vectorization and
variable trip count loop unrolling (since it has a constant trip count). ICC
apparently produces this very nice code with -ffast-math:

..B1.70:                        # Preds ..B1.70 ..B1.69
    mulpd     %xmm0, %xmm1                                  #108.2
    mulpd     %xmm0, %xmm1                                  #108.2
    mulpd     %xmm0, %xmm1                                  #108.2
    mulpd     %xmm0, %xmm1                                  #108.2
    addl      $8, %edx                                      #
    cmpl      $131072, %edx                                 #108.2
    jb        ..B1.70       # Prob 99%                      #108.2

It would be better to count down to zero, but this is a lot better than what
we do.

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used. On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags. On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

PHI Slicing could be extended to do this.

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has from TargetData. This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two near-identical IVs (after promotion) on PPC/ARM:

LBB1_2:
    ldr r3, LCPI1_0
    ldr r3, [r3]
    strh r2, [r3]
    ldr r3, LCPI1_1
    ldr r3, [r3]
    strh r1, [r3]
    add r1, r1, #4
    add r2, r2, #1  <- [0,+,1]
    sub r0, r0, #1  <- [0,-,1]
    cmp r0, #0
    bne LBB1_2

LSR should reuse the "+" IV for the exit test.

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
  %tmp.1 = and i32 %a, 1                  ; <i32> [#uses=1]
  %tmp.2 = icmp ne i32 %tmp.1, 0          ; <i1> [#uses=1]
  br i1 %tmp.2, label %then.0, label %else.0

then.0:         ; preds = %entry
  %tmp.5 = add i32 %a, -1                 ; <i32> [#uses=1]
  %tmp.3 = call i32 @t4( i32 %tmp.5 )     ; <i32> [#uses=1]
  br label %return

else.0:         ; preds = %entry
  %tmp.7 = icmp ne i32 %a, 0              ; <i1> [#uses=1]
  br i1 %tmp.7, label %then.1, label %return

then.1:         ; preds = %else.0
  %tmp.11 = add i32 %a, -2                ; <i32> [#uses=1]
  %tmp.9 = call i32 @t4( i32 %tmp.11 )    ; <i32> [#uses=1]
  br label %return

return:         ; preds = %then.1, %else.0, %then.0
  %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                      [ %tmp.9, %then.1 ]
  ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative. "return foo() << 1" can be tail recursion eliminated.
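
A hand-written sketch of what accumulator-based tail recursion elimination
could produce for pow2m1 (each level computes 2*<inner result> + 1, so that
operation is applied to an accumulator once per level):

int pow2m1_iter(int n) {
  int acc = 0;
  for (; n != 0; --n)
    acc = 2*acc + 1;
  return acc;
}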

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
  %tmp = load i32* %x                     ; <i32> [#uses=0]
  %tmp.foo = call i32 @foo( i32* %x )     ; <i32> [#uses=1]
  ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
  %tmp3 = call i32 @foo( i32* %x )        ; <i32> [#uses=1]
  ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass. Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
    subl    $28, %esp
    call    "L1$pb"
"L1$pb":
    popl    %eax
    cmpl    $0, 32(%esp)
    je      LBB1_2  # cond_true
LBB1_1: # return
    # ...
    addl    $28, %esp
    ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block. It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it. This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86. If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations. On a Yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar. On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable. For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:
    pushl   %esi
    movl    8(%esp), %eax
    movb    _m_HotKey+3, %cl
    movb    _m_HotKey+4, %dl
    movb    _m_HotKey+2, %ch
    movw    _m_HotKey, %si
    movw    %si, (%eax)
    movb    %ch, 2(%eax)
    movb    %cl, 3(%eax)
    movb    %dl, 4(%eax)
    popl    %esi
    ret     $4

GCC produces:

__Z9GetHotKeyv:
    movl    _m_HotKey, %edx
    movl    4(%esp), %eax
    movl    %edx, (%eax)
    movzwl  _m_HotKey+4, %edx
    movw    %dx, 4(%eax)
    ret     $4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

  %struct.THotKey = type { i16, i8, i8, i8 }
define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
...
  %tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
  %tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
  %tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
  %tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,0,1,0,1,0,1,0};
  foo(input);
}

Clang compiles this into:

  call void @llvm.memset.p0i8.i64(i8* %tmp, i8 0, i64 64, i32 16, i1 false)
  %0 = getelementptr [8 x i64]* %input, i64 0, i64 0
  store i64 1, i64* %0, align 16
  %1 = getelementptr [8 x i64]* %input, i64 0, i64 2
  store i64 1, i64* %1, align 16
  %2 = getelementptr [8 x i64]* %input, i64 0, i64 4
  store i64 1, i64* %2, align 16
  %3 = getelementptr [8 x i64]* %input, i64 0, i64 6
  store i64 1, i64* %3, align 16

Which gets codegen'd into:

    pxor    %xmm0, %xmm0
    movaps  %xmm0, -16(%rbp)
    movaps  %xmm0, -32(%rbp)
    movaps  %xmm0, -48(%rbp)
    movaps  %xmm0, -64(%rbp)
    movq    $1, -64(%rbp)
    movq    $1, -48(%rbp)
    movq    $1, -32(%rbp)
    movq    $1, -16(%rbp)

It would be better to have 4 movq's of 0 instead of the movaps's.

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef". Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop. One trivial example is:

#include <stdio.h>
int main() {
    int nRet = 17;
    int nLoop;
    for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
        if ( nLoop & 1 )
            nRet += 2;
        else
            nRet -= 1;
    }
    return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size. The resultant code would then also be suitable for
exit value computation.
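
A sketch of the loop body after unrolling by 2; the parity test folds away
because each iteration now handles one even and one odd nLoop value:

    for ( nLoop = 0; nLoop < 1000; nLoop += 2 ) {
        nRet -= 1;   /* even iteration: (nLoop & 1) == 0 */
        nRet += 2;   /* odd iteration:  (nLoop & 1) == 1 */
    }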

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc,
x86, etc. On X86, we miss a bunch of 'rotate by variable' cases because the
rotate matching code in dag combine doesn't look through truncates
aggressively enough. Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y) {
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x) {
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x) {
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f2/f3/f4 right. On x86-32, a few of these
generate truly horrible code, instead of using shld and friends. On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness. PPC64 misses f, f5 and f6. CellSPU aborts in isel.

//===---------------------------------------------------------------------===//

This (and similar related idioms):

unsigned int foo(unsigned char i) {
  return i | (i<<8) | (i<<16) | (i<<24);
}

compiles into:

define i32 @foo(i8 zeroext %i) nounwind readnone ssp noredzone {
entry:
  %conv = zext i8 %i to i32
  %shl = shl i32 %conv, 8
  %shl5 = shl i32 %conv, 16
  %shl9 = shl i32 %conv, 24
  %or = or i32 %shl9, %conv
  %or6 = or i32 %or, %shl5
  %or10 = or i32 %or6, %shl
  ret i32 %or10
}

it would be better as:

unsigned int bar(unsigned char i) {
  unsigned int j = i | (i << 8);
  return j | (j << 16);
}

aka:

define i32 @bar(i8 zeroext %i) nounwind readnone ssp noredzone {
entry:
  %conv = zext i8 %i to i32
  %shl = shl i32 %conv, 8
  %or = or i32 %shl, %conv
  %shl5 = shl i32 %or, 16
  %or6 = or i32 %shl5, %or
  ret i32 %or6
}

or even i*0x01010101, depending on the speed of the multiplier. The best way
to handle this is to canonicalize it to a multiply in IR and have codegen
handle lowering multiplies to shifts on cpus where shifts are faster.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together. For
example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy. This can only
be done safely if "b" isn't modified between the strlen and memcpy of course.

//===---------------------------------------------------------------------===//

We compile this program: (from GCC PR11680)
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

Into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%    0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%    0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2). */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper). Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

This is equivalent to the following, where 2863311531 is the multiplicative
inverse of 3, and 1431655766 is ((2^32)-1)/3+1:
void bar(unsigned n) {
  if (n * 2863311531U < 1431655766U)
    true();
}

The same transformation can work with an even modulo with the addition of a
rotate: rotate the result of the multiply to the right by the number of bits
which need to be zero for the condition to be true, and shrink the compare
RHS by the same amount. Unless the target supports rotates, though, that
transformation probably isn't worthwhile.

The transformation can also easily be made to work with non-zero equality
comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0".
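
A sketch for an even modulus, e.g. n % 6 == 0: the divisor 6 = 2*3 needs one
rotate-right, 2863311531 is again the multiplicative inverse of 3, and
715827882 is ((2^32)-1)/6 (constants worked out by hand here, not compiler
output):

unsigned ror1(unsigned x) { return (x >> 1) | (x << 31); }

int divisible_by_6(unsigned n) {
  return ror1(n * 2863311531U) <= 715827882U;
}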

//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6       ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0. Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
  if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
    f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK)". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing. Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
 return (abs(x)) >= 0;
}
This should optimize to x != INT_MIN. (With -fwrapv.) Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
  a = (a << 10) | (a >> 22);
  if (a == 123)
    bar ();
}
void
minus_cst (unsigned int a)
{
  unsigned int tem;

  tem = 20 - a;
  if (tem == 5)
    bar ();
}
void
mask_gt (unsigned int a)
{
  /* This is equivalent to a > 15. */
  if ((a & ~7) > 8)
    bar ();
}
void
rshift_gt (unsigned int a)
{
  /* This is equivalent to a > 23. */
  if ((a >> 2) > 5)
    bar ();
}
All should simplify to a single comparison. All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1". Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)". Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a". Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane. Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c). Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31". Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)". Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

  %tmp = icmp eq i32 %decl_context, 4
  %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
  %tmp1 = icmp eq i32 %decl_context_addr.0, 1
  %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp && decl_context == 1)

This allows recursive simplifications; tmp1 is used all over the place in
the function, e.g. by:

  %tmp23 = icmp eq i32 %decl_context_addr.1, 0            ; <i1> [#uses=1]
  %tmp24 = xor i1 %tmp1, true                             ; <i1> [#uses=1]
  %or.cond8 = and i1 %tmp23, %tmp24                       ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

[STORE SINKING]

Store sinking: This code:

void f (int n, int *cond, int *res) {
  int i;
  *res = 0;
  for (i = 0; i < n; i++)
    if (*cond)
      *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out. This gives us this code:

bb:             ; preds = %bb2, %entry
  %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
  %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
  %1 = load i32* %cond, align 4
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %bb2, label %bb1

bb1:            ; preds = %bb
  %3 = xor i32 %.rle, 234
  store i32 %3, i32* %res, align 4
  br label %bb2

bb2:            ; preds = %bb, %bb1
  %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
  %indvar.next = add i32 %i.05, 1
  %exitcond = icmp eq i32 %indvar.next, %n
  br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partial dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.
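
A source-level sketch of the better placement:

int test (int a, int b, int c, int g) {
  int m = b * c;            /* one multiply, above the if */
  int d = a ? m : b - c;
  int e = m + g;
  return d + e;
}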

//===---------------------------------------------------------------------===//

[STORE SINKING]

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store. We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[LOAD PRE CRIT EDGE SPLITTING]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic. This could be handled by GVN with some crazy
symbolic phi translation. The code we get looks like (g is on the stack):

bb2:            ; preds = %bb1
..
  %9 = getelementptr %struct.f* %g, i32 0, i32 0
  store i32 %8, i32* %9, align 4
  br label %bb3

bb3:            ; preds = %bb1, %bb2, %bb
  %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
  %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
  %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
  %11 = load i32* %10, align 4

%11 is partially redundant, and in BB2 it should have the value %8.

GCC PR33344 and PR35287 are similar cases.

//===---------------------------------------------------------------------===//

[LOAD PRE]

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite; ones we don't get yet are (checked through loadpre25):

[CRIT EDGE BREAKING]
loadpre3.c predcom-4.c

[PRE OF READONLY CALL]
loadpre5.c

[TURN SELECT INTO BRANCH]
loadpre14.c loadpre15.c

actually a conditional increment: loadpre18.c loadpre19.c

//===---------------------------------------------------------------------===//

[LOAD PRE / STORE SINKING / SPEC HACK]

This is a chunk of code from 456.hmmer:

int f(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int *dpp,
      int *tpdm, int xmb, int *bp, int *ms) {
  int k, sc;
  for (k = 1; k <= M; k++) {
    mc[k] = mpp[k-1] + tpmm[k-1];
    if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
    mc[k] += ms[k];
  }
}

It is very profitable for this benchmark to turn the conditional stores to
mc[k] into a conditional move (select instr in IR) and allow the final store
to do the store. See GCC PR27313 for more details. Note that this is valid
to xform even with the new C++ memory model, since mc[k] is previously loaded
and later stored.

//===---------------------------------------------------------------------===//

[SCALAR PRE]
There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the
GCC testsuite.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite. For example, we get the first example in predcom-1.c, but
miss the second one:

unsigned fib[1000];
unsigned avg[1000];

__attribute__ ((noinline))
void count_averages(int n) {
  int i;
  for (i = 1; i < n; i++)
    avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;
}

which compiles into two loads instead of one in the loop.

predcom-2.c is the same as predcom-1.c.

predcom-3.c is very similar but needs loads feeding each other instead of
store->load.

//===---------------------------------------------------------------------===//

[ALIAS ANALYSIS]

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

We should do better analysis of posix_memalign. At the least it should
no-capture its pointer argument; at best, we should know that the out-value
result doesn't point to anything (like malloc). One example of this is in
SingleSource/Benchmarks/Misc/dt.c

//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store. This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

struct X { int i; };
int foo (int x) {
  struct X a;
  struct X b;
  struct X *p;
  a.i = 1;
  b.i = 2;
  if (x)
    p = &a;
  else
    p = &b;
  return p->i;
}

//===---------------------------------------------------------------------===//

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
      opt -mem2reg -gvn -instcombine | llvm-dis
we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, loading from null is undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character. See PR3253 for some notes on
codegen.

456.hmmer apparently uses strcspn and strspn a lot. 471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
  else if (strchr ("<>", *intel_parser.op_string)

Those should be turned into a switch.
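
A sketch of the switch form for the first test. Note that strchr also
matches the terminating NUL, so an exact transform needs a '\0' case too:

switch (*intel_parser.op_string) {
case '\0':  /* strchr matches the NUL terminator as well */
case '+': case '-': case '/': case '*': case '%':
case '|': case '&': case '^': case ':':
case '[': case ']': case '(': case ')': case '~':
  /* ... the strchr-matched path ... */
  break;
}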

//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

  %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
  %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
  %strlen = call i32 @strlen(i8* %3072)    ; uses = 1
  %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
  call void @llvm.memcpy.i32(i8* %endptr,
    i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
  %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple of reasons. First, the strlen that follows
the memcpy can be replaced with:

  %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

Because the destination was just copied into the specified memory buffer.
This, in turn, can be constant folded to "4".

In other code, it contains:

  %endptr6978 = bitcast i8* %endptr69 to i32*
  store i32 7107374, i32* %endptr6978, align 1
  %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded. Whatever is producing this should
probably be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit
  %682 = getelementptr i8** %argv, i32 6          ; <i8**> [#uses=2]
  %683 = load i8** %682, align 4                  ; <i8*> [#uses=4]
  %684 = load i8* %683, align 1                   ; <i8> [#uses=1]
  %685 = icmp eq i8 %684, 0                       ; <i1> [#uses=1]
  br i1 %685, label %bb10, label %bb9

bb9:            ; preds = %bb8
  %686 = call i32 @strlen(i8* %683) nounwind readonly
  %687 = icmp ugt i32 %686, 254                   ; <i1> [#uses=1]
  br i1 %687, label %bb10, label %bb11

bb10:           ; preds = %bb9, %bb8
  %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

//===---------------------------------------------------------------------===//
Chris Lattner9fee08f2009-01-08 07:34:55 +00001466
I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
which looks like:
  %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0

bb62:  ; preds = %bb55, %bb53
  %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
  %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
  %172 = add i32 %171, -1  ; <i32> [#uses=1]
  %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172

... no stores ...
  br i1 %or.cond, label %bb65, label %bb72

bb65:  ; preds = %bb62
  store i8 0, i8* %173, align 1
  br label %bb72

bb72:  ; preds = %bb65, %bb62
  %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
  %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1

Note that on the bb62->bb72 path, the %177 strlen call is partially redundant
with the %171 call. At worst, we could shove the %177 strlen call up into the
bb65 block, moving it out of the bb62->bb72 path. However, note that bb65
stores to the string, zeroing out the last byte. This means that on that path
the value of %177 is actually just %171-1. A sub is cheaper than a strlen!

This pattern repeats several times, basically doing:

  A = strlen(P);
  P[A-1] = 0;
  B = strlen(P);

where it is "obvious" that B = A-1.

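In other words, the second call could be rewritten to a subtraction, e.g.:

  A = strlen(P);
  P[A-1] = 0;
  B = A - 1;    /* the second strlen folds to a sub */
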
//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
  i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
  i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call @printf(i8* ...  @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

  memcpy(globalarray, "string");
  printf(..., globalarray);

Anyway, since printf just reads the memory, forward substituting the string
directly into the printf call eliminates reads from globalarray. Since this
pattern occurs frequently in crafty (due to the "DisplayTime" and other
similar functions) there are many stores to "out". Once all the printfs stop
using "out", all that is left is the memcpys into it. This should allow
globalopt to remove the "stored only" global.

//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
  %tmp0 = load i8* %p
  %tmp1 = ashr i8 %tmp0, 5
  %tmp2 = sext i8 %tmp1 to i32
  ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load with a shift.
For example, on x86 this currently gets this:

  movb  (%eax), %al
  sarb  $5, %al
  movsbl  %al, %eax

while it could get this:

  movsbl  (%eax), %eax
  sarl  $5, %eax

//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; }     // --> return false
int test2(int x) { return 2-x == x; }    // --> return x == 1 ?

Always foldable for odd constants: C-x == x means 2*x == C, which no integer
satisfies when C is odd. What is the rule for even C? For test2, 2*x == 2 is
also satisfied by x == 1+2^31 in wrapping arithmetic, so folding to x == 1
relies on signed overflow being undefined.

//===---------------------------------------------------------------------===//

PR3381: GEP to a field of size 0 inside a struct could be turned into a GEP
for the next field in the struct (which is at the same address).

For example, a store of float into { {{}}, float } could be turned into a
store to the float directly.

//===---------------------------------------------------------------------===//

The arg promotion pass should make use of nocapture to make its alias analysis
stuff much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int int_char(char m) {if(m>7) return 0; return m;}

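For example, char_int would ideally compile to something like this sketch
(exact parameter/return attributes depend on the ABI):

define signext i8 @char_int(i32 %m) nounwind readnone {
entry:
  %cmp = icmp sgt i32 %m, 7
  %conv = trunc i32 %m to i8
  %retval = select i1 %cmp, i8 0, i8 %conv
  ret i8 %retval
}
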
//===---------------------------------------------------------------------===//

int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }

Generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %a, 128    ; <i32> [#uses=1]
  %1 = icmp eq i32 %0, 0    ; <i1> [#uses=1]
  %2 = or i32 %b, 128    ; <i32> [#uses=1]
  %3 = and i32 %b, -129    ; <i32> [#uses=1]
  %b_addr.0 = select i1 %1, i32 %3, i32 %2    ; <i32> [#uses=1]
  ret i32 %b_addr.0
}

However, it's functionally equivalent to:

  b = (b & ~0x80) | (a & 0x80);

Which generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %b, -129    ; <i32> [#uses=1]
  %1 = and i32 %a, 128    ; <i32> [#uses=1]
  %2 = or i32 %0, %1    ; <i32> [#uses=1]
  ret i32 %2
}

This can be generalized for other forms:

  b = (b & ~0x80) | (a & 0x40) << 1;

//===---------------------------------------------------------------------===//

These two functions produce different code. They shouldn't:

#include <stdint.h>

uint8_t p1(uint8_t b, uint8_t a) {
  b = (b & ~0xc0) | (a & 0xc0);
  return (b);
}

uint8_t p2(uint8_t b, uint8_t a) {
  b = (b & ~0x40) | (a & 0x40);
  b = (b & ~0x80) | (a & 0x80);
  return (b);
}

define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63    ; <i8> [#uses=1]
  %1 = and i8 %a, -64    ; <i8> [#uses=1]
  %2 = or i8 %1, %0    ; <i8> [#uses=1]
  ret i8 %2
}

define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63    ; <i8> [#uses=1]
  %.masked = and i8 %a, 64    ; <i8> [#uses=1]
  %1 = and i8 %a, -128    ; <i8> [#uses=1]
  %2 = or i8 %1, %0    ; <i8> [#uses=1]
  %3 = or i8 %2, %.masked    ; <i8> [#uses=1]
  ret i8 %3
}

//===---------------------------------------------------------------------===//

IPSCCP does not currently propagate argument-dependent constants through
functions where it does not know all of the callers. This includes functions
with normal external linkage as well as templates, C99 inline functions etc.
Specifically, it does nothing to:

define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
entry:
  %0 = add nsw i32 %y, %z
  %1 = mul i32 %0, %x
  %2 = mul i32 %y, %z
  %3 = add nsw i32 %1, %2
  ret i32 %3
}

define i32 @test2() nounwind {
entry:
  %0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
  ret i32 %0
}

It would be interesting to extend IPSCCP to be able to handle simple cases
like this, where all of the arguments to a call are constant. Because IPSCCP
runs before inlining, trivial templates and inline functions are not yet
inlined. The results for a function + set of constant arguments should be
memoized in a map.

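With that extension, @test2 should fold completely; a sketch of the expected
result ((2+4)*1 + 2*4 == 14):

define i32 @test2() nounwind {
entry:
  ret i32 14
}
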
//===---------------------------------------------------------------------===//

The libcall constant folding stuff should be moved out of SimplifyLibcalls
into libanalysis' constantfolding logic. This would allow IPSCCP to be able to
handle simple things like this:

static int foo(const char *X) { return strlen(X); }
int bar() { return foo("abcd"); }

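With strlen folding available to IPSCCP, bar should then reduce to a constant
(strlen of "abcd" is 4); a sketch:

define i32 @bar() nounwind readnone {
entry:
  ret i32 4
}
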
//===---------------------------------------------------------------------===//

InstCombine should use SimplifyDemandedBits to remove the or instruction:

define i1 @test(i8 %x, i8 %y) {
  %A = or i8 %x, 1
  %B = icmp ugt i8 %A, 3
  ret i1 %B
}

Currently instcombine calls SimplifyDemandedBits with either all bits or just
the sign bit, if the comparison is obviously a sign test. In this case, we
only need all but the bottom two bits from %A, and if we gave that mask to SDB
it would delete the or instruction for us.

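Since bit 0 of %A cannot affect the ugt-3 test, the function should simplify
to (sketch):

define i1 @test(i8 %x, i8 %y) {
  %B = icmp ugt i8 %x, 3
  ret i1 %B
}
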
//===---------------------------------------------------------------------===//

functionattrs doesn't know much about memcpy/memset. This function should be
marked readnone rather than readonly, since it only twiddles local memory, but
functionattrs doesn't handle memset/memcpy/memmove aggressively:

struct X { int *p; int *q; };
int foo() {
  int i = 0, j = 1;
  struct X x, y;
  int **p;
  y.p = &i;
  x.q = &j;
  p = __builtin_memcpy (&x, &y, sizeof (int *));
  return **p;
}

This can be seen at:
$ clang t.c -S -o - -mkernel -O0 -emit-llvm | opt -functionattrs -S

//===---------------------------------------------------------------------===//

Missed instcombine transformation:

define i1 @a(i32 %x) nounwind readnone {
entry:
  %cmp = icmp eq i32 %x, 30
  %sub = add i32 %x, -30
  %cmp2 = icmp ugt i32 %sub, 9
  %or = or i1 %cmp, %cmp2
  ret i1 %or
}

This should be optimized to a single compare. Testcase derived from gcc.

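The or is true exactly when %x is outside [31, 39], so a single unsigned
compare should do it (sketch):

define i1 @a(i32 %x) nounwind readnone {
entry:
  %sub = add i32 %x, -31
  %cmp = icmp ugt i32 %sub, 8
  ret i1 %cmp
}
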
//===---------------------------------------------------------------------===//

Missed instcombine or reassociate transformation:
int a(int a, int b) { return (a==12)&(b>47)&(b<58); }

The sgt and slt should be combined into a single comparison. Testcase derived
from gcc.

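The combined form is the usual subtract-plus-unsigned-compare range check;
written by hand:

int a(int a, int b) { return (a==12) & ((unsigned)(b-48) < 10); }
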
//===---------------------------------------------------------------------===//

Missed instcombine transformation:

  %382 = srem i32 %tmp14.i, 64    ; [#uses=1]
  %383 = zext i32 %382 to i64    ; [#uses=1]
  %384 = shl i64 %381, %383    ; [#uses=1]
  %385 = icmp slt i32 %tmp14.i, 64    ; [#uses=1]

The srem can be transformed to an and because if %tmp14.i is negative, the
shift is undefined. Testcase derived from 403.gcc.

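That is, since the shift is only defined when %tmp14.i is non-negative, the
srem may be rewritten as a mask (sketch):

  %382 = and i32 %tmp14.i, 63
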
//===---------------------------------------------------------------------===//

This is a range comparison on a divided result (from 403.gcc):

  %1337 = sdiv i32 %1336, 8    ; [#uses=1]
  %.off.i208 = add i32 %1336, 7    ; [#uses=1]
  %1338 = icmp ult i32 %.off.i208, 15    ; [#uses=1]

We already catch this (removing the sdiv) if there isn't an add; we should
handle the 'add' as well. This is a common idiom with its builtin_alloca code.
C testcase:

int a(int x) { return (unsigned)(x/16+7) < 15; }

Another similar case involves truncations on 64-bit targets:

  %361 = sdiv i64 %.046, 8    ; [#uses=1]
  %362 = trunc i64 %361 to i32    ; [#uses=2]
...
  %367 = icmp eq i32 %362, 0    ; [#uses=1]
//===---------------------------------------------------------------------===//

Missed instcombine/dagcombine transformation:
define void @lshift_lt(i8 zeroext %a) nounwind {
entry:
  %conv = zext i8 %a to i32
  %shl = shl i32 %conv, 3
  %cmp = icmp ult i32 %shl, 33
  br i1 %cmp, label %if.then, label %if.end

if.then:
  tail call void @bar() nounwind
  ret void

if.end:
  ret void
}
declare void @bar() nounwind

The shift should be eliminated. Testcase derived from gcc.

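Since %conv is at most 255, %shl < 33 holds exactly when %a <= 4, i.e. the
compare should shrink to (sketch):

  %cmp = icmp ult i8 %a, 5
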
//===---------------------------------------------------------------------===//

These compile into different code; one gets recognized as a switch and the
other doesn't due to phase ordering issues (PR6212):

int test1(int mainType, int subType) {
  if (mainType == 7)
    subType = 4;
  else if (mainType == 9)
    subType = 6;
  else if (mainType == 11)
    subType = 9;
  return subType;
}

int test2(int mainType, int subType) {
  if (mainType == 7)
    subType = 4;
  if (mainType == 9)
    subType = 6;
  if (mainType == 11)
    subType = 9;
  return subType;
}

//===---------------------------------------------------------------------===//

The following test case (from PR6576):

define i32 @mul(i32 %a, i32 %b) nounwind readnone {
entry:
  %cond1 = icmp eq i32 %b, 0    ; <i1> [#uses=1]
  br i1 %cond1, label %exit, label %bb.nph
bb.nph:    ; preds = %entry
  %tmp = mul i32 %b, %a    ; <i32> [#uses=1]
  ret i32 %tmp
exit:    ; preds = %entry
  ret i32 0
}

could be reduced to:

define i32 @mul(i32 %a, i32 %b) nounwind readnone {
entry:
  %tmp = mul i32 %b, %a
  ret i32 %tmp
}

(When %b is zero, the product is also zero, so both paths return the same
value and the branch is dead.)

//===---------------------------------------------------------------------===//

We should use DSE + llvm.lifetime.end to delete dead vtable pointer updates.
See GCC PR34949.

Another interesting case is that something related could be used for variables
that go const after their ctor has finished. In these cases, globalopt (which
can statically run the constructor) could mark the global const (so it gets
put in the readonly section). A testcase would be:

#include <complex>
using namespace std;
const complex<char> should_be_in_rodata (42,-42);
complex<char> should_be_in_data (42,-42);
complex<char> should_be_in_bss;

We currently evaluate the ctors, but the globals don't become const because
the optimizer doesn't know they "become const" after the ctor is done. See
GCC PR4131 for more examples.

//===---------------------------------------------------------------------===//

In this code:

long foo(long x) {
  return x > 1 ? x : 1;
}

LLVM emits a comparison with 1 instead of 0. 0 would be equivalent
and cheaper on most targets.

LLVM prefers comparisons with zero over non-zero in general, but in this
case it chooses instead to keep the max operation obvious.

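Since x > 1 ? x : 1 and x > 0 ? x : 1 agree for every x, the zero-compare form
would be (sketch, assuming an LP64 target where long is i64):

define i64 @foo(i64 %x) nounwind readnone {
entry:
  %cmp = icmp sgt i64 %x, 0
  %sel = select i1 %cmp, i64 %x, i64 1
  ret i64 %sel
}
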
//===---------------------------------------------------------------------===//

Take the following testcase on x86-64 (similar testcases exist for all targets
with addc/adde):

define void @a(i64* nocapture %s, i64* nocapture %t, i64 %a, i64 %b,
i64 %c) nounwind {
entry:
  %0 = zext i64 %a to i128    ; <i128> [#uses=1]
  %1 = zext i64 %b to i128    ; <i128> [#uses=1]
  %2 = add i128 %1, %0    ; <i128> [#uses=2]
  %3 = zext i64 %c to i128    ; <i128> [#uses=1]
  %4 = shl i128 %3, 64    ; <i128> [#uses=1]
  %5 = add i128 %4, %2    ; <i128> [#uses=1]
  %6 = lshr i128 %5, 64    ; <i128> [#uses=1]
  %7 = trunc i128 %6 to i64    ; <i64> [#uses=1]
  store i64 %7, i64* %s, align 8
  %8 = trunc i128 %2 to i64    ; <i64> [#uses=1]
  store i64 %8, i64* %t, align 8
  ret void
}

Generated code:
  addq  %rcx, %rdx
  movl  $0, %eax
  adcq  $0, %rax
  addq  %r8, %rax
  movq  %rax, (%rdi)
  movq  %rdx, (%rsi)
  ret

Expected code:
  addq  %rcx, %rdx
  adcq  $0, %r8
  movq  %r8, (%rdi)
  movq  %rdx, (%rsi)
  ret

The generated SelectionDAG has an ADD of an ADDE, where both operands of the
ADDE are zero. Replacing one of the operands of the ADDE with the other
operand of the ADD, and replacing the ADD with the ADDE, should give the
desired result.

(That said, we are doing a lot better than gcc on this testcase. :) )

Switch lowering generates less than ideal code for the following switch:
define void @a(i32 %x) nounwind {
entry:
  switch i32 %x, label %if.end [
    i32 0, label %if.then
    i32 1, label %if.then
    i32 2, label %if.then
    i32 3, label %if.then
    i32 5, label %if.then
  ]
if.then:
  tail call void @foo() nounwind
  ret void
if.end:
  ret void
}
declare void @foo()

Generated code on x86-64 (other platforms give similar results):
a:
  cmpl  $5, %edi
  ja  .LBB0_2
  movl  %edi, %eax
  movl  $47, %ecx
  btq  %rax, %rcx
  jb  .LBB0_3
.LBB0_2:
  ret
.LBB0_3:
  jmp  foo  # TAILCALL

The movl+movl+btq+jb could be simplified to a cmpl+jne.

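A sketch of that form (after the existing range check, only the value 4 needs
to be excluded):

a:
  cmpl  $5, %edi
  ja  .LBB0_2
  cmpl  $4, %edi
  jne  .LBB0_3
.LBB0_2:
  ret
.LBB0_3:
  jmp  foo  # TAILCALL
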
Or, if we wanted to be really clever, we could simplify the whole thing to
something like the following, which eliminates a branch:

  xorl  $1, %edi
  cmpl  $4, %edi
  jbe  .LBB0_2
  ret
.LBB0_2:
  jmp  foo  # TAILCALL

(x is in the case set {0,1,2,3,5} exactly when x^1 <= 4, so the branch to the
tail call must be jbe, not ja.)

//===---------------------------------------------------------------------===//

Given a branch where the two target blocks are identical ("ret i32 %b" in
both), simplifycfg will simplify them away. But not so for a switch statement:

define i32 @f(i32 %a, i32 %b) nounwind readnone {
entry:
  switch i32 %a, label %bb3 [
    i32 4, label %bb
    i32 6, label %bb
  ]

bb:    ; preds = %entry, %entry
  ret i32 %b

bb3:    ; preds = %entry
  ret i32 %b
}
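
simplifycfg should reduce this to (sketch):

define i32 @f(i32 %a, i32 %b) nounwind readnone {
entry:
  ret i32 %b
}
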
//===---------------------------------------------------------------------===//

clang -O3 fails to devirtualize this virtual inheritance case (GCC PR45875).
Looks related to PR3100.

struct c1 {};
struct c10 : c1{
  virtual void foo ();
};
struct c11 : c10, c1{
  virtual void f6 ();
};
struct c28 : virtual c11{
  void f6 ();
};
void check_c28 () {
  c28 obj;
  c11 *ptr = &obj;
  ptr->f6 ();
}

//===---------------------------------------------------------------------===//

We compile this:

int foo(int a) { return (a & (~15)) / 16; }

into:

define i32 @foo(i32 %a) nounwind readnone ssp {
entry:
  %and = and i32 %a, -16
  %div = sdiv i32 %and, 16
  ret i32 %div
}

but (X & -A)/A is X >> log2(A) when A is a power of 2, so this case should be
instcombined into just "a >> 4".

We do get this at the codegen level, so something knows about it, but
instcombine should catch it earlier:

_foo:                                   ## @foo
## BB#0:                                ## %entry
  movl  %edi, %eax
  sarl  $4, %eax
  ret

//===---------------------------------------------------------------------===//

This code (from GCC PR28685):

int test(int a, int b) {
  int lt = a < b;
  int eq = a == b;
  if (lt)
    return 1;
  return eq;
}

is compiled to:

define i32 @test(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %cmp = icmp slt i32 %a, %b
  br i1 %cmp, label %return, label %if.end

if.end:    ; preds = %entry
  %cmp5 = icmp eq i32 %a, %b
  %conv6 = zext i1 %cmp5 to i32
  ret i32 %conv6

return:    ; preds = %entry
  ret i32 1
}

It could be:

define i32 @test__(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = icmp sle i32 %a, %b
  %retval = zext i1 %0 to i32
  ret i32 %retval
}

//===---------------------------------------------------------------------===//

This compare could fold to false:

define i1 @g(i32 %a) nounwind readnone {
  %add = shl i32 %a, 1
  %mul = shl i32 %a, 1
  %cmp = icmp ugt i32 %add, %mul
  ret i1 %cmp
}

Both shifts compute the same value, so %add == %mul and the ugt can never be
true.

//===---------------------------------------------------------------------===//

This code can be seen in viterbi:

  %64 = call noalias i8* @malloc(i64 %62) nounwind
...
  %67 = call i64 @llvm.objectsize.i64(i8* %64, i1 false) nounwind
  %68 = call i8* @__memset_chk(i8* %64, i32 0, i64 %62, i64 %67) nounwind

llvm.objectsize.i64 should be taught about malloc/calloc, allowing it to
fold to %62. This is a security win (overflows of malloc will get caught)
and also a performance win by exposing more memsets to the optimizer.

This occurs several times in viterbi.

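Once the objectsize folds to %62, the fortified call could then be simplified
to a plain memset (sketch, assuming the usual libc declaration of memset):

  %68 = call i8* @memset(i8* %64, i32 0, i64 %62) nounwind
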
//===---------------------------------------------------------------------===//
