Target Independent Opportunities:

//===---------------------------------------------------------------------===//

We should recognize idioms for add-with-carry and turn them into the
appropriate intrinsics. This example:

unsigned add32carry(unsigned sum, unsigned x) {
  unsigned z = sum + x;
  if (sum + x < x)
    z++;
  return z;
}

compiles to (clang t.c -S -o - -O3 -fomit-frame-pointer -m64 -mkernel):

_add32carry:                            ## @add32carry
    addl %esi, %edi
    cmpl %esi, %edi
    sbbl %eax, %eax
    andl $1, %eax
    addl %edi, %eax
    ret

with clang, but to:

_add32carry:
    leal (%rsi,%rdi), %eax
    cmpl %esi, %eax
    adcl $0, %eax
    ret

with gcc.

//===---------------------------------------------------------------------===//

Dead argument elimination should be enhanced to handle cases when an argument
is dead to an externally visible function. Though the argument can't be
removed from the externally visible function, the caller doesn't need to pass
it in. For example, in this testcase:

  void foo(int X) __attribute__((noinline));
  void foo(int X) { sideeffect(); }
  void bar(int A) { foo(A+1); }

We compile bar to:

define void @bar(i32 %A) nounwind ssp {
  %0 = add nsw i32 %A, 1    ; <i32> [#uses=1]
  tail call void @foo(i32 %0) nounwind noinline ssp
  ret void
}

The add is dead; we could pass in 'i32 undef' instead. This occurs for C++
templates etc., which usually have linkonce_odr/weak_odr linkage, not internal
linkage.

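A sketch of the caller we would like to end up with, with the dead add gone
and undef passed in its place:

define void @bar(i32 %A) nounwind ssp {
  tail call void @foo(i32 undef) nounwind noinline ssp
  ret void
}
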
//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call'
instructions so that the .td files don't list all the call-clobbered
registers as implicit defs. Instead, these should be added by the code
generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call
   instructions for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different
   clobber sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber
   sets of calls.

//===---------------------------------------------------------------------===//

We should recognize various "overflow detection" idioms and translate them
into llvm.uadd.with.overflow and similar intrinsics. Here is a multiply idiom:

unsigned int mul(unsigned int a, unsigned int b) {
  if ((unsigned long long)a*b > 0xffffffff)
    exit(0);
  return a*b;
}

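A sketch of the form we would like to produce for it, using the existing
llvm.umul.with.overflow intrinsic (assuming exit is lowered as shown):

define i32 @mul(i32 %a, i32 %b) {
  %t = call {i32, i1} @llvm.umul.with.overflow.i32(i32 %a, i32 %b)
  %ovf = extractvalue {i32, i1} %t, 1      ; product doesn't fit in 32 bits
  br i1 %ovf, label %trap, label %ok

trap:
  call void @exit(i32 0) noreturn
  unreachable

ok:
  %mul = extractvalue {i32, i1} %t, 0
  ret i32 %mul
}
declare {i32, i1} @llvm.umul.with.overflow.i32(i32, i32)
declare void @exit(i32)
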
//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math). Misc/mandel will like this. :) This
isn't safe in general, even on darwin. See the libm implementation of hypot
for examples (which special case when x/y are exactly zero to get signed
zeros etc right).

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
    movl Y, %eax
    shll $3, %eax
    orl X, %eax
    movl %eax, X
    ret

The problem is that the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack. But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
  x = 1ULL << i;
}

into:

long long tmp = 1;
for (i = ...; ++i, tmp += tmp)
  x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

First, the intrinsic needs to be extended to support integers, and second the
code generator needs to be enhanced to lower these to multiplication trees.

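A sketch of the desired output; note that llvm.powi currently only exists for
floating point types, so the i32 variant below is hypothetical:

define i32 @factorial(i32 %X) {
  %r = call i32 @llvm.powi.i32(i32 %X, i32 8)   ; hypothetical integer powi
  ret i32 %r
}
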
//===---------------------------------------------------------------------===//

Interesting testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

This is blocked on not handling X*X*X -> powi(X, 3) (see note above). The
issue is that we end up getting t = 2*X; s = t*t and don't turn this into
4*X*X, which is the same number of multiplies and is canonical, because the
2*X has multiple uses. Here's a simple example:

define i32 @test15(i32 %X1) {
  %B = mul i32 %X1, 47    ; X1*47
  %C = mul i32 %B, %B
  ret i32 %C
}

//===---------------------------------------------------------------------===//

Reassociate should handle the example in GCC PR16157:

extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;
void f () {  /* this can be optimized to four additions... */
  b4 = a4 + a3 + a2 + a1 + a0;
  b3 = a3 + a2 + a1 + a0;
  b2 = a2 + a1 + a0;
  b1 = a1 + a0;
}

This requires reassociating to forms of expressions that are already
available, something that reassoc doesn't think about yet.

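The four-addition form the comment above alludes to reuses each previous sum:

void f () {
  b1 = a1 + a0;
  b2 = a2 + b1;
  b3 = a3 + b2;
  b4 = a4 + b3;
}
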
//===---------------------------------------------------------------------===//

This function: (derived from GCC PR19988)
double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x + -0.1234 * y));
}

compiles to:
_foo:
    movapd %xmm1, %xmm2
    mulsd LCPI1_1(%rip), %xmm1
    mulsd LCPI1_0(%rip), %xmm2
    addsd %xmm0, %xmm1
    addsd %xmm0, %xmm2
    movapd %xmm1, %xmm0
    mulsd %xmm2, %xmm0
    ret

Reassociate should be able to turn it into:

double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x - 0.1234 * y));
}

Which allows the multiply by constant to be CSE'd, producing:

_foo:
    mulsd LCPI1_0(%rip), %xmm1
    movapd %xmm1, %xmm2
    addsd %xmm0, %xmm2
    subsd %xmm1, %xmm0
    mulsd %xmm2, %xmm0
    ret

This doesn't need -ffast-math support at all. This is particularly bad because
the llvm-gcc frontend is canonicalizing the latter into the former, but clang
doesn't have this problem.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) { return memcmp(j, l, 4); }
int h(int *j, int *l) { return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1, 2, 4, and 8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize. It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is
equal to the type size. It works but can be overly conservative as the
alignment of specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

v4sf example(float *P) {
  return (v4sf){ P[0], P[1], P[2], P[3] };
}

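A sketch of the desired IR, assuming v4sf is a 16-byte vector type; the
'align 4' marks the load as potentially under-aligned:

define <4 x float> @example(float* %P) {
  %p = bitcast float* %P to <4 x float>*
  %v = load <4 x float>* %p, align 4
  ret <4 x float> %v
}
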
//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns. Instead
of:

    movl 136(%esp), %eax
    cmpl $0, %eax
    je LBB16_2      # cond_next
LBB16_1:    # cond_true
    incl _foo
LBB16_2:    # cond_next

emit:

    movl _foo, %eax
    cmpl $1, %edi
    sbbl $-1, %eax
    movl %eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
  double sincos(double x, double *sin, double *cos);
  float sincosf(float x, float *sin, float *cos);
  long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers. See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs. We could even make an intrinsic for this
if anyone cared enough about sincos.

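A sketch of the expansion, which exposes the destinations to SROA:

  double s, c;
  sincos(x, &s, &c);

becomes:

  double s = sin(x);
  double c = cos(x);
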
//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

  for(i=0; i<reg->size; i++)
    {
      /* Flip the target bit of each basis state */
      reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
    }

Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be
just so cool to turn it into something like:

  long long Res = ((MAX_UNSIGNED) 1 << target);
  if (target < 32) {
    for(i=0; i<reg->size; i++)
      reg->node[i].state ^= Res & 0xFFFFFFFFULL;
  } else {
    for(i=0; i<reg->size; i++)
      reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
  }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias
reg->node[i], but this requires TBAA.

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
  unsigned t;
  t = v ^ ((v << 16) | (v >> 16));
  t &= ~0xff0000;
  v = (v << 24) | (v >> 8);
  return v ^ (t >> 8);
}

Neither is this (very standard idiom):

int f(int n)
{
  return (((n) << 24) | (((n) & 0xff00) << 8)
      | (((n) >> 8) & 0xff00) | ((n) >> 24));
}

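Both should canonicalize to a single byte-swap intrinsic call, roughly (the
32-bit case; the wrapper name is invented):

define i32 @swap32(i32 %n) {
  %r = call i32 @llvm.bswap.i32(i32 %n)
  ret i32 %r
}
declare i32 @llvm.bswap.i32(i32)
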
//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

BITBOARD = unsigned long long
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

This is a form of idiom recognition for loops, the same thing that could be
useful for recognizing memset/memcpy.

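Each of these wants to become a single intrinsic call, e.g. for the 32-bit
cases (sketch):

define i32 @popcount32(i32 %v) {
  %c = call i32 @llvm.ctpop.i32(i32 %v)
  ret i32 %c
}
declare i32 @llvm.ctpop.i32(i32)
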
//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned. Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match. See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level. We need a "loops to memcpy"
pass.

//===---------------------------------------------------------------------===//

[LOOP OPTIMIZATION]

SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization
opportunities in its double_array_divs_variable function: it needs loop
interchange, memory promotion (which LICM already does), vectorization and
variable trip count loop unrolling (since it has a constant trip count). ICC
apparently produces this very nice code with -ffast-math:

..B1.70:                # Preds ..B1.70 ..B1.69
    mulpd %xmm0, %xmm1      #108.2
    mulpd %xmm0, %xmm1      #108.2
    mulpd %xmm0, %xmm1      #108.2
    mulpd %xmm0, %xmm1      #108.2
    addl $8, %edx           #
    cmpl $131072, %edx      #108.2
    jb ..B1.70              # Prob 99%  #108.2

It would be better to count down to zero, but this is a lot better than what
we do.

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
  U64 effective_addr2;
  U32 temp = *inst;
  int r1 = (temp >> 20) & 0xf;
  int b2 = (temp >> 16) & 0xf;
  effective_addr2 = temp & 0xfff;
  if (b2) effective_addr2 += regs[b2];
  b2 = (temp >> 12) & 0xf;
  if (b2) effective_addr2 += regs[b2];
  effective_addr2 &= regs[4];
  if ((effective_addr2 & 3) == 0)
    return 1;
  return 0;
}

Note that only the low 2 bits of effective_addr2 are used. On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags. On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

PHI Slicing could be extended to do this.

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has from TargetData. This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two near-identical IVs (after promotion) on PPC/ARM:

LBB1_2:
    ldr r3, LCPI1_0
    ldr r3, [r3]
    strh r2, [r3]
    ldr r3, LCPI1_1
    ldr r3, [r3]
    strh r1, [r3]
    add r1, r1, #4
    add r2, r2, #1  <- [0,+,1]
    sub r0, r0, #1  <- [0,-,1]
    cmp r0, #0
    bne LBB1_2

LSR should reuse the "+" IV for the exit test.

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
  %tmp.1 = and i32 %a, 1                 ; <i32> [#uses=1]
  %tmp.2 = icmp ne i32 %tmp.1, 0         ; <i1> [#uses=1]
  br i1 %tmp.2, label %then.0, label %else.0

then.0:         ; preds = %entry
  %tmp.5 = add i32 %a, -1                ; <i32> [#uses=1]
  %tmp.3 = call i32 @t4( i32 %tmp.5 )    ; <i32> [#uses=1]
  br label %return

else.0:         ; preds = %entry
  %tmp.7 = icmp ne i32 %a, 0             ; <i1> [#uses=1]
  br i1 %tmp.7, label %then.1, label %return

then.1:         ; preds = %else.0
  %tmp.11 = add i32 %a, -2               ; <i32> [#uses=1]
  %tmp.9 = call i32 @t4( i32 %tmp.11 )   ; <i32> [#uses=1]
  br label %return

return:         ; preds = %then.1, %else.0, %then.0
  %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                      [ %tmp.9, %then.1 ]
  ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative. "return foo() << 1" can be tail recursion eliminated.

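A sketch of the accumulator form TRE would produce for pow2m1, exploiting the
fact that the x -> 2*x+1 step can be iterated starting from the base case:

int pow2m1(int n) {
  int acc = 0;              /* pow2m1(0) */
  while (n-- > 0)
    acc = 2 * acc + 1;      /* apply one level of the recursion */
  return acc;
}
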
//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
  %tmp = load i32* %x                    ; <i32> [#uses=0]
  %tmp.foo = call i32 @foo( i32* %x )    ; <i32> [#uses=1]
  ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
  %tmp3 = call i32 @foo( i32* %x )       ; <i32> [#uses=1]
  ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass. Consider this silly
example in PIC mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:

_foo:
    subl $28, %esp
    call "L1$pb"
"L1$pb":
    popl %eax
    cmpl $0, 32(%esp)
    je LBB1_2       # cond_true
LBB1_1:     # return
    # ...
    addl $28, %esp
    ret
LBB1_2:     # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block. It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it. This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86. If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

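A sketch of the idea, with invented case values chosen so that a trivial hash
(x & 3) happens to be perfect for them; a real implementation would search
for suitable hash constants:

int dispatch(unsigned x) {
  /* case values 12, 101, 4098, 70003: x & 3 maps them to distinct
     slots 0..3, so it is a (trivial) perfect hash for this key set */
  static const unsigned keys[4] = { 12, 101, 4098, 70003 };
  static const int results[4]   = { 10, 20, 30, 40 };
  unsigned h = x & 3;
  if (keys[h] == x)     /* keys are sparse, so validate before dispatch */
    return results[h];
  return -1;            /* default case */
}
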
//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations. On a Yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar. On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable. For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:
    pushl %esi
    movl 8(%esp), %eax
    movb _m_HotKey+3, %cl
    movb _m_HotKey+4, %dl
    movb _m_HotKey+2, %ch
    movw _m_HotKey, %si
    movw %si, (%eax)
    movb %ch, 2(%eax)
    movb %cl, 3(%eax)
    movb %dl, 4(%eax)
    popl %esi
    ret $4

GCC produces:

__Z9GetHotKeyv:
    movl _m_HotKey, %edx
    movl 4(%esp), %eax
    movl %edx, (%eax)
    movzwl _m_HotKey+4, %edx
    movw %dx, 4(%eax)
    ret $4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

  %struct.THotKey = type { i16, i8, i8, i8 }
define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
...
  %tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
  %tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
  %tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
  %tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,0,1,0,1,0,1,0};
  foo(input);
}

We currently compile this into a memcpy from a global array since the
initializer is fairly large and not memset'able. This is good, but the memcpy
gets lowered to load/stores in the code generator. This is also ok, except
that the codegen lowering for memcpy doesn't handle the case when the source
is a constant global. This gives us atrocious code like this:

    call "L1$pb"
"L1$pb":
    popl %eax
    movl _C.0.1444-"L1$pb"+32(%eax), %ecx
    movl %ecx, 40(%esp)
    movl _C.0.1444-"L1$pb"+20(%eax), %ecx
    movl %ecx, 28(%esp)
    movl _C.0.1444-"L1$pb"+36(%eax), %ecx
    movl %ecx, 44(%esp)
    movl _C.0.1444-"L1$pb"+44(%eax), %ecx
    movl %ecx, 52(%esp)
    movl _C.0.1444-"L1$pb"+40(%eax), %ecx
    movl %ecx, 48(%esp)
    movl _C.0.1444-"L1$pb"+12(%eax), %ecx
    movl %ecx, 20(%esp)
    movl _C.0.1444-"L1$pb"+4(%eax), %ecx
...

instead of:
    movl $1, 16(%esp)
    movl $0, 20(%esp)
    movl $1, 24(%esp)
    movl $0, 28(%esp)
    movl $1, 32(%esp)
    movl $0, 36(%esp)
    ...

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef". Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop. One trivial example is:

#include <stdio.h>
int main() {
  int nRet = 17;
  int nLoop;
  for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
    if ( nLoop & 1 )
      nRet += 2;
    else
      nRet -= 1;
  }
  return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size. The resultant code would then also be suitable for
exit value computation.

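A sketch of the unrolled-by-2 loop body once the '&1' tests fold away (the
trip count of 1000 is even, so no epilogue is needed):

  for ( nLoop = 0; nLoop < 1000; nLoop += 2 ) {
    nRet -= 1;    /* even iteration */
    nRet += 2;    /* odd iteration */
  }
  /* net +1 per pair; exit value is 17 + 500 = 517 */
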
//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc,
x86, etc. On X86, we miss a bunch of 'rotate by variable' cases because the
rotate matching code in dag combine doesn't look through truncates
aggressively enough. Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y){
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x){
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x){
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f2/f3/f4 right. On x86-32, a few of these
generate truly horrible code, instead of using shld and friends. On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness. PPC64 misses f, f5 and f6. CellSPU aborts in isel.

//===---------------------------------------------------------------------===//

This (and similar related idioms):

unsigned int foo(unsigned char i) {
  return i | (i<<8) | (i<<16) | (i<<24);
}

compiles into:

define i32 @foo(i8 zeroext %i) nounwind readnone ssp noredzone {
entry:
  %conv = zext i8 %i to i32
  %shl = shl i32 %conv, 8
  %shl5 = shl i32 %conv, 16
  %shl9 = shl i32 %conv, 24
  %or = or i32 %shl9, %conv
  %or6 = or i32 %or, %shl5
  %or10 = or i32 %or6, %shl
  ret i32 %or10
}

it would be better as:

unsigned int bar(unsigned char i) {
  unsigned int j=i | (i << 8);
  return j | (j<<16);
}

aka:

define i32 @bar(i8 zeroext %i) nounwind readnone ssp noredzone {
entry:
  %conv = zext i8 %i to i32
  %shl = shl i32 %conv, 8
  %or = or i32 %shl, %conv
  %shl5 = shl i32 %or, 16
  %or6 = or i32 %shl5, %or
  ret i32 %or6
}

or even i*0x01010101, depending on the speed of the multiplier. The best way
to handle this is to canonicalize it to a multiply in IR and have codegen
handle lowering multiplies to shifts on cpus where shifts are faster.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together. For
example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy. This can only
be done safely if "b" isn't modified between the strlen and memcpy of course.

//===---------------------------------------------------------------------===//

We compile this program: (from GCC PR11680)
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

Into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%    0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%    0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:

void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2). */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper). Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

This is equivalent to the following, where 2863311531 is the multiplicative
inverse of 3, and 1431655766 is ((2^32)-1)/3+1:

void bar(unsigned n) {
  if (n * 2863311531U < 1431655766U)
    true();
}

The same transformation can work with an even modulo with the addition of a
rotate: rotate the result of the multiply to the right by the number of bits
which need to be zero for the condition to be true, and shrink the compare
RHS by the same amount. Unless the target supports rotates, though, that
transformation probably isn't worthwhile.

The transformation can also easily be made to work with non-zero equality
comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0".

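A quick sanity check of the rewrite (the identity holds for all 32-bit n
since 3 * 2863311531 == 1 (mod 2^32); the loop below just spot-checks a
range):

#include <assert.h>
int main(void) {
  unsigned n;
  for (n = 0; n < (1u << 20); ++n)
    assert((n % 3 == 0) == (n * 2863311531u < 1431655766u));
  return 0;
}
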
//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and
a bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6       ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:

int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}

Both should combine to ((a|b) & (c-1)) != 0. Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:

#define PMD_MASK (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
  if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
    f();
}

The expression should optimize to something like
"!((start|end)&~PMD_MASK)". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}

These should combine to the same thing. Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:

#define abs(x) x>0?x:-x
int f(int x, int y)
{
  return (abs(x)) >= 0;
}

This should optimize to x == INT_MIN. (With -fwrapv.) Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:

void
rotate_cst (unsigned int a)
{
  a = (a << 10) | (a >> 22);
  if (a == 123)
    bar ();
}
void
minus_cst (unsigned int a)
{
  unsigned int tem;

  tem = 20 - a;
  if (tem == 5)
    bar ();
}
void
mask_gt (unsigned int a)
{
  /* This is equivalent to a > 15. */
  if ((a & ~7) > 8)
    bar ();
}
void
rshift_gt (unsigned int a)
{
  /* This is equivalent to a > 23. */
  if ((a >> 2) > 5)
    bar ();
}

All should simplify to a single comparison. All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 32605:

int c(int* x) {return (char*)x+2 == (char*)x;}

Should combine to 0. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}

Should be combined to "((b >> 1) | b) & 1". Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}

Should combine to "x | (y & 3)". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}

Should fold to "(~a & c) | (a & b)". Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (~(a|b))|a;}

Should fold to "a|~b". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}

Should fold to "a". Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}

Should fold to "a ? b : c", or at least something sane. Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}

Should fold to a && (b || c). Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}

Should combine to x | 8. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}

Should also combine to x | 8. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}

Should combine to x & -9. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}

Should combine to "a * 0x88888888 >> 31". Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}

There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}

Should combine to "20 * (((unsigned)x) & -2)". Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

  %tmp = icmp eq i32 %decl_context, 4
  %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
  %tmp1 = icmp eq i32 %decl_context_addr.0, 1
  %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp || decl_context == 1)

This allows recursive simplifications; tmp1 is used all over the place in
the function, e.g. by:

  %tmp23 = icmp eq i32 %decl_context_addr.1, 0    ; <i1> [#uses=1]
  %tmp24 = xor i1 %tmp1, true                     ; <i1> [#uses=1]
  %or.cond8 = and i1 %tmp23, %tmp24               ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

[STORE SINKING]

Store sinking: This code:

void f (int n, int *cond, int *res) {
  int i;
  *res = 0;
  for (i = 0; i < n; i++)
    if (*cond)
      *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out. This gives us this code:

bb:             ; preds = %bb2, %entry
  %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
  %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
  %1 = load i32* %cond, align 4
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %bb2, label %bb1

bb1:            ; preds = %bb
  %3 = xor i32 %.rle, 234
  store i32 %3, i32* %res, align 4
  br label %bb2

bb2:            ; preds = %bb, %bb1
  %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
  %indvar.next = add i32 %i.05, 1
  %exitcond = icmp eq i32 %indvar.next, %n
  br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partial dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.

//===---------------------------------------------------------------------===//

[STORE SINKING]

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
}
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll),
but we don't sink the store. We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[LOAD PRE CRIT EDGE SPLITTING]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic. This could be handled by GVN with some crazy
symbolic phi translation. The code we get looks like (g is on the stack):

bb2:            ; preds = %bb1
..
  %9 = getelementptr %struct.f* %g, i32 0, i32 0
  store i32 %8, i32* %9, align 4
  br label %bb3

bb3:            ; preds = %bb1, %bb2, %bb
  %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
  %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
  %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
  %11 = load i32* %10, align 4

%11 is partially redundant; in BB2 it should have the value %8.

GCC PR33344 and PR35287 are similar cases.

//===---------------------------------------------------------------------===//

[LOAD PRE]

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in
the GCC testsuite; ones we don't get yet are (checked through loadpre25):

[CRIT EDGE BREAKING]
loadpre3.c predcom-4.c

[PRE OF READONLY CALL]
loadpre5.c

[TURN SELECT INTO BRANCH]
loadpre14.c loadpre15.c

actually a conditional increment: loadpre18.c loadpre19.c

//===---------------------------------------------------------------------===//

[LOAD PRE / STORE SINKING / SPEC HACK]

This is a chunk of code from 456.hmmer:

int f(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int *dpp,
      int *tpdm, int xmb, int *bp, int *ms) {
  int k, sc;
  for (k = 1; k <= M; k++) {
    mc[k] = mpp[k-1] + tpmm[k-1];
    if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
    mc[k] += ms[k];
  }
}

It is very profitable for this benchmark to turn the conditional stores to
mc[k] into a conditional move (select instr in IR) and allow the final store
to do the store. See GCC PR27313 for more details. Note that this is valid
to xform even with the new C++ memory model, since mc[k] is previously loaded
and later stored.

//===---------------------------------------------------------------------===//

[SCALAR PRE]
There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the
GCC testsuite.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in
the GCC testsuite. For example, we get the first example in predcom-1.c, but
miss the second one:

unsigned fib[1000];
unsigned avg[1000];

__attribute__ ((noinline))
void count_averages(int n) {
  int i;
  for (i = 1; i < n; i++)
    avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;
}

which compiles into two loads instead of one in the loop.

predcom-2.c is the same as predcom-1.c

predcom-3.c is very similar but needs loads feeding each other instead of
store->load.

//===---------------------------------------------------------------------===//

[ALIAS ANALYSIS]

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

We should do better analysis of posix_memalign. At the least it should
nocapture its pointer argument; at best, we should know that the out-value
result doesn't point to anything (like malloc). One example of this is in
SingleSource/Benchmarks/Misc/dt.c

//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store. This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

struct X { int i; };
int foo (int x) {
  struct X a;
  struct X b;
  struct X *p;
  a.i = 1;
  b.i = 2;
  if (x)
    p = &a;
  else
    p = &b;
  return p->i;
}

//===---------------------------------------------------------------------===//

Interesting missed case because of control flow flattening (should be 2
loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
      opt -mem2reg -gvn -instcombine | llvm-dis
we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, loading from null is undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character. See PR3253 for some notes
on codegen.

456.hmmer apparently uses strcspn and strspn a lot. 471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//

1414"gas" uses this idiom:
1415 else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
1416..
1417 else if (strchr ("<>", *intel_parser.op_string)
1418
1419Those should be turned into a switch.
1420
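A sketch for the second test; note that strchr also matches the terminating
NUL, so a faithful transform must include case '\0' (or prove the character
is never NUL):

int matches(const char *op_string) {
  switch (*op_string) {
  case '<': case '>':
  case '\0':    /* strchr(s, 0) finds the terminator, so NUL "matches" */
    return 1;
  default:
    return 0;
  }
}
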
//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

  %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
  %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
  %strlen = call i32 @strlen(i8* %3072)    ; uses = 1
  %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
  call void @llvm.memcpy.i32(i8* %endptr,
    i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
  %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple of reasons.  First, the strlen of the
memcpy'd buffer can be replaced with:

  %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

because the constant string was just copied into the specified memory buffer.
This, in turn, can be constant folded to "4".

In other code, it contains:

  %endptr6978 = bitcast i8* %endptr69 to i32*
  store i32 7107374, i32* %endptr6978, align 1
  %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

which could also be constant folded: on a little-endian target the i32
7107374 is 0x006C732E, i.e. the bytes ".sl\0", so the strlen is 3.  Whatever
is producing this should probably be fixed to leave this as a memcpy from a
string.

Further, eon also has an interesting partially redundant strlen call:

bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit
  %682 = getelementptr i8** %argv, i32 6          ; <i8**> [#uses=2]
  %683 = load i8** %682, align 4                  ; <i8*> [#uses=4]
  %684 = load i8* %683, align 1                   ; <i8> [#uses=1]
  %685 = icmp eq i8 %684, 0                       ; <i1> [#uses=1]
  br i1 %685, label %bb10, label %bb9

bb9:            ; preds = %bb8
  %686 = call i32 @strlen(i8* %683) nounwind readonly
  %687 = icmp ugt i32 %686, 254                   ; <i1> [#uses=1]
  br i1 %687, label %bb10, label %bb11

bb10:           ; preds = %bb9, %bb8
  %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

//===---------------------------------------------------------------------===//

I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
which looks like:
  %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0

bb62:           ; preds = %bb55, %bb53
  %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
  %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
  %172 = add i32 %171, -1                         ; <i32> [#uses=1]
  %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172

... no stores ...
  br i1 %or.cond, label %bb65, label %bb72

bb65:           ; preds = %bb62
  store i8 0, i8* %173, align 1
  br label %bb72

bb72:           ; preds = %bb65, %bb62
  %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
  %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1

Note that on the bb62->bb72 path, the %177 strlen call is partially redundant
with the %171 call.  At worst, we could shove the %177 strlen call up into the
bb65 block, moving it out of the bb62->bb72 path.  However, note that bb65
stores to the string, zeroing out the last byte.  This means that on that path
the value of %177 is actually just %171-1.  A sub is cheaper than a strlen!

This pattern repeats several times, basically doing:

  A = strlen(P);
  P[A-1] = 0;
  B = strlen(P);
  where it is "obvious" that B = A-1.

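In C terms, the rewrite would look like this (a sketch; reusing A is valid
because the store only shortens the string by one character):

#include <string.h>

/* Sketch: truncate the last character and reuse the first strlen result
   instead of rescanning the whole string. */
static size_t chop(char *P) {
  size_t A = strlen(P);
  P[A - 1] = '\0';       /* assumes A > 0, as the original pattern does */
  return A - 1;          /* == strlen(P) after the store */
}
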
//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
        i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
        i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call @printf(i8* ...   @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

  memcpy(globalarray, "string");
  printf(..., globalarray);

Anyway, by knowing that printf just reads the memory and by forward
substituting the string directly into the printf, this eliminates reads from
globalarray.  Since this pattern occurs frequently in crafty (due to the
"DisplayTime" and other similar functions) there are many stores to "out".
Once all the printfs stop using "out", all that is left is the memcpy calls
into it.  This should allow globalopt to remove the "stored only" global.

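A toy version of the before/after (my sketch; the names are made up):

#include <stdio.h>
#include <string.h>

static char out[10];

void before(void) {
  memcpy(out, "string", 7);
  printf("%s", out);        /* printf reads the global */
}

void after(void) {
  memcpy(out, "string", 7); /* dead once every reader is rewritten */
  printf("%s", "string");   /* string forwarded; out becomes store-only */
}
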
//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
  %tmp0 = load i8* %p
  %tmp1 = ashr i8 %tmp0, 5
  %tmp2 = sext i8 %tmp1 to i32
  ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load with a shift.
For example, on x86 this currently gets this:

        movb    (%eax), %al
        sarb    $5, %al
        movsbl  %al, %eax

while it could get this:

        movsbl  (%eax), %eax
        sarl    $5, %eax

//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; }     // --> return false
int test2(int x) { return 2-x == x; }    // --> return x == 1 ?

Always foldable for odd constants: C-x == x means 2*x == C (mod 2^32), which
has no solution when C is odd.  For even C there are two solutions, C/2 and
C/2 + 2^31, so folding to x == C/2 alone is only valid when the subtraction is
known not to wrap (e.g. nsw).

//===---------------------------------------------------------------------===//

PR 3381: GEP to field of size 0 inside a struct could be turned into GEP
for the next field in the struct (which is at the same address).

For example: a store of float into { {{}}, float } could be turned into a
store to the float directly.

//===---------------------------------------------------------------------===//

The arg promotion pass should make use of nocapture to make its alias analysis
much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int int_char(char m) {if(m>7) return 0; return m;}

//===---------------------------------------------------------------------===//

int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }

Generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %a, 128                            ; <i32> [#uses=1]
  %1 = icmp eq i32 %0, 0                          ; <i1> [#uses=1]
  %2 = or i32 %b, 128                             ; <i32> [#uses=1]
  %3 = and i32 %b, -129                           ; <i32> [#uses=1]
  %b_addr.0 = select i1 %1, i32 %3, i32 %2        ; <i32> [#uses=1]
  ret i32 %b_addr.0
}

However, it's functionally equivalent to:

  b = (b & ~0x80) | (a & 0x80);

Which generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %b, -129                           ; <i32> [#uses=1]
  %1 = and i32 %a, 128                            ; <i32> [#uses=1]
  %2 = or i32 %0, %1                              ; <i32> [#uses=1]
  ret i32 %2
}

This can be generalized for other forms:

  b = (b & ~0x80) | (a & 0x40) << 1;

//===---------------------------------------------------------------------===//

These two functions produce different code.  They shouldn't:

#include <stdint.h>

uint8_t p1(uint8_t b, uint8_t a) {
  b = (b & ~0xc0) | (a & 0xc0);
  return (b);
}

uint8_t p2(uint8_t b, uint8_t a) {
  b = (b & ~0x40) | (a & 0x40);
  b = (b & ~0x80) | (a & 0x80);
  return (b);
}

define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63                              ; <i8> [#uses=1]
  %1 = and i8 %a, -64                             ; <i8> [#uses=1]
  %2 = or i8 %1, %0                               ; <i8> [#uses=1]
  ret i8 %2
}

define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63                              ; <i8> [#uses=1]
  %.masked = and i8 %a, 64                        ; <i8> [#uses=1]
  %1 = and i8 %a, -128                            ; <i8> [#uses=1]
  %2 = or i8 %1, %0                               ; <i8> [#uses=1]
  %3 = or i8 %2, %.masked                         ; <i8> [#uses=1]
  ret i8 %3
}

//===---------------------------------------------------------------------===//

IPSCCP does not currently propagate argument-dependent constants through
functions where it does not see all of the callers.  This includes functions
with normal external linkage as well as templates, C99 inline functions etc.
Specifically, it does nothing to:

define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
entry:
  %0 = add nsw i32 %y, %z
  %1 = mul i32 %0, %x
  %2 = mul i32 %y, %z
  %3 = add nsw i32 %1, %2
  ret i32 %3
}

define i32 @test2() nounwind {
entry:
  %0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
  ret i32 %0
}

It would be interesting to extend IPSCCP to be able to handle simple cases like
this, where all of the arguments to a call are constant.  Because IPSCCP runs
before inlining, trivial templates and inline functions are not yet inlined.
The results for a function + set of constant arguments should be memoized in a
map.

//===---------------------------------------------------------------------===//

The libcall constant folding code should be moved out of SimplifyLibcalls into
libanalysis' constant folding logic.  This would allow IPSCCP to handle simple
things like this:

static int foo(const char *X) { return strlen(X); }
int bar() { return foo("abcd"); }

//===---------------------------------------------------------------------===//

InstCombine should use SimplifyDemandedBits to remove the or instruction:

define i1 @test(i8 %x, i8 %y) {
  %A = or i8 %x, 1
  %B = icmp ugt i8 %A, 3
  ret i1 %B
}

Currently instcombine calls SimplifyDemandedBits with either all bits or just
the sign bit, if the comparison is obviously a sign test.  In this case, we only
need all but the bottom two bits from %A, and if we gave that mask to SDB it
would delete the or instruction for us.

//===---------------------------------------------------------------------===//

functionattrs doesn't know much about memcpy/memset.  This function should be
marked readnone rather than readonly, since it only twiddles local memory, but
functionattrs doesn't handle memset/memcpy/memmove aggressively:

struct X { int *p; int *q; };
int foo() {
  int i = 0, j = 1;
  struct X x, y;
  int **p;
  y.p = &i;
  x.q = &j;
  p = __builtin_memcpy (&x, &y, sizeof (int *));
  return **p;
}

//===---------------------------------------------------------------------===//

Missed instcombine transformation:

define i1 @a(i32 %x) nounwind readnone {
entry:
  %cmp = icmp eq i32 %x, 30
  %sub = add i32 %x, -30
  %cmp2 = icmp ugt i32 %sub, 9
  %or = or i1 %cmp, %cmp2
  ret i1 %or
}

This should be optimized to a single compare.  Testcase derived from gcc.

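A plausible single-compare form (my arithmetic, not from the source: the true
set is everything outside [31,39]):

/* Sketch: (x == 30) | ((unsigned)(x - 30) > 9) is one range check. */
int a(int x) { return (unsigned)x - 31 > 8; }
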
//===---------------------------------------------------------------------===//

Missed instcombine or reassociate transformation:
int a(int a, int b) { return (a==12)&(b>47)&(b<58); }

The sgt and slt should be combined into a single comparison.  Testcase derived
from gcc.

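The combined form would be a single unsigned range check (a sketch, my
arithmetic):

/* Sketch: 47 < b && b < 58  ==  (unsigned)(b - 48) < 10 */
int a(int a, int b) { return (a == 12) & ((unsigned)b - 48 < 10); }
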
//===---------------------------------------------------------------------===//

Missed instcombine transformation:

  %382 = srem i32 %tmp14.i, 64                    ; [#uses=1]
  %383 = zext i32 %382 to i64                     ; [#uses=1]
  %384 = shl i64 %381, %383                       ; [#uses=1]
  %385 = icmp slt i32 %tmp14.i, 64                ; [#uses=1]

The srem can be transformed to an and because if %tmp14.i is negative, the
srem result is negative, the zext'd shift amount is at least 2^31, and an i64
shift by more than 63 is undefined.  Testcase derived from 403.gcc.

//===---------------------------------------------------------------------===//

This is a range comparison on a divided result (from 403.gcc):

  %1337 = sdiv i32 %1336, 8                       ; [#uses=1]
  %.off.i208 = add i32 %1336, 7                   ; [#uses=1]
  %1338 = icmp ult i32 %.off.i208, 15             ; [#uses=1]

We already catch this (removing the sdiv) if there isn't an add; we should
handle the 'add' as well.  This is a common idiom with its builtin_alloca code.
C testcase:

int a(int x) { return (unsigned)(x/16+7) < 15; }

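For reference, a divide-free equivalent of the testcase (my arithmetic,
assuming 32-bit int; the true set works out to x in [-127, 127]):

/* Sketch: (unsigned)(x/16 + 7) < 15 holds exactly when -127 <= x <= 127,
   which is one unsigned range check (the cast avoids signed overflow). */
int a(int x) { return (unsigned)x + 127 < 255; }
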
Another similar case involves truncations on 64-bit targets:

  %361 = sdiv i64 %.046, 8                        ; [#uses=1]
  %362 = trunc i64 %361 to i32                    ; [#uses=2]
...
  %367 = icmp eq i32 %362, 0                      ; [#uses=1]

//===---------------------------------------------------------------------===//

Missed instcombine/dagcombine transformation:
define void @lshift_lt(i8 zeroext %a) nounwind {
entry:
  %conv = zext i8 %a to i32
  %shl = shl i32 %conv, 3
  %cmp = icmp ult i32 %shl, 33
  br i1 %cmp, label %if.then, label %if.end

if.then:
  tail call void @bar() nounwind
  ret void

if.end:
  ret void
}
declare void @bar() nounwind

The shift should be eliminated: since %conv is at most 255, the shift cannot
overflow, so the compare is simply icmp ult i32 %conv, 5.  Testcase derived
from gcc.

//===---------------------------------------------------------------------===//

These compile into different code: one gets recognized as a switch and the
other doesn't, due to phase ordering issues (PR6212):

int test1(int mainType, int subType) {
  if (mainType == 7)
    subType = 4;
  else if (mainType == 9)
    subType = 6;
  else if (mainType == 11)
    subType = 9;
  return subType;
}

int test2(int mainType, int subType) {
  if (mainType == 7)
    subType = 4;
  if (mainType == 9)
    subType = 6;
  if (mainType == 11)
    subType = 9;
  return subType;
}

//===---------------------------------------------------------------------===//

The following test case (from PR6576):

define i32 @mul(i32 %a, i32 %b) nounwind readnone {
entry:
  %cond1 = icmp eq i32 %b, 0                      ; <i1> [#uses=1]
  br i1 %cond1, label %exit, label %bb.nph
bb.nph:                                           ; preds = %entry
  %tmp = mul i32 %b, %a                           ; <i32> [#uses=1]
  ret i32 %tmp
exit:                                             ; preds = %entry
  ret i32 0
}

could be reduced to:

define i32 @mul(i32 %a, i32 %b) nounwind readnone {
entry:
  %tmp = mul i32 %b, %a
  ret i32 %tmp
}

//===---------------------------------------------------------------------===//

We should use DSE + llvm.lifetime.end to delete dead vtable pointer updates.
See GCC PR34949

Another interesting case: something similar could be done for variables that
become const after their ctor has finished.  In these cases, globalopt (which
can statically run the constructor) could mark the global const (so it gets put
in the readonly section).  A testcase would be:

#include <complex>
using namespace std;
const complex<char> should_be_in_rodata (42,-42);
complex<char> should_be_in_data (42,-42);
complex<char> should_be_in_bss;

We currently evaluate the ctors, but the globals don't become const because the
optimizer doesn't know they "become const" after the ctor is done.  See
GCC PR4131 for more examples.

//===---------------------------------------------------------------------===//

In this code:

long foo(long x) {
  return x > 1 ? x : 1;
}

LLVM emits a comparison with 1 instead of 0.  0 would be equivalent
and cheaper on most targets.

LLVM prefers comparisons with zero over non-zero in general, but in this
case it chooses instead to keep the max operation obvious.

//===---------------------------------------------------------------------===//

Take the following testcase on x86-64 (similar testcases exist for all targets
with addc/adde):

define void @a(i64* nocapture %s, i64* nocapture %t, i64 %a, i64 %b,
i64 %c) nounwind {
entry:
  %0 = zext i64 %a to i128                        ; <i128> [#uses=1]
  %1 = zext i64 %b to i128                        ; <i128> [#uses=1]
  %2 = add i128 %1, %0                            ; <i128> [#uses=2]
  %3 = zext i64 %c to i128                        ; <i128> [#uses=1]
  %4 = shl i128 %3, 64                            ; <i128> [#uses=1]
  %5 = add i128 %4, %2                            ; <i128> [#uses=1]
  %6 = lshr i128 %5, 64                           ; <i128> [#uses=1]
  %7 = trunc i128 %6 to i64                       ; <i64> [#uses=1]
  store i64 %7, i64* %s, align 8
  %8 = trunc i128 %2 to i64                       ; <i64> [#uses=1]
  store i64 %8, i64* %t, align 8
  ret void
}

Generated code:
        addq    %rcx, %rdx
        movl    $0, %eax
        adcq    $0, %rax
        addq    %r8, %rax
        movq    %rax, (%rdi)
        movq    %rdx, (%rsi)
        ret

Expected code:
        addq    %rcx, %rdx
        adcq    $0, %r8
        movq    %r8, (%rdi)
        movq    %rdx, (%rsi)
        ret

The generated SelectionDAG has an ADD of an ADDE, where both operands of the
ADDE are zero.  Replacing one of the operands of the ADDE with the other operand
of the ADD, and replacing the ADD with the ADDE, should give the desired result.

(That said, we are doing a lot better than gcc on this testcase. :) )

//===---------------------------------------------------------------------===//

Switch lowering generates less than ideal code for the following switch:
define void @a(i32 %x) nounwind {
entry:
  switch i32 %x, label %if.end [
    i32 0, label %if.then
    i32 1, label %if.then
    i32 2, label %if.then
    i32 3, label %if.then
    i32 5, label %if.then
  ]
if.then:
  tail call void @foo() nounwind
  ret void
if.end:
  ret void
}
declare void @foo()

Generated code on x86-64 (other platforms give similar results):
a:
        cmpl    $5, %edi
        ja      .LBB0_2
        movl    %edi, %eax
        movl    $47, %ecx
        btq     %rax, %rcx
        jb      .LBB0_3
.LBB0_2:
        ret
.LBB0_3:
        jmp     foo  # TAILCALL

The movl+movl+btq+jb could be simplified to a cmpl+jne.

Or, if we wanted to be really clever, we could simplify the whole thing to
something like the following, which eliminates a branch:
        xorl    $1, %edi
        cmpl    $4, %edi
        ja      .LBB0_2
        ret
.LBB0_2:
        jmp     foo  # TAILCALL
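
In C terms, the clever version works because the case set {0,1,2,3,5} maps
under x^1 to the contiguous range [0,4] (a sketch, checked by enumeration):

void foo(void);

/* Sketch: {0,1,2,3,5} ^ 1 == {0,1,2,3,4}, so one unsigned compare suffices;
   4 maps to 5, which correctly stays outside the range. */
void a(unsigned x) {
  if ((x ^ 1) <= 4)
    foo();
}
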
//===---------------------------------------------------------------------===//

Given a branch where the two target blocks are identical ("ret i32 %b" in
both), simplifycfg will simplify them away.  But not so for a switch statement:

define i32 @f(i32 %a, i32 %b) nounwind readnone {
entry:
  switch i32 %a, label %bb3 [
    i32 4, label %bb
    i32 6, label %bb
  ]

bb:             ; preds = %entry, %entry
  ret i32 %b

bb3:            ; preds = %entry
  ret i32 %b
}

//===---------------------------------------------------------------------===//

clang -O3 fails to devirtualize this virtual inheritance case (GCC PR45875);
looks related to PR3100:

struct c1 {};
struct c10 : c1{
  virtual void foo ();
};
struct c11 : c10, c1{
  virtual void f6 ();
};
struct c28 : virtual c11{
  void f6 ();
};
void check_c28 () {
  c28 obj;
  c11 *ptr = &obj;
  ptr->f6 ();
}

//===---------------------------------------------------------------------===//

We compile this:

int foo(int a) { return (a & (~15)) / 16; }

Into:

define i32 @foo(i32 %a) nounwind readnone ssp {
entry:
  %and = and i32 %a, -16
  %div = sdiv i32 %and, 16
  ret i32 %div
}

but (X & -A)/A is X >> log2(A) when A is a power of 2, so this case should be
instcombined into just "a >> 4".

We do get this at the codegen level, so something knows about it, but
instcombine should catch it earlier:

_foo:                                   ## @foo
## BB#0:                                ## %entry
        movl    %edi, %eax
        sarl    $4, %eax
        ret

//===---------------------------------------------------------------------===//

This code (from GCC PR28685):

int test(int a, int b) {
  int lt = a < b;
  int eq = a == b;
  if (lt)
    return 1;
  return eq;
}

Is compiled to:

define i32 @test(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %cmp = icmp slt i32 %a, %b
  br i1 %cmp, label %return, label %if.end

if.end:                                           ; preds = %entry
  %cmp5 = icmp eq i32 %a, %b
  %conv6 = zext i1 %cmp5 to i32
  ret i32 %conv6

return:                                           ; preds = %entry
  ret i32 1
}

it could be:

define i32 @test__(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = icmp sle i32 %a, %b
  %retval = zext i1 %0 to i32
  ret i32 %retval
}

//===---------------------------------------------------------------------===//