Target Independent Opportunities:

//===---------------------------------------------------------------------===//

We should recognize idioms for add-with-carry and turn it into the appropriate
intrinsics.  This example:

unsigned add32carry(unsigned sum, unsigned x) {
  unsigned z = sum + x;
  if (sum + x < x)
    z++;
  return z;
}

Compiles to (clang t.c -S -o - -O3 -fomit-frame-pointer -m64 -mkernel):

_add32carry:                            ## @add32carry
        addl    %esi, %edi
        cmpl    %esi, %edi
        sbbl    %eax, %eax
        andl    $1, %eax
        addl    %edi, %eax
        ret

with clang, but to:

_add32carry:
        leal    (%rsi,%rdi), %eax
        cmpl    %esi, %eax
        adcl    $0, %eax
        ret

with gcc.
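
A sketch of the IR we would presumably want to form, using the existing
unsigned-overflow intrinsic (assuming it is the right vehicle for the carry
bit):

declare { i32, i1 } @llvm.uadd.with.overflow.i32(i32, i32)

define i32 @add32carry(i32 %sum, i32 %x) nounwind readnone {
  %uadd = call { i32, i1 } @llvm.uadd.with.overflow.i32(i32 %sum, i32 %x)
  %z = extractvalue { i32, i1 } %uadd, 0
  %carry = extractvalue { i32, i1 } %uadd, 1
  %c = zext i1 %carry to i32
  %r = add i32 %z, %c
  ret i32 %r
}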

//===---------------------------------------------------------------------===//

Dead argument elimination should be enhanced to handle cases when an argument
is dead to an externally visible function.  Though the argument can't be
removed from the externally visible function, the caller doesn't need to pass
it in.  For example, in this testcase:

  void foo(int X) __attribute__((noinline));
  void foo(int X) { sideeffect(); }
  void bar(int A) { foo(A+1); }

We compile bar to:

define void @bar(i32 %A) nounwind ssp {
  %0 = add nsw i32 %A, 1                          ; <i32> [#uses=1]
  tail call void @foo(i32 %0) nounwind noinline ssp
  ret void
}

The add is dead; we could pass in 'i32 undef' instead.  This occurs for C++
templates etc, which usually have linkonce_odr/weak_odr linkage, not internal
linkage.
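
With that enhancement, the caller would compile to something like:

define void @bar(i32 %A) nounwind ssp {
  tail call void @foo(i32 undef) nounwind noinline ssp
  ret void
}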

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call'
instructions so that the .td files don't list all the call-clobbered registers
as implicit defs.  Instead, these should be added by the code generator (e.g.
on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call
   instructions for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber
   sets of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent.

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)  This
isn't safe in general, even on darwin.  See the libm implementation of hypot
for examples (which special case when x/y are exactly zero to get signed zeros
etc right).

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack.  But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
  x = 1ULL << i;
}

into:

long long tmp = 1;
for (i = ...; ++i, tmp += tmp)
  x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

First, the intrinsic needs to be extended to support integers, and second the
code generator needs to be enhanced to lower these to multiplication trees.
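
For reference, the existing floating-point form of the intrinsic handles the
analogous FP case like this (the integer variant proposed above does not
exist yet):

declare double @llvm.powi.f64(double, i32)

define double @factorial_fp(double %X) {
  %r = call double @llvm.powi.f64(double %X, i32 8)
  ret double %r
}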

//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

This is blocked on not handling X*X*X -> powi(X, 3) (see note above).  The
issue is that we end up getting t = 2*X, s = t*t, and don't turn this into
4*X*X, which is the same number of multiplies and is canonical, because the
2*X has multiple uses.  Here's a simple example:

define i32 @test15(i32 %X1) {
  %B = mul i32 %X1, 47   ; X1*47
  %C = mul i32 %B, %B
  ret i32 %C
}

//===---------------------------------------------------------------------===//

Reassociate should handle the example in GCC PR16157:

extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;
void f () {  /* this can be optimized to four additions... */
  b4 = a4 + a3 + a2 + a1 + a0;
  b3 = a3 + a2 + a1 + a0;
  b2 = a2 + a1 + a0;
  b1 = a1 + a0;
}

This requires reassociating to forms of expressions that are already available,
something that reassoc doesn't think about yet.
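
That is, reusing each partial sum, something like:

void f () {
  b1 = a1 + a0;
  b2 = b1 + a2;
  b3 = b2 + a3;
  b4 = b3 + a4;
}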

//===---------------------------------------------------------------------===//

This function: (derived from GCC PR19988)
double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x + -0.1234 * y));
}

compiles to:
_foo:
        movapd  %xmm1, %xmm2
        mulsd   LCPI1_1(%rip), %xmm1
        mulsd   LCPI1_0(%rip), %xmm2
        addsd   %xmm0, %xmm1
        addsd   %xmm0, %xmm2
        movapd  %xmm1, %xmm0
        mulsd   %xmm2, %xmm0
        ret

Reassociate should be able to turn it into:

double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x - 0.1234 * y));
}

which allows the multiply by constant to be CSE'd, producing:

_foo:
        mulsd   LCPI1_0(%rip), %xmm1
        movapd  %xmm1, %xmm2
        addsd   %xmm0, %xmm2
        subsd   %xmm1, %xmm0
        mulsd   %xmm2, %xmm0
        ret

This doesn't need -ffast-math support at all.  This is particularly bad
because the llvm-gcc frontend is canonicalizing the latter into the former,
but clang doesn't have this problem.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) { return memcmp(j,l,4); }
int h(int *j, int *l) { return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is
equal to the type size.  It works but can be overly conservative as the
alignment of specific vector types is target dependent.

//===---------------------------------------------------------------------===//
256
Dan Gohman1f3be1a2009-05-11 18:51:16 +0000257We should produce an unaligned load from code like this:
Chris Lattnereaa7c062006-04-01 04:08:29 +0000258
259v4sf example(float *P) {
260 return (v4sf){P[0], P[1], P[2], P[3] };
261}
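
i.e. an IR sketch of the desired result (assuming v4sf is a plain
<4 x float> vector type):

define <4 x float> @example(float* %P) {
  %p = bitcast float* %P to <4 x float>*
  %v = load <4 x float>* %p, align 4
  ret <4 x float> %v
}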

//===---------------------------------------------------------------------===//

Add support for conditional increments and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2 #cond_next
LBB16_1: #cond_true
        incl _foo
LBB16_2: #cond_next

emit:
        movl _foo, %eax
        cmpl $1, %edi
        sbbl $-1, %eax
        movl %eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs.  We could even make an intrinsic for this
if anyone cared enough about sincos.

//===---------------------------------------------------------------------===//
Chris Lattnerf00f68a2006-05-19 21:01:38 +0000297
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000298quantum_sigma_x in 462.libquantum contains the following loop:
299
300 for(i=0; i<reg->size; i++)
301 {
302 /* Flip the target bit of each basis state */
303 reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
304 }
305
306Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be just
307so cool to turn it into something like:
308
Chris Lattnerb33a42a2006-09-18 04:54:35 +0000309 long long Res = ((MAX_UNSIGNED) 1 << target);
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000310 if (target < 32) {
311 for(i=0; i<reg->size; i++)
Chris Lattnerb33a42a2006-09-18 04:54:35 +0000312 reg->node[i].state ^= Res & 0xFFFFFFFFULL;
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000313 } else {
314 for(i=0; i<reg->size; i++)
Chris Lattnerb33a42a2006-09-18 04:54:35 +0000315 reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000316 }
317
318... which would only do one 32-bit XOR per loop iteration instead of two.
319
320It would also be nice to recognize the reg->size doesn't alias reg->node[i], but
Chris Lattner9c6a0dc2009-11-26 01:51:18 +0000321this requires TBAA.
Chris Lattnerfaa6adf2009-09-21 06:04:07 +0000322
323//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
  unsigned t;
  t = v ^ ((v << 16) | (v >> 16));
  t &= ~0xff0000;
  v = (v << 24) | (v >> 8);
  return v ^ (t >> 8);
}

Neither is this (very standard idiom):

int f(int n)
{
  return (((n) << 24) | (((n) & 0xff00) << 8)
       | (((n) >> 8) & 0xff00) | ((n) >> 24));
}
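
Both should reduce to the llvm.bswap.i32 intrinsic, i.e. the moral equivalent
of (GCC/Clang builtin shown for illustration only):

unsigned bswap32(unsigned n) {
  return __builtin_bswap32(n);
}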

//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

BITBOARD = unsigned long long
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

This is a form of idiom recognition for loops, the same thing that could be
useful for recognizing memset/memcpy.

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}
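
For example, on a little-endian target read_16_le could become something like
(IR sketch; the ABI zero-extension of the i16 return is omitted):

define i16 @read_16_le(i8* %adr) {
  %p = bitcast i8* %adr to i16*
  %v = load i16* %p, align 1
  ret i16 %v
}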

//===---------------------------------------------------------------------===//

instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level.  We need a "loops to memcpy"
pass.

//===---------------------------------------------------------------------===//

[LOOP OPTIMIZATION]

SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization
opportunities in its double_array_divs_variable function: it needs loop
interchange, memory promotion (which LICM already does), vectorization and
variable trip count loop unrolling (since it has a constant trip count). ICC
apparently produces this very nice code with -ffast-math:

..B1.70:                        # Preds ..B1.70 ..B1.69
        mulpd     %xmm0, %xmm1                            #108.2
        mulpd     %xmm0, %xmm1                            #108.2
        mulpd     %xmm0, %xmm1                            #108.2
        mulpd     %xmm0, %xmm1                            #108.2
        addl      $8, %edx                                #
        cmpl      $131072, %edx                           #108.2
        jb        ..B1.70       # Prob 99%                #108.2

It would be better to count down to zero, but this is a lot better than what
we do.

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
  U64 effective_addr2;
  U32 temp = *inst;
  int r1 = (temp >> 20) & 0xf;
  int b2 = (temp >> 16) & 0xf;
  effective_addr2 = temp & 0xfff;
  if (b2) effective_addr2 += regs[b2];
  b2 = (temp >> 12) & 0xf;
  if (b2) effective_addr2 += regs[b2];
  effective_addr2 &= regs[4];
  if ((effective_addr2 & 3) == 0)
    return 1;
  return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

PHI Slicing could be extended to do this.

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has from TargetData.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two near-identical IVs (after promotion) on PPC/ARM:

LBB1_2:
        ldr r3, LCPI1_0
        ldr r3, [r3]
        strh r2, [r3]
        ldr r3, LCPI1_1
        ldr r3, [r3]
        strh r1, [r3]
        add r1, r1, #4
        add r2, r2, #1   <- [0,+,1]
        sub r0, r0, #1   <- [0,-,1]
        cmp r0, #0
        bne LBB1_2

LSR should reuse the "+" IV for the exit test.

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
        %tmp.1 = and i32 %a, 1          ; <i32> [#uses=1]
        %tmp.2 = icmp ne i32 %tmp.1, 0          ; <i1> [#uses=1]
        br i1 %tmp.2, label %then.0, label %else.0

then.0:         ; preds = %entry
        %tmp.5 = add i32 %a, -1         ; <i32> [#uses=1]
        %tmp.3 = call i32 @t4( i32 %tmp.5 )             ; <i32> [#uses=1]
        br label %return

else.0:         ; preds = %entry
        %tmp.7 = icmp ne i32 %a, 0              ; <i1> [#uses=1]
        br i1 %tmp.7, label %then.1, label %return

then.1:         ; preds = %else.0
        %tmp.11 = add i32 %a, -2                ; <i32> [#uses=1]
        %tmp.9 = call i32 @t4( i32 %tmp.11 )            ; <i32> [#uses=1]
        br label %return

return:         ; preds = %then.1, %else.0, %then.0
        %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                            [ %tmp.9, %then.1 ]
        ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative.  "return foo() << 1" can be tail recursion eliminated.
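
One way TRE could handle pow2m1 is with an accumulator; a hand-derived sketch
of the iterative form (not what any current pass produces):

int pow2m1(int n) {
  int acc = 0;
  for (; n != 0; --n)
    acc = 2 * acc + 1;   /* applies 2*f(n-1) + 1 iteratively */
  return acc;
}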

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
        %tmp = load i32* %x             ; <i32> [#uses=0]
        %tmp.foo = call i32 @foo( i32* %x )             ; <i32> [#uses=1]
        ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
        %tmp3 = call i32 @foo( i32* %x )                ; <i32> [#uses=1]
        ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
        subl    $28, %esp
        call    "L1$pb"
"L1$pb":
        popl    %eax
        cmpl    $0, 32(%esp)
        je      LBB1_2  # cond_true
LBB1_1: # return
        # ...
        addl    $28, %esp
        ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations.  On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar.  On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable.  For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:
        pushl %esi
        movl 8(%esp), %eax
        movb _m_HotKey+3, %cl
        movb _m_HotKey+4, %dl
        movb _m_HotKey+2, %ch
        movw _m_HotKey, %si
        movw %si, (%eax)
        movb %ch, 2(%eax)
        movb %cl, 3(%eax)
        movb %dl, 4(%eax)
        popl %esi
        ret $4

GCC produces:

__Z9GetHotKeyv:
        movl _m_HotKey, %edx
        movl 4(%esp), %eax
        movl %edx, (%eax)
        movzwl _m_HotKey+4, %edx
        movw %dx, 4(%eax)
        ret $4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

  %struct.THotKey = type { i16, i8, i8, i8 }
define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
...
  %tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
  %tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
  %tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
  %tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,1,1,1,1,1,1,1};
  foo(input);
}

We currently compile this into a memcpy from a global array since the
initializer is fairly large and not memset'able.  This is good, but the memcpy
gets lowered to load/stores in the code generator.  This is also ok, except
that the codegen lowering for memcpy doesn't handle the case when the source
is a constant global.  This gives us atrocious code like this:

        call "L1$pb"
"L1$pb":
        popl %eax
        movl _C.0.1444-"L1$pb"+32(%eax), %ecx
        movl %ecx, 40(%esp)
        movl _C.0.1444-"L1$pb"+20(%eax), %ecx
        movl %ecx, 28(%esp)
        movl _C.0.1444-"L1$pb"+36(%eax), %ecx
        movl %ecx, 44(%esp)
        movl _C.0.1444-"L1$pb"+44(%eax), %ecx
        movl %ecx, 52(%esp)
        movl _C.0.1444-"L1$pb"+40(%eax), %ecx
        movl %ecx, 48(%esp)
        movl _C.0.1444-"L1$pb"+12(%eax), %ecx
        movl %ecx, 20(%esp)
        movl _C.0.1444-"L1$pb"+4(%eax), %ecx
...

instead of:
        movl $1, 16(%esp)
        movl $0, 20(%esp)
        movl $1, 24(%esp)
        movl $0, 28(%esp)
        movl $1, 32(%esp)
        movl $0, 36(%esp)
        ...

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef".  Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop.  One trivial example is:

#include <stdio.h>
int main() {
  int nRet = 17;
  int nLoop;
  for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
    if ( nLoop & 1 )
      nRet += 2;
    else
      nRet -= 1;
  }
  return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size.  The resultant code would then also be suitable for
exit value computation.
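
Roughly, after unrolling by 2 and folding the parity test, the loop would
become:

  for ( nLoop = 0; nLoop < 1000; nLoop += 2 ) {
    nRet -= 1;   /* even iteration */
    nRet += 2;   /* odd iteration */
  }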

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc,
x86, etc.  On X86, we miss a bunch of 'rotate by variable' cases because the
rotate matching code in dag combine doesn't look through truncates
aggressively enough.  Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y){
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x){
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x){
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f2/f3/f4 right.  On x86-32, a few of these
generate truly horrible code, instead of using shld and friends.  On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness.  PPC64 misses f, f5 and f6.  CellSPU aborts in isel.

//===---------------------------------------------------------------------===//

This (and similar related idioms):

unsigned int foo(unsigned char i) {
  return i | (i<<8) | (i<<16) | (i<<24);
}

compiles into:

define i32 @foo(i8 zeroext %i) nounwind readnone ssp noredzone {
entry:
  %conv = zext i8 %i to i32
  %shl = shl i32 %conv, 8
  %shl5 = shl i32 %conv, 16
  %shl9 = shl i32 %conv, 24
  %or = or i32 %shl9, %conv
  %or6 = or i32 %or, %shl5
  %or10 = or i32 %or6, %shl
  ret i32 %or10
}

it would be better as:

unsigned int bar(unsigned char i) {
  unsigned int j=i | (i << 8);
  return j | (j<<16);
}

aka:

define i32 @bar(i8 zeroext %i) nounwind readnone ssp noredzone {
entry:
  %conv = zext i8 %i to i32
  %shl = shl i32 %conv, 8
  %or = or i32 %shl, %conv
  %shl5 = shl i32 %or, 16
  %or6 = or i32 %shl5, %or
  ret i32 %or6
}

or even i*0x01010101, depending on the speed of the multiplier.  The best way
to handle this is to canonicalize it to a multiply in IR and have codegen
handle lowering multiplies to shifts on cpus where shifts are faster.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together.  For
example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy.  This can only
be done safely if "b" isn't modified between the strlen and memcpy of course.

//===---------------------------------------------------------------------===//

We compile this program (from GCC PR11680):
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%    0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%    0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2).  */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper).  Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

This is equivalent to the following, where 2863311531 is the multiplicative
inverse of 3, and 1431655766 is ((2^32)-1)/3+1:
void bar(unsigned n) {
  if (n * 2863311531U < 1431655766U)
    true();
}

The same transformation can work with an even modulo with the addition of a
rotate: rotate the result of the multiply to the right by the number of bits
which need to be zero for the condition to be true, and shrink the compare
RHS by the same amount.  Unless the target supports rotates, though, that
transformation probably isn't worthwhile.
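
For example, "n % 6 == 0" might become (a sketch following the same recipe;
constants hand-derived, not taken from an implementation):

void bar6(unsigned n) {
  unsigned q = n * 2863311531U;      /* multiply by inverse of 3 */
  q = (q >> 1) | (q << 31);          /* rotate right by 1 */
  if (q < 715827883U)                /* ((2^32)-1)/6+1 */
    true();
}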

The transformation can also easily be made to work with non-zero equality
comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0".

//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6       ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0.  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
  if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
    f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing.  Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
  return (abs(x)) >= 0;
}
This should optimize to x == INT_MIN.  (With -fwrapv.)  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
  a = (a << 10) | (a >> 22);
  if (a == 123)
    bar ();
}
void
minus_cst (unsigned int a)
{
  unsigned int tem;

  tem = 20 - a;
  if (tem == 5)
    bar ();
}
void
mask_gt (unsigned int a)
{
  /* This is equivalent to a > 15.  */
  if ((a & ~7) > 8)
    bar ();
}
void
rshift_gt (unsigned int a)
{
  /* This is equivalent to a > 23.  */
  if ((a >> 2) > 5)
    bar ();
}
All should simplify to a single comparison.  All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)".  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a".  Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane.  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c).  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

        %tmp = icmp eq i32 %decl_context, 4
        %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
        %tmp1 = icmp eq i32 %decl_context_addr.0, 1
        %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
        (!tmp || decl_context == 1)

This allows recursive simplifications; tmp1 is used all over the place in
the function, e.g. by:

        %tmp23 = icmp eq i32 %decl_context_addr.1, 0    ; <i1> [#uses=1]
        %tmp24 = xor i1 %tmp1, true                     ; <i1> [#uses=1]
        %or.cond8 = and i1 %tmp23, %tmp24               ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

[STORE SINKING]

Store sinking: This code:

void f (int n, int *cond, int *res) {
  int i;
  *res = 0;
  for (i = 0; i < n; i++)
    if (*cond)
      *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out.  This gives us this code:

bb:             ; preds = %bb2, %entry
        %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
        %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
        %1 = load i32* %cond, align 4
        %2 = icmp eq i32 %1, 0
        br i1 %2, label %bb2, label %bb1

bb1:            ; preds = %bb
        %3 = xor i32 %.rle, 234
        store i32 %3, i32* %res, align 4
        br label %bb2

bb2:            ; preds = %bb, %bb1
        %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
        %indvar.next = add i32 %i.05, 1
        %exitcond = icmp eq i32 %indvar.next, %n
        br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.
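
At the source level, the sunk form would be roughly (valid assuming cond and
res don't alias, which alias analysis would have to establish):

void f (int n, int *cond, int *res) {
  int i, tmp = 0;
  for (i = 0; i < n; i++)
    if (*cond)
      tmp ^= 234;
  *res = tmp;
}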

Here's another partial dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.
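
i.e. something like:

int test (int a, int b, int c, int g) {
  int d, e, t = b * c;
  if (a)
    d = t;
  else
    d = b - c;
  e = t + g;
  return d + e;
}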

//===---------------------------------------------------------------------===//

[STORE SINKING]

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
}
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store.  We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[LOAD PRE CRIT EDGE SPLITTING]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic.  This could be handled by GVN with some crazy
symbolic phi translation.  The code we get looks like (g is on the stack):

bb2:            ; preds = %bb1
..
        %9 = getelementptr %struct.f* %g, i32 0, i32 0
        store i32 %8, i32* %9, align 4
        br label %bb3

bb3:            ; preds = %bb1, %bb2, %bb
        %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
        %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
        %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
        %11 = load i32* %10, align 4

%11 is partially redundant, and in BB2 it should have the value %8.

GCC PR33344 and PR35287 are similar cases.

//===---------------------------------------------------------------------===//

[LOAD PRE]

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite, ones we don't get yet are (checked through loadpre25):

[CRIT EDGE BREAKING]
loadpre3.c predcom-4.c

[PRE OF READONLY CALL]
loadpre5.c

[TURN SELECT INTO BRANCH]
loadpre14.c loadpre15.c

actually a conditional increment: loadpre18.c loadpre19.c

//===---------------------------------------------------------------------===//

[LOAD PRE / STORE SINKING / SPEC HACK]

This is a chunk of code from 456.hmmer:

int f(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int *dpp,
      int *tpdm, int xmb, int *bp, int *ms) {
  int k, sc;
  for (k = 1; k <= M; k++) {
    mc[k] = mpp[k-1] + tpmm[k-1];
    if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
    mc[k] += ms[k];
  }
}

It is very profitable for this benchmark to turn the conditional stores to
mc[k] into a conditional move (select instr in IR) and allow the final store
to do the store.  See GCC PR27313 for more details.  Note that this is valid
to xform even with the new C++ memory model, since mc[k] is previously loaded
and later stored.

//===---------------------------------------------------------------------===//

[SCALAR PRE]
There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the
GCC testsuite.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in
the GCC testsuite.  For example, we get the first example in predcom-1.c, but
miss the second one:

unsigned fib[1000];
unsigned avg[1000];

__attribute__ ((noinline))
void count_averages(int n) {
  int i;
  for (i = 1; i < n; i++)
    avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;
}

which compiles into two loads instead of one in the loop.

predcom-2.c is the same as predcom-1.c

predcom-3.c is very similar but needs loads feeding each other instead of
store->load.

//===---------------------------------------------------------------------===//

[ALIAS ANALYSIS]

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

We should do better analysis of posix_memalign.  At the least it should mark
its pointer argument no-capture; at best, we should know that the out-value
result doesn't point to anything (like malloc).  One example of this is in
SingleSource/Benchmarks/Misc/dt.c
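
i.e. the declaration could be annotated roughly like this (IR sketch; integer
widths assume a 64-bit target):

declare i32 @posix_memalign(i8** nocapture, i64, i64)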

//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store.  This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

struct X { int i; };
int foo (int x) {
  struct X a;
  struct X b;
  struct X *p;
  a.i = 1;
  b.i = 2;
  if (x)
    p = &a;
  else
    p = &b;
  return p->i;
}

//===---------------------------------------------------------------------===//

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629

With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
      opt -mem2reg -gvn -instcombine | llvm-dis

we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, since loading from null is
undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character.  See PR3253 for some notes on
codegen.
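
For the inlined constant-reject case, a sketch of the switch form
(hypothetical helper for strcspn(s, "ab")):

size_t strcspn_ab(const char *s) {
  size_t i = 0;
  for (;; ++i) {
    switch (s[i]) {        /* non-matching chars fall through and loop */
    case '\0': case 'a': case 'b':
      return i;
    }
  }
}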

456.hmmer apparently uses strcspn and strspn a lot.  471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
  else if (strchr ("<>", *intel_parser.op_string))

Those should be turned into a switch.

//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

        %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)    ; uses = 1
        %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
        call void @llvm.memcpy.i32(i8* %endptr,
          i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
        %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple reasons.  First, in this:

        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)

The strlen could be replaced with: %strlen = sub %3073, %3072, because the
strcpy call returns a pointer to the end of the string.  Based on that, the
endptr GEP just becomes equal to 3073, which eliminates a strlen call and GEP.

Second, the memcpy+strlen strlen can be replaced with:

        %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

Because the destination was just copied into the specified memory buffer.
This, in turn, can be constant folded to "4".

In other code, it contains:

        %endptr6978 = bitcast i8* %endptr69 to i32*
        store i32 7107374, i32* %endptr6978, align 1
        %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded.  Whatever is producing this should
probably be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit
        %682 = getelementptr i8** %argv, i32 6          ; <i8**> [#uses=2]
        %683 = load i8** %682, align 4                  ; <i8*> [#uses=4]
        %684 = load i8* %683, align 1                   ; <i8> [#uses=1]
        %685 = icmp eq i8 %684, 0                       ; <i1> [#uses=1]
        br i1 %685, label %bb10, label %bb9

bb9:            ; preds = %bb8
        %686 = call i32 @strlen(i8* %683) nounwind readonly
        %687 = icmp ugt i32 %686, 254                   ; <i1> [#uses=1]
        br i1 %687, label %bb10, label %bb11

bb10:           ; preds = %bb9, %bb8
        %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

//===---------------------------------------------------------------------===//
1473I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
1474which looks like:
1475 %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0
1476
1477
1478bb62: ; preds = %bb55, %bb53
1479 %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
1480 %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
1481 %172 = add i32 %171, -1 ; <i32> [#uses=1]
1482 %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172
1483
1484... no stores ...
1485 br i1 %or.cond, label %bb65, label %bb72
1486
1487bb65: ; preds = %bb62
1488 store i8 0, i8* %173, align 1
1489 br label %bb72
1490
1491bb72: ; preds = %bb65, %bb62
1492 %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
1493 %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
1494
1495Note that on the bb62->bb72 path, that the %177 strlen call is partially
1496redundant with the %171 call. At worst, we could shove the %177 strlen call
1497up into the bb65 block moving it out of the bb62->bb72 path. However, note
1498that bb65 stores to the string, zeroing out the last byte. This means that on
1499that path the value of %177 is actually just %171-1. A sub is cheaper than a
1500strlen!
1501
1502This pattern repeats several times, basically doing:
1503
1504 A = strlen(P);
1505 P[A-1] = 0;
1506 B = strlen(P);
1507 where it is "obvious" that B = A-1.
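
The desired rewrite, sketched in C (valid assuming nothing else writes the
string in between):

  A = strlen(P);
  P[A-1] = 0;
  B = A - 1;     /* reuse A instead of a second strlen(P) */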

//===---------------------------------------------------------------------===//

186.crafty also contains this code:

%1906 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1907 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1906
%1908 = call i8* @strcpy(i8* %1907, i8* %1905) nounwind align 1
%1909 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1910 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1909

With the strcpy converted to stpcpy (so %1908 points at the new terminating
NUL), the last strlen is computable as %1908-@pgn_event, which means
%1910 = %1908.

//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
  i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
  i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call @printf(i8* ...  @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

  memcpy(globalarray, "string");
  printf(...,  globalarray);

Anyway, since printf just reads the memory, forward substituting the string
directly into the printf call eliminates the reads from globalarray.  Since
this pattern occurs frequently in crafty (due to the "DisplayTime" and other
similar functions), there are many stores to "out".  Once all the printfs
stop using "out", all that is left is the memcpys into it.  This should allow
globalopt to remove the "stored only" global.

//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
  %tmp0 = load i8* %p
  %tmp1 = ashr i8 %tmp0, 5
  %tmp2 = sext i8 %tmp1 to i32
  ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load with a shift.
For example, on x86 this currently gets this:

        movb    (%eax), %al
        sarb    $5, %al
        movsbl  %al, %eax

while it could get this:

        movsbl  (%eax), %eax
        sarl    $5, %eax

//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; }     // --> return false
int test2(int x) { return 2-x == x; }    // --> return x == 1 ?

Always foldable for odd constants, what is the rule for even?
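
A hedged answer: in wrapping arithmetic, C-x == x iff 2*x == C (mod 2^32).
For odd C there is no solution, hence "return false".  For even C there are
two solutions, x = C/2 and x = C/2 + 0x80000000, so at the IR level (where
the sub wraps) the even case appears to fold to a masked compare, e.g.:

int test2(int x) { return (x & 0x7fffffff) == 1; }  // 2-x == x, wrapping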

//===---------------------------------------------------------------------===//

PR 3381: GEP to field of size 0 inside a struct could be turned into a GEP
for the next field in the struct (which is at the same address).

For example: a store of float into { {{}}, float } could be turned into a
store to the float directly.

//===---------------------------------------------------------------------===//

The arg promotion pass should make use of nocapture to make its alias analysis
much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int int_char(char m) {if(m>7) return 0; return m;}

//===---------------------------------------------------------------------===//

int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }

Generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %a, 128                            ; <i32> [#uses=1]
  %1 = icmp eq i32 %0, 0                          ; <i1> [#uses=1]
  %2 = or i32 %b, 128                             ; <i32> [#uses=1]
  %3 = and i32 %b, -129                           ; <i32> [#uses=1]
  %b_addr.0 = select i1 %1, i32 %3, i32 %2        ; <i32> [#uses=1]
  ret i32 %b_addr.0
}

However, it's functionally equivalent to:

  b = (b & ~0x80) | (a & 0x80);

Which generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %b, -129                           ; <i32> [#uses=1]
  %1 = and i32 %a, 128                            ; <i32> [#uses=1]
  %2 = or i32 %0, %1                              ; <i32> [#uses=1]
  ret i32 %2
}

This can be generalized for other forms:

  b = (b & ~0x80) | (a & 0x40) << 1;

//===---------------------------------------------------------------------===//

These two functions produce different code.  They shouldn't:

#include <stdint.h>

uint8_t p1(uint8_t b, uint8_t a) {
  b = (b & ~0xc0) | (a & 0xc0);
  return (b);
}

uint8_t p2(uint8_t b, uint8_t a) {
  b = (b & ~0x40) | (a & 0x40);
  b = (b & ~0x80) | (a & 0x80);
  return (b);
}

define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63                              ; <i8> [#uses=1]
  %1 = and i8 %a, -64                             ; <i8> [#uses=1]
  %2 = or i8 %1, %0                               ; <i8> [#uses=1]
  ret i8 %2
}

define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63                              ; <i8> [#uses=1]
  %.masked = and i8 %a, 64                        ; <i8> [#uses=1]
  %1 = and i8 %a, -128                            ; <i8> [#uses=1]
  %2 = or i8 %1, %0                               ; <i8> [#uses=1]
  %3 = or i8 %2, %.masked                         ; <i8> [#uses=1]
  ret i8 %3
}
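
The missed fold in p2 is merging the two single-bit copies from 'a' into one
mask; a hand-written equivalent (which is exactly what p1 computes) would be:

uint8_t p2_folded(uint8_t b, uint8_t a) {
  return (b & 0x3f) | (a & 0xc0);
}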

//===---------------------------------------------------------------------===//

IPSCCP does not currently propagate argument-dependent constants through
functions where it cannot see all of the callers.  This includes functions
with normal external linkage as well as templates, C99 inline functions, etc.
Specifically, it does nothing to:

define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
entry:
  %0 = add nsw i32 %y, %z
  %1 = mul i32 %0, %x
  %2 = mul i32 %y, %z
  %3 = add nsw i32 %1, %2
  ret i32 %3
}

define i32 @test2() nounwind {
entry:
  %0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
  ret i32 %0
}

It would be interesting to extend IPSCCP to handle simple cases like this,
where all of the arguments to a call are constant.  Because IPSCCP runs before
inlining, trivial templates and inline functions are not yet inlined.  The
results for a function + set of constant arguments should be memoized in a map.

//===---------------------------------------------------------------------===//

The libcall constant folding stuff should be moved out of SimplifyLibcalls into
libanalysis' constantfolding logic.  This would allow IPSCCP to be able to
handle simple things like this:

static int foo(const char *X) { return strlen(X); }
int bar() { return foo("abcd"); }
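
With that in place, IPSCCP plus the constant folder could reduce bar to the
equivalent of (a sketch of the desired end state, not current output):

int bar() { return 4; }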

//===---------------------------------------------------------------------===//

InstCombine should use SimplifyDemandedBits to remove the or instruction:

define i1 @test(i8 %x, i8 %y) {
  %A = or i8 %x, 1
  %B = icmp ugt i8 %A, 3
  ret i1 %B
}

Currently instcombine calls SimplifyDemandedBits with either all bits or just
the sign bit, if the comparison is obviously a sign test.  In this case, we
only need all but the bottom two bits from %A, and if we gave that mask to SDB
it would delete the or instruction for us.

//===---------------------------------------------------------------------===//

functionattrs doesn't know much about memcpy/memset.  This function should be
marked readnone rather than readonly, since it only twiddles local memory, but
functionattrs doesn't handle memset/memcpy/memmove aggressively:

struct X { int *p; int *q; };
int foo() {
  int i = 0, j = 1;
  struct X x, y;
  int **p;
  y.p = &i;
  x.q = &j;
  p = __builtin_memcpy (&x, &y, sizeof (int *));
  return **p;
}

//===---------------------------------------------------------------------===//

Missed instcombine transformation:
define i1 @a(i32 %x) nounwind readnone {
entry:
  %cmp = icmp eq i32 %x, 30
  %sub = add i32 %x, -30
  %cmp2 = icmp ugt i32 %sub, 9
  %or = or i1 %cmp, %cmp2
  ret i1 %or
}
This should be optimized to a single compare.  Testcase derived from gcc.
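
A hedged sketch of the folded form: the result is false exactly when x-30 is
in [1,9], so the whole thing collapses to one sub and one unsigned compare:

int a_folded(int x) { return (unsigned)(x - 31) > 8; }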

//===---------------------------------------------------------------------===//

Missed instcombine or reassociate transformation:
int a(int a, int b) { return (a==12)&(b>47)&(b<58); }

The sgt and slt should be combined into a single comparison.  Testcase derived
from gcc.
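
A sketch of the combined form: 47 < b && b < 58 is the half-open range
[48,58), i.e. a single unsigned compare after a sub:

int a_folded(int a, int b) { return (a == 12) & ((unsigned)(b - 48) < 10); }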

//===---------------------------------------------------------------------===//

Missed instcombine transformation:

  %382 = srem i32 %tmp14.i, 64                    ; [#uses=1]
  %383 = zext i32 %382 to i64                     ; [#uses=1]
  %384 = shl i64 %381, %383                       ; [#uses=1]
  %385 = icmp slt i32 %tmp14.i, 64                ; [#uses=1]

The srem can be transformed to an and because if %tmp14.i is negative, the
shift is undefined.  Testcase derived from 403.gcc.
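
In C terms (a sketch): since a shift amount of 64 or more is undefined, the
optimizer may assume %tmp14.i is non-negative here, and for non-negative x,
x % 64 == (x & 63):

unsigned long long shl_mod64(unsigned long long v, int x) {
  return v << (x & 63);  /* instead of v << (x % 64) */
}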

//===---------------------------------------------------------------------===//

This is a range comparison on a divided result (from 403.gcc):

  %1337 = sdiv i32 %1336, 8                       ; [#uses=1]
  %.off.i208 = add i32 %1336, 7                   ; [#uses=1]
  %1338 = icmp ult i32 %.off.i208, 15             ; [#uses=1]

We already catch this (removing the sdiv) if there isn't an add; we should
handle the 'add' as well.  This is a common idiom with its builtin_alloca
code.  C testcase:

int a(int x) { return (unsigned)(x/16+7) < 15; }
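
For truncating division, x/16+7 lands in [0,15) exactly when x is in
[-127,127], so a hand-folded form (a sketch) would be:

int a_folded(int x) { return (unsigned)(x + 127) < 255; }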

Another similar case involves truncations on 64-bit targets:

  %361 = sdiv i64 %.046, 8                        ; [#uses=1]
  %362 = trunc i64 %361 to i32                    ; [#uses=2]
...
  %367 = icmp eq i32 %362, 0                      ; [#uses=1]

//===---------------------------------------------------------------------===//

Missed instcombine/dagcombine transformation:
define void @lshift_lt(i8 zeroext %a) nounwind {
entry:
  %conv = zext i8 %a to i32
  %shl = shl i32 %conv, 3
  %cmp = icmp ult i32 %shl, 33
  br i1 %cmp, label %if.then, label %if.end

if.then:
  tail call void @bar() nounwind
  ret void

if.end:
  ret void
}
declare void @bar() nounwind

The shift should be eliminated: since %a is at most 255, "%shl ult 33" is
equivalent to "%conv ult 5".  Testcase derived from gcc.

//===---------------------------------------------------------------------===//

These compile into different code: one gets recognized as a switch and the
other doesn't, due to phase ordering issues (PR6212):

int test1(int mainType, int subType) {
  if (mainType == 7)
    subType = 4;
  else if (mainType == 9)
    subType = 6;
  else if (mainType == 11)
    subType = 9;
  return subType;
}

int test2(int mainType, int subType) {
  if (mainType == 7)
    subType = 4;
  if (mainType == 9)
    subType = 6;
  if (mainType == 11)
    subType = 9;
  return subType;
}

//===---------------------------------------------------------------------===//

The following test case (from PR6576):

define i32 @mul(i32 %a, i32 %b) nounwind readnone {
entry:
  %cond1 = icmp eq i32 %b, 0                      ; <i1> [#uses=1]
  br i1 %cond1, label %exit, label %bb.nph
bb.nph:                                           ; preds = %entry
  %tmp = mul i32 %b, %a                           ; <i32> [#uses=1]
  ret i32 %tmp
exit:                                             ; preds = %entry
  ret i32 0
}

could be reduced to:

define i32 @mul(i32 %a, i32 %b) nounwind readnone {
entry:
  %tmp = mul i32 %b, %a
  ret i32 %tmp
}

//===---------------------------------------------------------------------===//

We should use DSE + llvm.lifetime.end to delete dead vtable pointer updates.
See GCC PR34949.

Another interesting case: something similar could be used for variables that
become const after their ctor has finished.  In these cases, globalopt (which
can statically run the constructor) could mark the global const (so it gets
put in the readonly section).  A testcase would be:

#include <complex>
using namespace std;
const complex<char> should_be_in_rodata (42,-42);
complex<char> should_be_in_data (42,-42);
complex<char> should_be_in_bss;

Where we currently evaluate the ctors but the globals don't become const
because the optimizer doesn't know they "become const" after the ctor is done.
See GCC PR4131 for more examples.

//===---------------------------------------------------------------------===//

In this code:

long foo(long x) {
  return x > 1 ? x : 1;
}

LLVM emits a comparison with 1 instead of 0.  0 would be equivalent and
cheaper on most targets.

LLVM prefers comparisons with zero over non-zero in general, but in this case
it chooses instead to keep the max operation obvious.

//===---------------------------------------------------------------------===//

Take the following testcase on x86-64 (similar testcases exist for all targets
with addc/adde):

define void @a(i64* nocapture %s, i64* nocapture %t, i64 %a, i64 %b,
i64 %c) nounwind {
entry:
  %0 = zext i64 %a to i128                        ; <i128> [#uses=1]
  %1 = zext i64 %b to i128                        ; <i128> [#uses=1]
  %2 = add i128 %1, %0                            ; <i128> [#uses=2]
  %3 = zext i64 %c to i128                        ; <i128> [#uses=1]
  %4 = shl i128 %3, 64                            ; <i128> [#uses=1]
  %5 = add i128 %4, %2                            ; <i128> [#uses=1]
  %6 = lshr i128 %5, 64                           ; <i128> [#uses=1]
  %7 = trunc i128 %6 to i64                       ; <i64> [#uses=1]
  store i64 %7, i64* %s, align 8
  %8 = trunc i128 %2 to i64                       ; <i64> [#uses=1]
  store i64 %8, i64* %t, align 8
  ret void
}

Generated code:
        addq    %rcx, %rdx
        movl    $0, %eax
        adcq    $0, %rax
        addq    %r8, %rax
        movq    %rax, (%rdi)
        movq    %rdx, (%rsi)
        ret

Expected code:
        addq    %rcx, %rdx
        adcq    $0, %r8
        movq    %r8, (%rdi)
        movq    %rdx, (%rsi)
        ret

The generated SelectionDAG has an ADD of an ADDE, where both operands of the
ADDE are zero.  Replacing one of the operands of the ADDE with the other
operand of the ADD, and replacing the ADD with the ADDE, should give the
desired result.

(That said, we are doing a lot better than gcc on this testcase. :) )

//===---------------------------------------------------------------------===//

Switch lowering generates less than ideal code for the following switch:
define void @a(i32 %x) nounwind {
entry:
  switch i32 %x, label %if.end [
    i32 0, label %if.then
    i32 1, label %if.then
    i32 2, label %if.then
    i32 3, label %if.then
    i32 5, label %if.then
  ]
if.then:
  tail call void @foo() nounwind
  ret void
if.end:
  ret void
}
declare void @foo()

Generated code on x86-64 (other platforms give similar results):
a:
        cmpl    $5, %edi
        ja      .LBB0_2
        movl    %edi, %eax
        movl    $47, %ecx
        btq     %rax, %rcx
        jb      .LBB0_3
.LBB0_2:
        ret
.LBB0_3:
        jmp     foo  # TAILCALL

The movl+movl+btq+jb could be simplified to a cmpl+jne.

Or, if we wanted to be really clever, we could simplify the whole thing to
something like the following, which eliminates a branch:
        xorl    $1, %edi
        cmpl    $4, %edi
        ja      .LBB0_2
        jmp     foo  # TAILCALL
.LBB0_2:
        ret

//===---------------------------------------------------------------------===//

Given a branch where the two target blocks are identical ("ret i32 %b" in
both), simplifycfg will simplify them away.  But not so for a switch
statement:

define i32 @f(i32 %a, i32 %b) nounwind readnone {
entry:
  switch i32 %a, label %bb3 [
    i32 4, label %bb
    i32 6, label %bb
  ]

bb:             ; preds = %entry, %entry
  ret i32 %b

bb3:            ; preds = %entry
  ret i32 %b
}

//===---------------------------------------------------------------------===//

clang -O3 fails to devirtualize this virtual inheritance case (GCC PR45875);
looks related to PR3100:

struct c1 {};
struct c10 : c1{
  virtual void foo ();
};
struct c11 : c10, c1{
  virtual void f6 ();
};
struct c28 : virtual c11{
  void f6 ();
};
void check_c28 () {
  c28 obj;
  c11 *ptr = &obj;
  ptr->f6 ();
}

//===---------------------------------------------------------------------===//

We compile this:

int foo(int a) { return (a & (~15)) / 16; }

Into:

define i32 @foo(i32 %a) nounwind readnone ssp {
entry:
  %and = and i32 %a, -16
  %div = sdiv i32 %and, 16
  ret i32 %div
}

but (X & -A)/A is X >> log2(A) when A is a power of 2: the mask makes the
division exact, so the sdiv is just an arithmetic shift, and this case should
be instcombined into just "a >> 4".

We do get this at the codegen level, so something knows about it, but
instcombine should catch it earlier:

_foo:                                   ## @foo
## BB#0:                                ## %entry
        movl    %edi, %eax
        sarl    $4, %eax
        ret

//===---------------------------------------------------------------------===//

This code (from GCC PR28685):

int test(int a, int b) {
  int lt = a < b;
  int eq = a == b;
  if (lt)
    return 1;
  return eq;
}

Is compiled to:

define i32 @test(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %cmp = icmp slt i32 %a, %b
  br i1 %cmp, label %return, label %if.end

if.end:                                           ; preds = %entry
  %cmp5 = icmp eq i32 %a, %b
  %conv6 = zext i1 %cmp5 to i32
  ret i32 %conv6

return:                                           ; preds = %entry
  ret i32 1
}

it could be:

define i32 @test__(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = icmp sle i32 %a, %b
  %retval = zext i1 %0 to i32
  ret i32 %retval
}

//===---------------------------------------------------------------------===//