Target Independent Opportunities:

//===---------------------------------------------------------------------===//

Dead argument elimination should be enhanced to handle cases when an argument is
dead to an externally visible function. Though the argument can't be removed
from the externally visible function, the caller doesn't need to pass it in.
For example in this testcase:

  void foo(int X) __attribute__((noinline));
  void foo(int X) { sideeffect(); }
  void bar(int A) { foo(A+1); }

We compile bar to:

define void @bar(i32 %A) nounwind ssp {
  %0 = add nsw i32 %A, 1   ; <i32> [#uses=1]
  tail call void @foo(i32 %0) nounwind noinline ssp
  ret void
}

The add is dead, we could pass in 'i32 undef' instead. This occurs for C++
templates etc, which usually have linkonce_odr/weak_odr linkage, not internal
linkage.

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs. Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

We should recognize various "overflow detection" idioms and translate them into
llvm.uadd.with.overflow and similar intrinsics. Here is a multiply idiom:

unsigned int mul(unsigned int a, unsigned int b) {
  if ((unsigned long long)a*b > 0xffffffff)
    exit(0);
  return a*b;
}

The legalization code for mul-with-overflow needs to be made more robust before
this can be implemented though.

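For reference, a hedged sketch (function name made up) of the common C idiom
that should map onto llvm.uadd.with.overflow itself:

#include <stdlib.h>
unsigned int add(unsigned int a, unsigned int b) {
  if (a + b < a)      /* unsigned addition wrapped, i.e. overflowed */
    exit(0);
  return a + b;
}
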
//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math). Misc/mandel will like this. :) This isn't
safe in general, even on darwin. See the libm implementation of hypot for
examples (which special case when x/y are exactly zero to get signed zeros etc
right).

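A hedged example of the kind of caller that would benefit (hypot comes from
<math.h>; the expansion would only be legal under the conditions above):

#include <math.h>
double dist(double x, double y) {
  return hypot(x, y);   /* candidate for llvm.sqrt(x*x + y*y) */
}
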
//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
  x = 1ULL << i;

into:
  long long tmp = 1;
  for (i = ...; ++i, tmp+=tmp)
    x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

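A hedged C-level example of the pattern (only the sign bit of the loaded word
matters, so a one-byte load of the byte holding the sign bit would suffice):

int is_negative(int *P) {
  return *P < 0;
}
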
//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
   return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

First, the intrinsic needs to be extended to support integers, and second the
code generator needs to be enhanced to lower these to multiplication trees.

//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

This is blocked on not handling X*X*X -> powi(X, 3) (see note above). The issue
is that we end up getting t = 2*X, s = t*t and don't turn this into 4*X*X,
which is the same number of multiplies and is canonical, because the 2*X has
multiple uses. Here's a simple example:

define i32 @test15(i32 %X1) {
  %B = mul i32 %X1, 47   ; X1*47
  %C = mul i32 %B, %B
  ret i32 %C
}


//===---------------------------------------------------------------------===//

Reassociate should handle the example in GCC PR16157:

extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;
void f () {  /* this can be optimized to four additions... */
  b4 = a4 + a3 + a2 + a1 + a0;
  b3 = a3 + a2 + a1 + a0;
  b2 = a2 + a1 + a0;
  b1 = a1 + a0;
}

This requires reassociating to forms of expressions that are already available,
something that reassoc doesn't think about yet.


//===---------------------------------------------------------------------===//

This function: (derived from GCC PR19988)
double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x + -0.1234 * y));
}

compiles to:
_foo:
  movapd %xmm1, %xmm2
  mulsd  LCPI1_1(%rip), %xmm1
  mulsd  LCPI1_0(%rip), %xmm2
  addsd  %xmm0, %xmm1
  addsd  %xmm0, %xmm2
  movapd %xmm1, %xmm0
  mulsd  %xmm2, %xmm0
  ret

Reassociate should be able to turn it into:

double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x - 0.1234 * y));
}

Which allows the multiply by constant to be CSE'd, producing:

_foo:
  mulsd  LCPI1_0(%rip), %xmm1
  movapd %xmm1, %xmm2
  addsd  %xmm0, %xmm2
  subsd  %xmm1, %xmm0
  mulsd  %xmm2, %xmm0
  ret

This doesn't need -ffast-math support at all. This is particularly bad because
the llvm-gcc frontend is canonicalizing the latter into the former, but clang
doesn't have this problem.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j,int *l) { return memcmp(j,l,4); }
int h(int *j, int *l) { return *j - *l; }

this could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize. It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size. It works but can be overly conservative as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns. Instead
of:

  movl 136(%esp), %eax
  cmpl $0, %eax
  je LBB16_2  #cond_next
LBB16_1:  #cond_true
  incl _foo
LBB16_2:  #cond_next

emit:
  movl _foo, %eax
  cmpl $1, %edi
  sbbl $-1, %eax
  movl %eax, _foo

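For reference, a hedged sketch of the sort of C source that produces the
pattern above (names are made up):

int foo;
void test(int x) {
  if (x)
    foo++;
}
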
//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
  double sincos(double x, double *sin, double *cos);
  float sincosf(float x, float *sin, float *cos);
  long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers. See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs. We could even make an intrinsic for this
if anyone cared enough about sincos.

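A hedged example of the source pattern that should combine (sin/cos are the
usual <math.h> functions):

#include <math.h>
void test(double x, double *s, double *c) {
  *s = sin(x);   /* these two calls share an argument ...          */
  *c = cos(x);   /* ... and could become one sincos(x, s, c) call  */
}
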
//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

  for(i=0; i<reg->size; i++)
    {
      /* Flip the target bit of each basis state */
      reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
    }

Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be just
so cool to turn it into something like:

  long long Res = ((MAX_UNSIGNED) 1 << target);
  if (target < 32) {
    for(i=0; i<reg->size; i++)
      reg->node[i].state ^= Res & 0xFFFFFFFFULL;
  } else {
    for(i=0; i<reg->size; i++)
      reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL
  }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias reg->node[i], but
this requires TBAA.

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

BITBOARD = unsigned long long
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & i;
  return count;
}

This sort of thing should be added to the loop idiom pass.

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X / C1 ), C2
when X, C1, and C2 are unsigned. Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match. See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

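As a hedged illustration, the construct in question looks like:

unsigned f(unsigned x) {
  return (x / 10) == 2;   /* compare of a divide against a constant */
}
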
//===---------------------------------------------------------------------===//

[LOOP OPTIMIZATION]

SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization
opportunities in its double_array_divs_variable function: it needs loop
interchange, memory promotion (which LICM already does), vectorization and
variable trip count loop unrolling (since it has a constant trip count). ICC
apparently produces this very nice code with -ffast-math:

..B1.70:            # Preds ..B1.70 ..B1.69
  mulpd %xmm0, %xmm1    #108.2
  mulpd %xmm0, %xmm1    #108.2
  mulpd %xmm0, %xmm1    #108.2
  mulpd %xmm0, %xmm1    #108.2
  addl  $8, %edx        #
  cmpl  $131072, %edx   #108.2
  jb ..B1.70            # Prob 99%   #108.2

It would be better to count down to zero, but this is a lot better than what we
do.

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
  U64 effective_addr2;
  U32 temp = *inst;
  int r1 = (temp >> 20) & 0xf;
  int b2 = (temp >> 16) & 0xf;
  effective_addr2 = temp & 0xfff;
  if (b2) effective_addr2 += regs[b2];
  b2 = (temp >> 12) & 0xf;
  if (b2) effective_addr2 += regs[b2];
  effective_addr2 &= regs[4];
  if ((effective_addr2 & 3) == 0)
    return 1;
  return 0;
}

Note that only the low 2 bits of effective_addr2 are used. On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags. On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

PHI Slicing could be extended to do this.

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has from TargetData. This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two near identical IV's (after promotion) on PPC/ARM:

LBB1_2:
  ldr r3, LCPI1_0
  ldr r3, [r3]
  strh r2, [r3]
  ldr r3, LCPI1_1
  ldr r3, [r3]
  strh r1, [r3]
  add r1, r1, #4
  add r2, r2, #1   <- [0,+,1]
  sub r0, r0, #1   <- [0,-,1]
  cmp r0, #0
  bne LBB1_2

LSR should reuse the "+" IV for the exit test.

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
  %tmp.1 = and i32 %a, 1   ; <i32> [#uses=1]
  %tmp.2 = icmp ne i32 %tmp.1, 0   ; <i1> [#uses=1]
  br i1 %tmp.2, label %then.0, label %else.0

then.0:   ; preds = %entry
  %tmp.5 = add i32 %a, -1   ; <i32> [#uses=1]
  %tmp.3 = call i32 @t4( i32 %tmp.5 )   ; <i32> [#uses=1]
  br label %return

else.0:   ; preds = %entry
  %tmp.7 = icmp ne i32 %a, 0   ; <i1> [#uses=1]
  br i1 %tmp.7, label %then.1, label %return

then.1:   ; preds = %else.0
  %tmp.11 = add i32 %a, -2   ; <i32> [#uses=1]
  %tmp.9 = call i32 @t4( i32 %tmp.11 )   ; <i32> [#uses=1]
  br label %return

return:   ; preds = %then.1, %else.0, %then.0
  %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                      [ %tmp.9, %then.1 ]
  ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative. "return foo() << 1" can be tail recursion eliminated.

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
  %tmp = load i32* %x   ; <i32> [#uses=0]
  %tmp.foo = call i32 @foo( i32* %x )   ; <i32> [#uses=1]
  ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
  %tmp3 = call i32 @foo( i32* %x )   ; <i32> [#uses=1]
  ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass. Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
  subl $28, %esp
  call "L1$pb"
"L1$pb":
  popl %eax
  cmpl $0, 32(%esp)
  je LBB1_2  # cond_true
LBB1_1:  # return
  # ...
  addl $28, %esp
  ret
LBB1_2:  # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block. It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it. This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86. If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

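A hedged example of the kind of sparse switch this would apply to (the case
values are made up):

int classify(int x) {
  switch (x) {
  case 17:    return 1;
  case 400:   return 2;
  case 9001:  return 3;
  case 65536: return 4;
  default:    return 0;
  }
}
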
//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations. On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar. On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable. For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-m64 -O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:   ## @_Z9GetHotKeyv
  movq _m_HotKey@GOTPCREL(%rip), %rax
  movzwl (%rax), %ecx
  movzbl 2(%rax), %edx
  shlq $16, %rdx
  orq %rcx, %rdx
  movzbl 3(%rax), %ecx
  shlq $24, %rcx
  orq %rdx, %rcx
  movzbl 4(%rax), %eax
  shlq $32, %rax
  orq %rcx, %rax
  ret

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

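A hedged example of code that would map onto such a node (floor from <math.h>):

#include <math.h>
double round_down(double x) {
  return floor(x);   /* a single instruction on targets with a legal floor */
}
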
//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,0,1,0,1,0,1,0};
  foo(input);
}

Clang compiles this into:

  call void @llvm.memset.p0i8.i64(i8* %tmp, i8 0, i64 64, i32 16, i1 false)
  %0 = getelementptr [8 x i64]* %input, i64 0, i64 0
  store i64 1, i64* %0, align 16
  %1 = getelementptr [8 x i64]* %input, i64 0, i64 2
  store i64 1, i64* %1, align 16
  %2 = getelementptr [8 x i64]* %input, i64 0, i64 4
  store i64 1, i64* %2, align 16
  %3 = getelementptr [8 x i64]* %input, i64 0, i64 6
  store i64 1, i64* %3, align 16

Which gets codegen'd into:

  pxor %xmm0, %xmm0
  movaps %xmm0, -16(%rbp)
  movaps %xmm0, -32(%rbp)
  movaps %xmm0, -48(%rbp)
  movaps %xmm0, -64(%rbp)
  movq $1, -64(%rbp)
  movq $1, -48(%rbp)
  movq $1, -32(%rbp)
  movq $1, -16(%rbp)

It would be better to have 4 movq's of 0 instead of the movaps's.

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef". Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop. One trivial example is:

#include <stdio.h>
int main() {
  int nRet = 17;
  int nLoop;
  for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
    if ( nLoop & 1 )
      nRet += 2;
    else
      nRet -= 1;
  }
  return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size. The resultant code would then also be suitable for
exit value computation.

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc, x86,
etc. On X86, we miss a bunch of 'rotate by variable' cases because the rotate
matching code in dag combine doesn't look through truncates aggressively
enough. Here are some testcases reduced from GCC PR17886:

unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

//===---------------------------------------------------------------------===//

This (and similar related idioms):

unsigned int foo(unsigned char i) {
  return i | (i<<8) | (i<<16) | (i<<24);
}

compiles into:

define i32 @foo(i8 zeroext %i) nounwind readnone ssp noredzone {
entry:
  %conv = zext i8 %i to i32
  %shl = shl i32 %conv, 8
  %shl5 = shl i32 %conv, 16
  %shl9 = shl i32 %conv, 24
  %or = or i32 %shl9, %conv
  %or6 = or i32 %or, %shl5
  %or10 = or i32 %or6, %shl
  ret i32 %or10
}

it would be better as:

unsigned int bar(unsigned char i) {
  unsigned int j=i | (i << 8);
  return j | (j<<16);
}

aka:

define i32 @bar(i8 zeroext %i) nounwind readnone ssp noredzone {
entry:
  %conv = zext i8 %i to i32
  %shl = shl i32 %conv, 8
  %or = or i32 %shl, %conv
  %shl5 = shl i32 %or, 16
  %or6 = or i32 %shl5, %or
  ret i32 %or6
}

or even i*0x01010101, depending on the speed of the multiplier. The best way to
handle this is to canonicalize it to a multiply in IR and have codegen handle
lowering multiplies to shifts on cpus where shifts are faster.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together. For
example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy. This can only
be done safely if "b" isn't modified between the strlen and memcpy of course.

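A hedged sketch of the pattern (dst is assumed large enough, and src is not
modified in between):

#include <string.h>
void copy(char *dst, const char *src) {
  memcpy(dst, src, strlen(src));   /* candidate for the strcpy-style merge */
}
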
//===---------------------------------------------------------------------===//

We compile this program: (from GCC PR11680)
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

Into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0% 0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0% 0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2). */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper). Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

This is equivalent to the following, where 2863311531 is the multiplicative
inverse of 3, and 1431655766 is ((2^32)-1)/3+1:
void bar(unsigned n) {
  if (n * 2863311531U < 1431655766U)
    true();
}

The same transformation can work with an even modulo with the addition of a
rotate: rotate the result of the multiply to the right by the number of bits
which need to be zero for the condition to be true, and shrink the compare RHS
by the same amount. Unless the target supports rotates, though, that
transformation probably isn't worthwhile.

The transformation can also easily be made to work with non-zero equality
comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0".

//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6   ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0. Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
  if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
    f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK)". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing. Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
  return (abs(x)) >= 0;
}
This should optimize to x == INT_MIN. (With -fwrapv.) Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
  a = (a << 10) | (a >> 22);
  if (a == 123)
    bar ();
}
void
minus_cst (unsigned int a)
{
  unsigned int tem;

  tem = 20 - a;
  if (tem == 5)
    bar ();
}
void
mask_gt (unsigned int a)
{
  /* This is equivalent to a > 15. */
  if ((a & ~7) > 8)
    bar ();
}
void
rshift_gt (unsigned int a)
{
  /* This is equivalent to a > 23. */
  if ((a >> 2) > 5)
    bar ();
}
All should simplify to a single comparison. All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1". Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)". Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a". Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane. Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c). Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31". Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)". Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

  %tmp = icmp eq i32 %decl_context, 4
  %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
  %tmp1 = icmp eq i32 %decl_context_addr.0, 1
  %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp || decl_context == 1)

This allows recursive simplifications, tmp1 is used all over the place in
the function, e.g. by:

  %tmp23 = icmp eq i32 %decl_context_addr.1, 0   ; <i1> [#uses=1]
  %tmp24 = xor i1 %tmp1, true   ; <i1> [#uses=1]
  %or.cond8 = and i1 %tmp23, %tmp24   ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

[STORE SINKING]

Store sinking: This code:

void f (int n, int *cond, int *res) {
  int i;
  *res = 0;
  for (i = 0; i < n; i++)
    if (*cond)
      *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out. This gives us this code:

bb:   ; preds = %bb2, %entry
  %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
  %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
  %1 = load i32* %cond, align 4
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %bb2, label %bb1

bb1:   ; preds = %bb
  %3 = xor i32 %.rle, 234
  store i32 %3, i32* %res, align 4
  br label %bb2

bb2:   ; preds = %bb, %bb1
  %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
  %indvar.next = add i32 %i.05, 1
  %exitcond = icmp eq i32 %indvar.next, %n
  br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partial dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.


//===---------------------------------------------------------------------===//
This simple function from 179.art:

int winner, numf2s;
struct { double y; int reset; } *Y;

void find_match() {
  int i;
  winner = 0;
  for (i=0;i<numf2s;i++)
    if (Y[i].y > Y[winner].y)
      winner =i;
}

Compiles into (with clang TBAA):

for.body:   ; preds = %for.inc, %bb.nph
  %indvar = phi i64 [ 0, %bb.nph ], [ %indvar.next, %for.inc ]
  %i.01718 = phi i32 [ 0, %bb.nph ], [ %i.01719, %for.inc ]
  %tmp4 = getelementptr inbounds %struct.anon* %tmp3, i64 %indvar, i32 0
  %tmp5 = load double* %tmp4, align 8, !tbaa !4
  %idxprom7 = sext i32 %i.01718 to i64
  %tmp10 = getelementptr inbounds %struct.anon* %tmp3, i64 %idxprom7, i32 0
  %tmp11 = load double* %tmp10, align 8, !tbaa !4
  %cmp12 = fcmp ogt double %tmp5, %tmp11
  br i1 %cmp12, label %if.then, label %for.inc

if.then:   ; preds = %for.body
  %i.017 = trunc i64 %indvar to i32
  br label %for.inc

for.inc:   ; preds = %for.body, %if.then
  %i.01719 = phi i32 [ %i.01718, %for.body ], [ %i.017, %if.then ]
  %indvar.next = add i64 %indvar, 1
  %exitcond = icmp eq i64 %indvar.next, %tmp22
  br i1 %exitcond, label %for.cond.for.end_crit_edge, label %for.body


It is good that we hoisted the reloads of numf2s and Y out of the loop and
sunk the store to winner out.

However, this is awful on several levels: the conditional truncate in the loop
(-indvars at fault? why can't we completely promote the IV to i64?).

Beyond that, we have a partially redundant load in the loop: if "winner" (aka
%i.01718) isn't updated, we reload Y[winner].y the next time through the loop.
Similarly, the addressing that feeds it (including the sext) is redundant. In
the end we get this generated assembly:

LBB0_2:   ## %for.body
  ## =>This Inner Loop Header: Depth=1
  movsd (%rdi), %xmm0
  movslq %edx, %r8
  shlq $4, %r8
  ucomisd (%rcx,%r8), %xmm0
  jbe LBB0_4
  movl %esi, %edx
LBB0_4:   ## %for.inc
  addq $16, %rdi
  incq %rsi
  cmpq %rsi, %rax
  jne LBB0_2

All things considered this isn't too bad, but we shouldn't need the movslq or
the shlq instruction, or the load folded into ucomisd every time through the
loop.

On an x86-specific topic, if the loop can't be restructured, the movl should be a
cmov.

//===---------------------------------------------------------------------===//

[STORE SINKING]

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store. We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[LOAD PRE CRIT EDGE SPLITTING]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic. This could be handled by GVN with some crazy
symbolic phi translation. The code we get looks like (g is on the stack):

bb2:   ; preds = %bb1
..
  %9 = getelementptr %struct.f* %g, i32 0, i32 0
  store i32 %8, i32* %9, align bel %bb3

bb3:   ; preds = %bb1, %bb2, %bb
  %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
  %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
  %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
  %11 = load i32* %10, align 4

%11 is partially redundant, and in BB2 it should have the value %8.

GCC PR33344 and PR35287 are similar cases.


//===---------------------------------------------------------------------===//

[LOAD PRE]

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite, ones we don't get yet are (checked through loadpre25):

[CRIT EDGE BREAKING]
loadpre3.c predcom-4.c

[PRE OF READONLY CALL]
loadpre5.c

[TURN SELECT INTO BRANCH]
loadpre14.c loadpre15.c

actually a conditional increment: loadpre18.c loadpre19.c

//===---------------------------------------------------------------------===//

[LOAD PRE / STORE SINKING / SPEC HACK]

This is a chunk of code from 456.hmmer:

int f(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int *dpp,
      int *tpdm, int xmb, int *bp, int *ms) {
  int k, sc;
  for (k = 1; k <= M; k++) {
    mc[k] = mpp[k-1] + tpmm[k-1];
    if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
    mc[k] += ms[k];
  }
}

It is very profitable for this benchmark to turn the conditional stores to mc[k]
into a conditional move (select instr in IR) and allow the final store to do the
store. See GCC PR27313 for more details. Note that this is valid to xform even
with the new C++ memory model, since mc[k] is previously loaded and later
stored.

//===---------------------------------------------------------------------===//

[SCALAR PRE]
There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the
GCC testsuite.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite. For example, we get the first example in predcom-1.c, but
miss the second one:

unsigned fib[1000];
unsigned avg[1000];

__attribute__ ((noinline))
void count_averages(int n) {
  int i;
  for (i = 1; i < n; i++)
    avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;
}

which compiles into two loads instead of one in the loop.

predcom-2.c is the same as predcom-1.c

predcom-3.c is very similar but needs loads feeding each other instead of
store->load.


//===---------------------------------------------------------------------===//

[ALIAS ANALYSIS]

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

We should do better analysis of posix_memalign. At the least it should
no-capture its pointer argument, at best, we should know that the out-value
result doesn't point to anything (like malloc). One example of this is in
SingleSource/Benchmarks/Misc/dt.c

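A hedged example of the kind of use that better modeling of posix_memalign
would help with (function name made up):

#include <stdlib.h>
double *alloc_vec(size_t n) {
  void *p = 0;
  if (posix_memalign(&p, 16, n * sizeof(double)))
    return 0;
  return (double *)p;   /* fresh memory: shouldn't be treated as aliasing anything else */
}
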
//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store. This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

struct X { int i; };
int foo (int x) {
  struct X a;
  struct X b;
  struct X *p;
  a.i = 1;
  b.i = 2;
  if (x)
    p = &a;
  else
    p = &b;
  return p->i;
}

//===---------------------------------------------------------------------===//

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
  opt -mem2reg -gvn -instcombine | llvm-dis
we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, loading from null is undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character. See PR3253 for some notes on
codegen.

456.hmmer apparently uses strcspn and strspn a lot. 471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
  else if (strchr ("<>", *intel_parser.op_string)

Those should be turned into a switch.

//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

  %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
  %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
  %strlen = call i32 @strlen(i8* %3072)   ; uses = 1
  %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
  call void @llvm.memcpy.i32(i8* %endptr,
    i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
  %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple of reasons. First, in this code the
strlen that follows the memcpy can be replaced with:

  %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

Because the destination was just copied into the specified memory buffer. This,
in turn, can be constant folded to "4".

In other code, it contains:

  %endptr6978 = bitcast i8* %endptr69 to i32*
  store i32 7107374, i32* %endptr6978, align 1
  %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded. Whatever is producing this should probably
be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:   ; preds = %_ZN18eonImageCalculatorC1Ev.exit
  %682 = getelementptr i8** %argv, i32 6   ; <i8**> [#uses=2]
  %683 = load i8** %682, align 4   ; <i8*> [#uses=4]
  %684 = load i8* %683, align 1   ; <i8> [#uses=1]
  %685 = icmp eq i8 %684, 0   ; <i1> [#uses=1]
  br i1 %685, label %bb10, label %bb9

bb9:   ; preds = %bb8
  %686 = call i32 @strlen(i8* %683) nounwind readonly
  %687 = icmp ugt i32 %686, 254   ; <i1> [#uses=1]
  br i1 %687, label %bb10, label %bb11

bb10:   ; preds = %bb9, %bb8
  %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

1410//===---------------------------------------------------------------------===//
Chris Lattner9fee08f2009-01-08 07:34:55 +00001411
1412I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
1413which looks like:
1414 %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0
1415
1416
1417bb62: ; preds = %bb55, %bb53
1418 %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
1419 %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
1420 %172 = add i32 %171, -1 ; <i32> [#uses=1]
1421 %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172
1422
1423... no stores ...
1424 br i1 %or.cond, label %bb65, label %bb72
1425
1426bb65: ; preds = %bb62
1427 store i8 0, i8* %173, align 1
1428 br label %bb72
1429
1430bb72: ; preds = %bb65, %bb62
1431 %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
1432 %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
1433
1434Note that on the bb62->bb72 path, the %177 strlen call is partially
1435redundant with the %171 call. At worst, we could shove the %177 strlen call
1436up into the bb65 block moving it out of the bb62->bb72 path. However, note
1437that bb65 stores to the string, zeroing out the last byte. This means that on
1438that path the value of %177 is actually just %171-1. A sub is cheaper than a
1439strlen!
1440
1441This pattern repeats several times, basically doing:
1442
1443 A = strlen(P);
1444 P[A-1] = 0;
1445 B = strlen(P);
1446 where it is "obvious" that B = A-1.
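
In other words, the final strlen should become a subtraction, roughly:

 A = strlen(P);
 P[A-1] = 0;
 B = A-1;   // reuse the earlier result instead of a second strlen call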
1447
1448//===---------------------------------------------------------------------===//
1449
Chris Lattner9fee08f2009-01-08 07:34:55 +00001450186.crafty has this interesting pattern with the "out.4543" variable:
1451
1452call void @llvm.memcpy.i32(
1453 i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
1454 i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
1455%101 = call @printf(i8* ... @out.4543, i32 0, i32 0)) nounwind
1456
1457It is basically doing:
1458
1459 memcpy(globalarray, "string");
1460 printf(..., globalarray);
1461
1462Anyway, by knowing that printf just reads the memory and forward substituting
1463the string directly into the printf, this eliminates reads from globalarray.
1464Since this pattern occurs frequently in crafty (due to the "DisplayTime" and
1465other similar functions) there are many stores to "out". Once all the printfs
1466stop using "out", all that is left is the memcpy's into it. This should allow
1467globalopt to remove the "stored only" global.
1468
1469//===---------------------------------------------------------------------===//
1470
Dan Gohman8289b052009-01-20 01:07:33 +00001471This code:
1472
1473define inreg i32 @foo(i8* inreg %p) nounwind {
1474 %tmp0 = load i8* %p
1475 %tmp1 = ashr i8 %tmp0, 5
1476 %tmp2 = sext i8 %tmp1 to i32
1477 ret i32 %tmp2
1478}
1479
1480could be dagcombine'd to a sign-extending load with a shift.
1481For example, on x86 this currently gets this:
1482
1483 movb (%eax), %al
1484 sarb $5, %al
1485 movsbl %al, %eax
1486
1487while it could get this:
1488
1489 movsbl (%eax), %eax
1490 sarl $5, %eax
1491
1492//===---------------------------------------------------------------------===//
Chris Lattner256baa42009-01-22 07:16:03 +00001493
1494GCC PR31029:
1495
1496int test(int x) { return 1-x == x; } // --> return false
1497int test2(int x) { return 2-x == x; } // --> return x == 1 ?
1498
1499Always foldable for odd constants: C-x == x would require x+x == C, and x+x is
always even.  What is the rule for even constants?
1500
1501//===---------------------------------------------------------------------===//
1502
Torok Edwine46a6862009-01-24 19:30:25 +00001503PR 3381: GEP to field of size 0 inside a struct could be turned into GEP
1504for next field in struct (which is at same address).
1505
1506For example: store of float into { {{}}, float } could be turned into a store to
1507the float directly.
1508
Torok Edwin474479f2009-02-20 18:42:06 +00001509//===---------------------------------------------------------------------===//
Nick Lewycky20babb12009-02-25 06:52:48 +00001510
Chris Lattner32c5f172009-05-11 17:41:40 +00001511The arg promotion pass should make use of nocapture to make its alias analysis
1512stuff much more precise.
1513
1514//===---------------------------------------------------------------------===//
1515
1516The following functions should be optimized to use a select instead of a
1517branch (from gcc PR40072):
1518
1519char char_int(int m) {if(m>7) return 0; return m;}
1520int int_char(char m) {if(m>7) return 0; return m;}
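
For char_int, the branch-free form would look roughly like this (hand-written
sketch; the exact parameter/return attributes are guesses):

define signext i8 @char_int(i32 %m) nounwind readnone {
entry:
  %cmp = icmp sgt i32 %m, 7
  %conv = trunc i32 %m to i8
  %retval = select i1 %cmp, i8 0, i8 %conv
  ret i8 %retval
}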
1521
1522//===---------------------------------------------------------------------===//
1523
Bill Wendling5a569272009-10-27 22:48:31 +00001524int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }
1525
1526Generates this:
1527
1528define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
1529entry:
1530 %0 = and i32 %a, 128 ; <i32> [#uses=1]
1531 %1 = icmp eq i32 %0, 0 ; <i1> [#uses=1]
1532 %2 = or i32 %b, 128 ; <i32> [#uses=1]
1533 %3 = and i32 %b, -129 ; <i32> [#uses=1]
1534 %b_addr.0 = select i1 %1, i32 %3, i32 %2 ; <i32> [#uses=1]
1535 ret i32 %b_addr.0
1536}
1537
1538However, it's functionally equivalent to:
1539
1540 b = (b & ~0x80) | (a & 0x80);
1541
1542Which generates this:
1543
1544define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
1545entry:
1546 %0 = and i32 %b, -129 ; <i32> [#uses=1]
1547 %1 = and i32 %a, 128 ; <i32> [#uses=1]
1548 %2 = or i32 %0, %1 ; <i32> [#uses=1]
1549 ret i32 %2
1550}
1551
1552This can be generalized for other forms:
1553
1554 b = (b & ~0x80) | (a & 0x40) << 1;
1555
1556//===---------------------------------------------------------------------===//
Bill Wendlingc872e9c2009-10-27 23:30:07 +00001557
1558These two functions produce different code. They shouldn't:
1559
1560#include <stdint.h>
1561
1562uint8_t p1(uint8_t b, uint8_t a) {
1563 b = (b & ~0xc0) | (a & 0xc0);
1564 return (b);
1565}
1566
1567uint8_t p2(uint8_t b, uint8_t a) {
1568 b = (b & ~0x40) | (a & 0x40);
1569 b = (b & ~0x80) | (a & 0x80);
1570 return (b);
1571}
1572
1573define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
1574entry:
1575 %0 = and i8 %b, 63 ; <i8> [#uses=1]
1576 %1 = and i8 %a, -64 ; <i8> [#uses=1]
1577 %2 = or i8 %1, %0 ; <i8> [#uses=1]
1578 ret i8 %2
1579}
1580
1581define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
1582entry:
1583 %0 = and i8 %b, 63 ; <i8> [#uses=1]
1584 %.masked = and i8 %a, 64 ; <i8> [#uses=1]
1585 %1 = and i8 %a, -128 ; <i8> [#uses=1]
1586 %2 = or i8 %1, %0 ; <i8> [#uses=1]
1587 %3 = or i8 %2, %.masked ; <i8> [#uses=1]
1588 ret i8 %3
1589}
1590
1591//===---------------------------------------------------------------------===//
Chris Lattner6fdfc9c2009-11-11 17:51:27 +00001592
1593IPSCCP does not currently propagate argument-dependent constants through
1594functions where it does not see all of the callers. This includes functions
1595with normal external linkage as well as templates, C99 inline functions etc.
1596Specifically, it does nothing to:
1597
1598define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
1599entry:
1600 %0 = add nsw i32 %y, %z
1601 %1 = mul i32 %0, %x
1602 %2 = mul i32 %y, %z
1603 %3 = add nsw i32 %1, %2
1604 ret i32 %3
1605}
1606
1607define i32 @test2() nounwind {
1608entry:
1609 %0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
1610 ret i32 %0
1611}
1612
1613It would be interesting to extend IPSCCP to be able to handle simple cases like
1614this, where all of the arguments to a call are constant (here @test2 should
1615simplify to 'ret i32 14').  Because IPSCCP runs before inlining, trivial
templates and inline functions are not yet inlined.
1616The results for a function + set of constant arguments should be memoized in a
1617map.
1618
1619//===---------------------------------------------------------------------===//
Chris Lattnerfc926c22009-11-11 17:54:02 +00001620
1621The libcall constant folding stuff should be moved out of SimplifyLibcalls into
1622libanalysis' constantfolding logic. This would allow IPSCCP to be able to
1623handle simple things like this:
1624
1625static int foo(const char *X) { return strlen(X); }
1626int bar() { return foo("abcd"); }
1627
1628//===---------------------------------------------------------------------===//
Nick Lewycky93f9f7a2009-11-15 17:51:23 +00001629
Duncan Sandse10920d2010-01-06 15:37:47 +00001630functionattrs doesn't know much about memcpy/memset. This function should be
Duncan Sands7c422ac2010-01-06 08:45:52 +00001631marked readnone rather than readonly, since it only twiddles local memory, but
1632functionattrs doesn't handle memset/memcpy/memmove aggressively:
Chris Lattner89742c22009-12-03 07:43:46 +00001633
1634struct X { int *p; int *q; };
1635int foo() {
1636 int i = 0, j = 1;
1637 struct X x, y;
1638 int **p;
1639 y.p = &i;
1640 x.q = &j;
1641 p = __builtin_memcpy (&x, &y, sizeof (int *));
1642 return **p;
1643}
1644
Chris Lattner9c8fb9e2011-01-01 22:52:11 +00001645This can be seen at:
1646$ clang t.c -S -o - -mkernel -O0 -emit-llvm | opt -functionattrs -S
1647
1648
Chris Lattner05332172009-12-03 07:41:54 +00001649//===---------------------------------------------------------------------===//
1650
Eli Friedman9cfb3ad2010-01-18 22:36:59 +00001651Missed instcombine transformation:
1652define i1 @a(i32 %x) nounwind readnone {
1653entry:
1654 %cmp = icmp eq i32 %x, 30
1655 %sub = add i32 %x, -30
1656 %cmp2 = icmp ugt i32 %sub, 9
1657 %or = or i1 %cmp, %cmp2
1658 ret i1 %or
1659}
1660This should be optimized to a single compare. Testcase derived from gcc.
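
One possible folded form (hand-derived, not instcombine's verified output):

  %sub = add i32 %x, -31
  %or = icmp ugt i32 %sub, 8
  ret i1 %or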
1661
1662//===---------------------------------------------------------------------===//
1663
Eli Friedman9cfb3ad2010-01-18 22:36:59 +00001664Missed instcombine or reassociate transformation:
1665int a(int a, int b) { return (a==12)&(b>47)&(b<58); }
1666
1667The sgt and slt should be combined into a single comparison. Testcase derived
1668from gcc.
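
The combined form would be roughly this (hand-written, using the usual
unsigned range-check trick):

int a(int a, int b) { return (a==12) & ((unsigned)b - 48 < 10); }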
1669
1670//===---------------------------------------------------------------------===//
1671
1672Missed instcombine transformation:
Chris Lattner3e411062010-11-21 07:05:31 +00001673
1674 %382 = srem i32 %tmp14.i, 64 ; [#uses=1]
1675 %383 = zext i32 %382 to i64 ; [#uses=1]
1676 %384 = shl i64 %381, %383 ; [#uses=1]
1677 %385 = icmp slt i32 %tmp14.i, 64 ; [#uses=1]
1678
Benjamin Kramerc21a8212010-11-23 20:33:57 +00001679The srem can be transformed to an and because if %tmp14.i is negative, the
1680shift is undefined. Testcase derived from 403.gcc.
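
i.e. the snippet could become (sketch):

 %382 = and i32 %tmp14.i, 63 ; [#uses=1]
 %383 = zext i32 %382 to i64 ; [#uses=1]
 %384 = shl i64 %381, %383 ; [#uses=1]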
Chris Lattner3e411062010-11-21 07:05:31 +00001681
1682//===---------------------------------------------------------------------===//
1683
1684This is a range comparison on a divided result (from 403.gcc):
1685
1686 %1337 = sdiv i32 %1336, 8 ; [#uses=1]
1687 %.off.i208 = add i32 %1336, 7 ; [#uses=1]
1688 %1338 = icmp ult i32 %.off.i208, 15 ; [#uses=1]
1689
1690We already catch this (removing the sdiv) when there isn't an add; we should
1691handle the 'add' as well. This is a common idiom in its builtin_alloca code.
1692C testcase:
1693
1694int a(int x) { return (unsigned)(x/16+7) < 15; }
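
which (hand-derived; the boundary values are worth double-checking) should
fold to:

int a(int x) { return ((unsigned)x + 127) < 255; }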
1695
1696Another similar case involves truncations on 64-bit targets:
1697
1698 %361 = sdiv i64 %.046, 8 ; [#uses=1]
1699 %362 = trunc i64 %361 to i32 ; [#uses=2]
1700...
1701 %367 = icmp eq i32 %362, 0 ; [#uses=1]
1702
Eli Friedman1144d7e2010-01-31 04:55:32 +00001703//===---------------------------------------------------------------------===//
1704
1705Missed instcombine/dagcombine transformation:
1706define void @lshift_lt(i8 zeroext %a) nounwind {
1707entry:
1708 %conv = zext i8 %a to i32
1709 %shl = shl i32 %conv, 3
1710 %cmp = icmp ult i32 %shl, 33
1711 br i1 %cmp, label %if.then, label %if.end
1712
1713if.then:
1714 tail call void @bar() nounwind
1715 ret void
1716
1717if.end:
1718 ret void
1719}
1720declare void @bar() nounwind
1721
1722The shift should be eliminated. Testcase derived from gcc.
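
One form the folded compare could take (sketch):

 %cmp = icmp ult i8 %a, 5
 br i1 %cmp, label %if.then, label %if.end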
Eli Friedman9cfb3ad2010-01-18 22:36:59 +00001723
1724//===---------------------------------------------------------------------===//
Chris Lattnercf031f62010-02-09 00:11:10 +00001725
1726These compile into different code, one gets recognized as a switch and the
1727other doesn't due to phase ordering issues (PR6212):
1728
1729int test1(int mainType, int subType) {
1730 if (mainType == 7)
1731 subType = 4;
1732 else if (mainType == 9)
1733 subType = 6;
1734 else if (mainType == 11)
1735 subType = 9;
1736 return subType;
1737}
1738
1739int test2(int mainType, int subType) {
1740 if (mainType == 7)
1741 subType = 4;
1742 if (mainType == 9)
1743 subType = 6;
1744 if (mainType == 11)
1745 subType = 9;
1746 return subType;
1747}
1748
1749//===---------------------------------------------------------------------===//
Chris Lattner66636702010-03-10 21:42:42 +00001750
1751The following test case (from PR6576):
1752
1753define i32 @mul(i32 %a, i32 %b) nounwind readnone {
1754entry:
1755 %cond1 = icmp eq i32 %b, 0 ; <i1> [#uses=1]
1756 br i1 %cond1, label %exit, label %bb.nph
1757bb.nph: ; preds = %entry
1758 %tmp = mul i32 %b, %a ; <i32> [#uses=1]
1759 ret i32 %tmp
1760exit: ; preds = %entry
1761 ret i32 0
1762}
1763
1764could be reduced to:
1765
1766define i32 @mul(i32 %a, i32 %b) nounwind readnone {
1767entry:
1768 %tmp = mul i32 %b, %a
1769 ret i32 %tmp
1770}
1771
1772//===---------------------------------------------------------------------===//
1773
Chris Lattner94846892010-04-16 23:52:30 +00001774We should use DSE + llvm.lifetime.end to delete dead vtable pointer updates.
1775See GCC PR34949
1776
Chris Lattnerc2685a92010-05-21 23:16:21 +00001777Another interesting case is that something related could be used for variables
1778that go const after their ctor has finished. In these cases, globalopt (which
1779can statically run the constructor) could mark the global const (so it gets put
1780in the readonly section). A testcase would be:
1781
1782#include <complex>
1783using namespace std;
1784const complex<char> should_be_in_rodata (42,-42);
1785complex<char> should_be_in_data (42,-42);
1786complex<char> should_be_in_bss;
1787
1788Here we currently evaluate the ctors, but the globals don't become const because
1789the optimizer doesn't know they "become const" after the ctor is done. See
1790GCC PR4131 for more examples.
1791
Chris Lattner94846892010-04-16 23:52:30 +00001792//===---------------------------------------------------------------------===//
1793
Dan Gohman3a2a4842010-05-03 14:31:00 +00001794In this code:
1795
1796long foo(long x) {
1797 return x > 1 ? x : 1;
1798}
1799
1800LLVM emits a comparison with 1 instead of 0. 0 would be equivalent
1801and cheaper on most targets.
1802
1803LLVM prefers comparisons with zero over non-zero in general, but in this
1804case it chooses instead to keep the max operation obvious.
1805
1806//===---------------------------------------------------------------------===//
Eli Friedman8c47d3b2010-06-12 05:54:27 +00001807
1808Take the following testcase on x86-64 (similar testcases exist for all targets
1809with addc/adde):
1810
1811define void @a(i64* nocapture %s, i64* nocapture %t, i64 %a, i64 %b,
1812i64 %c) nounwind {
1813entry:
1814 %0 = zext i64 %a to i128 ; <i128> [#uses=1]
1815 %1 = zext i64 %b to i128 ; <i128> [#uses=1]
1816 %2 = add i128 %1, %0 ; <i128> [#uses=2]
1817 %3 = zext i64 %c to i128 ; <i128> [#uses=1]
1818 %4 = shl i128 %3, 64 ; <i128> [#uses=1]
1819 %5 = add i128 %4, %2 ; <i128> [#uses=1]
1820 %6 = lshr i128 %5, 64 ; <i128> [#uses=1]
1821 %7 = trunc i128 %6 to i64 ; <i64> [#uses=1]
1822 store i64 %7, i64* %s, align 8
1823 %8 = trunc i128 %2 to i64 ; <i64> [#uses=1]
1824 store i64 %8, i64* %t, align 8
1825 ret void
1826}
1827
1828Generated code:
1829 addq %rcx, %rdx
1830 movl $0, %eax
1831 adcq $0, %rax
1832 addq %r8, %rax
1833 movq %rax, (%rdi)
1834 movq %rdx, (%rsi)
1835 ret
1836
1837Expected code:
1838 addq %rcx, %rdx
1839 adcq $0, %r8
1840 movq %r8, (%rdi)
1841 movq %rdx, (%rsi)
1842 ret
1843
1844The generated SelectionDAG has an ADD of an ADDE, where both operands of the
1845ADDE are zero. Replacing one of the operands of the ADDE with the other operand
1846of the ADD, and replacing the ADD with the ADDE, should give the desired result.
1847
1848(That said, we are doing a lot better than gcc on this testcase. :) )
1849
1850//===---------------------------------------------------------------------===//
Eli Friedmanb4a74c12010-07-03 07:38:12 +00001851
1852Switch lowering generates less than ideal code for the following switch:
1853define void @a(i32 %x) nounwind {
1854entry:
1855 switch i32 %x, label %if.end [
1856 i32 0, label %if.then
1857 i32 1, label %if.then
1858 i32 2, label %if.then
1859 i32 3, label %if.then
1860 i32 5, label %if.then
1861 ]
1862if.then:
1863 tail call void @foo() nounwind
1864 ret void
1865if.end:
1866 ret void
1867}
1868declare void @foo()
1869
1870Generated code on x86-64 (other platforms give similar results):
1871a:
1872 cmpl $5, %edi
1873 ja .LBB0_2
1874 movl %edi, %eax
1875 movl $47, %ecx
1876 btq %rax, %rcx
1877 jb .LBB0_3
1878.LBB0_2:
1879 ret
1880.LBB0_3:
Eli Friedmanb4828292010-07-03 08:43:32 +00001881 jmp foo # TAILCALL
Eli Friedmanb4a74c12010-07-03 07:38:12 +00001882
1883The movl+movl+btq+jb could be simplified to a cmpl+jne.
1884
Eli Friedmanb4828292010-07-03 08:43:32 +00001885Or, if we wanted to be really clever, we could simplify the whole thing to
1886something like the following, which eliminates a branch:
1887 xorl $1, %edi
1888 cmpl $4, %edi
1889 ja .LBB0_2
1890 ret
1891.LBB0_2:
1892 jmp foo # TAILCALL
Nick Lewyckyb1e4eeb2010-08-08 07:04:25 +00001893//===---------------------------------------------------------------------===//
1894Given a branch where the two target blocks are identical ("ret i32 %b" in
1895both), simplifycfg will simplify them away. But not so for a switch statement:
Eli Friedmanb4828292010-07-03 08:43:32 +00001896
Nick Lewyckyb1e4eeb2010-08-08 07:04:25 +00001897define i32 @f(i32 %a, i32 %b) nounwind readnone {
1898entry:
1899 switch i32 %a, label %bb3 [
1900 i32 4, label %bb
1901 i32 6, label %bb
1902 ]
1903
1904bb: ; preds = %entry, %entry
1905 ret i32 %b
1906
1907bb3: ; preds = %entry
1908 ret i32 %b
1909}
Eli Friedmanb4a74c12010-07-03 07:38:12 +00001910//===---------------------------------------------------------------------===//
Chris Lattner274191f2010-11-09 19:37:28 +00001911
1912clang -O3 fails to devirtualize this virtual inheritance case (GCC PR45875);
Chris Lattner1e68fdb2010-11-11 17:17:56 +00001913it looks related to PR3100:
Chris Lattner274191f2010-11-09 19:37:28 +00001914
1915struct c1 {};
1916struct c10 : c1{
1917 virtual void foo ();
1918};
1919struct c11 : c10, c1{
1920 virtual void f6 ();
1921};
1922struct c28 : virtual c11{
1923 void f6 ();
1924};
1925void check_c28 () {
1926 c28 obj;
1927 c11 *ptr = &obj;
1928 ptr->f6 ();
1929}
1930
1931//===---------------------------------------------------------------------===//
Chris Lattneraf510f12010-11-11 18:23:57 +00001932
1933We compile this:
1934
1935int foo(int a) { return (a & (~15)) / 16; }
1936
1937Into:
1938
1939define i32 @foo(i32 %a) nounwind readnone ssp {
1940entry:
1941 %and = and i32 %a, -16
1942 %div = sdiv i32 %and, 16
1943 ret i32 %div
1944}
1945
1946but this pattern, (X & -A)/A, is X >> log2(A) when A is a power of 2, so this case
1947should be instcombined into just "a >> 4".
1948
1949We do get this at the codegen level, so something knows about it, but
1950instcombine should catch it earlier:
1951
1952_foo: ## @foo
1953## BB#0: ## %entry
1954 movl %edi, %eax
1955 sarl $4, %eax
1956 ret
1957
1958//===---------------------------------------------------------------------===//
1959
Chris Lattnera97c91f2010-12-13 00:15:25 +00001960This code (from GCC PR28685):
1961
1962int test(int a, int b) {
1963 int lt = a < b;
1964 int eq = a == b;
1965 if (lt)
1966 return 1;
1967 return eq;
1968}
1969
1970Is compiled to:
1971
1972define i32 @test(i32 %a, i32 %b) nounwind readnone ssp {
1973entry:
1974 %cmp = icmp slt i32 %a, %b
1975 br i1 %cmp, label %return, label %if.end
1976
1977if.end: ; preds = %entry
1978 %cmp5 = icmp eq i32 %a, %b
1979 %conv6 = zext i1 %cmp5 to i32
1980 ret i32 %conv6
1981
1982return: ; preds = %entry
1983 ret i32 1
1984}
1985
1986it could be:
1987
1988define i32 @test__(i32 %a, i32 %b) nounwind readnone ssp {
1989entry:
1990 %0 = icmp sle i32 %a, %b
1991 %retval = zext i1 %0 to i32
1992 ret i32 %retval
1993}
1994
1995//===---------------------------------------------------------------------===//
Duncan Sands124708d2011-01-01 20:08:02 +00001996
Benjamin Kramereaff66a2011-01-07 20:42:20 +00001997This code can be seen in viterbi:
1998
1999 %64 = call noalias i8* @malloc(i64 %62) nounwind
2000...
2001 %67 = call i64 @llvm.objectsize.i64(i8* %64, i1 false) nounwind
2002 %68 = call i8* @__memset_chk(i8* %64, i32 0, i64 %62, i64 %67) nounwind
2003
2004llvm.objectsize.i64 should be taught about malloc/calloc, allowing it to
2005fold to %62. This is a security win (overflows of malloc will get caught)
2006and also a performance win by exposing more memsets to the optimizer.
2007
2008This occurs several times in viterbi.
2009
2010Note that this would change the semantics of @llvm.objectsize which by its
2011current definition always folds to a constant. We also should make sure that
2012we remove checking in code like
2013
2014 char *p = malloc(strlen(s)+1);
2015 __strcpy_chk(p, s, __builtin_objectsize(p, 0));
2016
2017//===---------------------------------------------------------------------===//
2018
Chris Lattnerc1853e42011-01-06 07:09:23 +00002019This code (from Benchmarks/Dhrystone/dry.c):
2020
2021define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
2022entry:
2023 %sext = shl i32 %0, 24
2024 %conv = ashr i32 %sext, 24
2025 %sext6 = shl i32 %1, 24
2026 %conv4 = ashr i32 %sext6, 24
2027 %cmp = icmp eq i32 %conv, %conv4
2028 %. = select i1 %cmp, i32 10000, i32 0
2029 ret i32 %.
2030}
2031
2032Should be simplified into something like:
2033
2034define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
2035entry:
2036 %sext = shl i32 %0, 24
2037 %conv = and i32 %sext, 0xFF000000
2038 %sext6 = shl i32 %1, 24
2039 %conv4 = and i32 %sext6, 0xFF000000
2040 %cmp = icmp eq i32 %conv, %conv4
2041 %. = select i1 %cmp, i32 10000, i32 0
2042 ret i32 %.
2043}
2044
2045and then to:
2046
2047define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
2048entry:
2049 %conv = and i32 %0, 0xFF
2050 %conv4 = and i32 %1, 0xFF
2051 %cmp = icmp eq i32 %conv, %conv4
2052 %. = select i1 %cmp, i32 10000, i32 0
2053 ret i32 %.
2054}
2055//===---------------------------------------------------------------------===//
Chris Lattner15df0442011-01-01 22:57:31 +00002056
Benjamin Kramerfa366802011-01-06 17:35:50 +00002057clang -O3 currently compiles this code
2058
2059int g(unsigned int a) {
2060 unsigned int c[100];
2061 c[10] = a;
2062 c[11] = a;
2063 unsigned int b = c[10] + c[11];
2064 if(b > a*2) a = 4;
2065 else a = 8;
2066 return a + 7;
2067}
2068
2069into
2070
2071define i32 @g(i32 a) nounwind readnone {
2072 %add = shl i32 %a, 1
2073 %mul = shl i32 %a, 1
2074 %cmp = icmp ugt i32 %add, %mul
2075 %a.addr.0 = select i1 %cmp, i32 11, i32 15
2076 ret i32 %a.addr.0
2077}
2078
2079The icmp should fold to false. This CSE opportunity is only available
2080after GVN and InstCombine have run.
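
Once the icmp is known false, the select and the shifts fold away and the
whole function should reduce to (sketch; %a becomes unused):

define i32 @g(i32 %a) nounwind readnone {
  ret i32 15
}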
2081
2082//===---------------------------------------------------------------------===//
Chris Lattner01cdc202011-01-06 22:25:00 +00002083
2084memcpyopt should turn this:
2085
2086define i8* @test10(i32 %x) {
2087 %alloc = call noalias i8* @malloc(i32 %x) nounwind
2088 call void @llvm.memset.p0i8.i32(i8* %alloc, i8 0, i32 %x, i32 1, i1 false)
2089 ret i8* %alloc
2090}
2091
2092into a call to calloc. We should make sure that we analyze calloc as
2093aggressively as malloc though.
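
i.e. roughly (sketch; the exact calloc prototype depends on the target's
size_t):

define i8* @test10(i32 %x) {
  %alloc = call noalias i8* @calloc(i32 1, i32 %x) nounwind
  ret i8* %alloc
}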
2094
2095//===---------------------------------------------------------------------===//
Chandler Carruth75fbd372011-01-09 01:32:55 +00002096
Chris Lattner4a6fb942011-01-10 21:01:17 +00002097clang -O3 doesn't optimize this:
Chandler Carruth75fbd372011-01-09 01:32:55 +00002098
2099void f1(int* begin, int* end) {
2100 std::fill(begin, end, 0);
2101}
2102
Chris Lattner66d7a572011-01-09 23:48:41 +00002103into a memset. This is PR8942.
Chandler Carruth75fbd372011-01-09 01:32:55 +00002104
2105//===---------------------------------------------------------------------===//
Chandler Carruthd8723a92011-01-09 09:58:33 +00002106
2107clang -O3 -fno-exceptions currently compiles this code:
2108
2109void f(int N) {
2110 std::vector<int> v(N);
Chandler Carruth27a2a132011-01-09 10:10:59 +00002111
2112 extern void sink(void*); sink(&v);
Chandler Carruthd8723a92011-01-09 09:58:33 +00002113}
2114
2115into
2116
2117define void @_Z1fi(i32 %N) nounwind {
2118entry:
2119 %v2 = alloca [3 x i32*], align 8
2120 %v2.sub = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 0
2121 %tmpcast = bitcast [3 x i32*]* %v2 to %"class.std::vector"*
2122 %conv = sext i32 %N to i64
2123 store i32* null, i32** %v2.sub, align 8, !tbaa !0
2124 %tmp3.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 1
2125 store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
2126 %tmp4.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 2
2127 store i32* null, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
2128 %cmp.i.i.i.i = icmp eq i32 %N, 0
2129 br i1 %cmp.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i, label %cond.true.i.i.i.i
2130
2131_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i: ; preds = %entry
2132 store i32* null, i32** %v2.sub, align 8, !tbaa !0
2133 store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
2134 %add.ptr.i5.i.i = getelementptr inbounds i32* null, i64 %conv
2135 store i32* %add.ptr.i5.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
2136 br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit
2137
2138cond.true.i.i.i.i: ; preds = %entry
2139 %cmp.i.i.i.i.i = icmp slt i32 %N, 0
2140 br i1 %cmp.i.i.i.i.i, label %if.then.i.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i
2141
2142if.then.i.i.i.i.i: ; preds = %cond.true.i.i.i.i
2143 call void @_ZSt17__throw_bad_allocv() noreturn nounwind
2144 unreachable
2145
2146_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i: ; preds = %cond.true.i.i.i.i
2147 %mul.i.i.i.i.i = shl i64 %conv, 2
2148 %call3.i.i.i.i.i = call noalias i8* @_Znwm(i64 %mul.i.i.i.i.i) nounwind
2149 %0 = bitcast i8* %call3.i.i.i.i.i to i32*
2150 store i32* %0, i32** %v2.sub, align 8, !tbaa !0
2151 store i32* %0, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
2152 %add.ptr.i.i.i = getelementptr inbounds i32* %0, i64 %conv
2153 store i32* %add.ptr.i.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
2154 call void @llvm.memset.p0i8.i64(i8* %call3.i.i.i.i.i, i8 0, i64 %mul.i.i.i.i.i, i32 4, i1 false)
2155 br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit
2156
2157This is just the code handling the construction of the vector. Most surprising here
2158is the fact that all three null stores in %entry are dead, but not eliminated.
2159Also surprising is that %conv isn't simplified to 0 in %....exit.thread.i.i.
2160
2161//===---------------------------------------------------------------------===//
Chandler Carruthe5ca4942011-01-09 09:58:36 +00002162
2163clang -O3 -fno-exceptions currently compiles this code:
2164
2165void f(int N) {
2166 std::vector<int> v(N);
Chandler Carruth27a2a132011-01-09 10:10:59 +00002167 for (int k = 0; k < N; ++k)
2168 v[k] = 0;
2169
2170 extern void sink(void*); sink(&v);
Chandler Carruthe5ca4942011-01-09 09:58:36 +00002171}
2172
2173into almost the same as the previous note, but replace its final BB with:
2174
2175for.body.lr.ph: ; preds = %cond.true.i.i.i.i
2176 %mul.i.i.i.i.i = shl i64 %conv, 2
2177 %call3.i.i.i.i.i = call noalias i8* @_Znwm(i64 %mul.i.i.i.i.i) nounwind
2178 %0 = bitcast i8* %call3.i.i.i.i.i to i32*
2179 store i32* %0, i32** %v8.sub, align 8, !tbaa !0
2180 %add.ptr.i.i.i = getelementptr inbounds i32* %0, i64 %conv
2181 store i32* %add.ptr.i.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
2182 call void @llvm.memset.p0i8.i64(i8* %call3.i.i.i.i.i, i8 0, i64 %mul.i.i.i.i.i, i32 4, i1 false)
2183 store i32* %add.ptr.i.i.i, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
2184 %tmp18 = add i32 %N, -1
2185 %tmp19 = zext i32 %tmp18 to i64
2186 %tmp20 = shl i64 %tmp19, 2
2187 %tmp21 = add i64 %tmp20, 4
2188 call void @llvm.memset.p0i8.i64(i8* %call3.i.i.i.i.i, i8 0, i64 %tmp21, i32 4, i1 false)
2189 br label %for.end
2190
2191First off, why (((zext %N - 1) << 2) + 4) instead of the ((sext %N) << 2) done
2192previously? (or better yet, re-use that one?)
2193
2194Then, the really painful one is the second memset, of the same memory, to the
2195same value.
2196
2197//===---------------------------------------------------------------------===//
Chandler Carruth694d7532011-01-09 11:29:57 +00002198
2199clang -O3 -fno-exceptions currently compiles this code:
2200
2201struct S {
2202 unsigned short m1, m2;
2203 unsigned char m3, m4;
2204};
2205
2206void f(int N) {
2207 std::vector<S> v(N);
2208 extern void sink(void*); sink(&v);
2209}
2210
2211into poor code for zero-initializing 'v' when N is >0. The problem is that
2212S is only 6 bytes, but each element is 8 byte-aligned. We generate a loop and
22134 stores on each iteration. If the struct were 8 bytes, this gets turned into
2214a memset.
2215
2216//===---------------------------------------------------------------------===//
Chandler Carruth96b1b6c2011-01-09 21:00:19 +00002217
2218clang -O3 currently compiles this code:
2219
2220extern const int magic;
2221double f() { return 0.0 * magic; }
2222
2223into
2224
2225@magic = external constant i32
2226
2227define double @_Z1fv() nounwind readnone {
2228entry:
2229 %tmp = load i32* @magic, align 4, !tbaa !0
2230 %conv = sitofp i32 %tmp to double
2231 %mul = fmul double %conv, 0.000000e+00
2232 ret double %mul
2233}
2234
Chris Lattner00a35d02011-01-10 00:33:01 +00002235We should be able to fold away this fmul to 0.0. More generally, fmul(x,0.0)
2236can be folded to 0.0 if we can prove that the LHS is not -0.0, not a NaN, and
2237not an INF. The CannotBeNegativeZero predicate in value tracking should be
2238extended to support general "fpclassify" operations that can return
2239yes/no/unknown for each of these predicates.
2240
Chris Lattner4a6fb942011-01-10 21:01:17 +00002241In this predicate, we know that uitofp is trivially never NaN or -0.0, and
Chris Lattner00a35d02011-01-10 00:33:01 +00002242we know that it isn't +/-Inf if the floating point type has enough exponent bits
2243to represent the largest integer value as < inf.
Chandler Carruth96b1b6c2011-01-09 21:00:19 +00002244
2245//===---------------------------------------------------------------------===//
Chandler Carruthfb00e272011-01-09 22:36:18 +00002246
Chris Lattner4a6fb942011-01-10 21:01:17 +00002247When optimizing a transformation that can change the sign of 0.0 (such as the
22480.0*val -> 0.0 transformation above), it might be provable that the sign of the
2249expression doesn't matter. For example, by the above rules, we can't transform
2250fmul(sitofp(x), 0.0) into 0.0, because x might be -1 and the result of the
2251expression is defined to be -0.0.
2252
2253If we look at the uses of the fmul for example, we might be able to prove that
2254all uses don't care about the sign of zero. For example, if we have:
2255
2256 fadd(fmul(sitofp(x), 0.0), 2.0)
2257
2258Since we know that x+2.0 doesn't care about the sign of any zeros in X, we can
2259transform the fmul to 0.0, and then the fadd to 2.0.
2260
2261//===---------------------------------------------------------------------===//