blob: 45da3ddb607a1532a3b24faeed0549895271b308 [file] [log] [blame]
Chris Lattner086c0142006-02-03 06:21:43 +00001Target Independent Opportunities:
2
Chris Lattnerf308ea02006-09-28 06:01:17 +00003//===---------------------------------------------------------------------===//
4
Chris Lattner9b62b452006-11-14 01:57:53 +00005With the recent changes to make the implicit def/use set explicit in
6machineinstrs, we should change the target descriptions for 'call' instructions
7so that the .td files don't list all the call-clobbered registers as implicit
8defs. Instead, these should be added by the code generator (e.g. on the dag).
9
10This has a number of uses:
11
121. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
13 for their different impdef sets.
142. Targets with multiple calling convs (e.g. x86) which have different clobber
15 sets don't need copies of call instructions.
163. 'Interprocedural register allocation' can be done to reduce the clobber sets
17 of calls.
18
19//===---------------------------------------------------------------------===//
20
Chris Lattner08859ff2010-12-15 07:25:55 +000021We should recognized various "overflow detection" idioms and translate them into
Chris Lattnere5cbdca2010-12-19 19:37:52 +000022llvm.uadd.with.overflow and similar intrinsics. Here is a multiply idiom:
Chris Lattner94481842010-12-15 07:28:58 +000023
24unsigned int mul(unsigned int a,unsigned int b) {
25 if ((unsigned long long)a*b>0xffffffff)
26 exit(0);
27 return a*b;
28}
29
Chris Lattner527b47d2011-01-02 18:31:38 +000030The legalization code for mul-with-overflow needs to be made more robust before
31this can be implemented though.
32
Nate Begeman81e80972006-03-17 01:40:33 +000033//===---------------------------------------------------------------------===//
Chris Lattner086c0142006-02-03 06:21:43 +000034
35Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
Chris Lattner2dae65d2008-12-10 01:30:48 +000036precision don't matter (ffastmath). Misc/mandel will like this. :) This isn't
37safe in general, even on darwin. See the libm implementation of hypot for
38examples (which special case when x/y are exactly zero to get signed zeros etc
39right).
Chris Lattner086c0142006-02-03 06:21:43 +000040
Chris Lattner086c0142006-02-03 06:21:43 +000041//===---------------------------------------------------------------------===//
42
Chris Lattnerb27b69f2006-03-04 01:19:34 +000043On targets with expensive 64-bit multiply, we could LSR this:
44
45for (i = ...; ++i) {
46 x = 1ULL << i;
47
48into:
49 long long tmp = 1;
50 for (i = ...; ++i, tmp+=tmp)
51 x = tmp;
52
53This would be a win on ppc32, but not x86 or ppc64.
54
Chris Lattnerad019932006-03-04 08:44:51 +000055//===---------------------------------------------------------------------===//
Chris Lattner5b0fe7d2006-03-05 20:00:08 +000056
57Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)
58
59//===---------------------------------------------------------------------===//
Chris Lattner549f27d22006-03-07 02:46:26 +000060
Chris Lattner398ffba2010-01-01 01:29:26 +000061Reassociate should turn things like:
62
63int factorial(int X) {
64 return X*X*X*X*X*X*X*X;
65}
66
67into llvm.powi calls, allowing the code generator to produce balanced
68multiplication trees.
69
70First, the intrinsic needs to be extended to support integers, and second the
71code generator needs to be enhanced to lower these to multiplication trees.
Chris Lattnerc20995e2006-03-11 20:17:08 +000072
73//===---------------------------------------------------------------------===//
74
Chris Lattner74cfb7d2006-03-11 20:20:40 +000075Interesting? testcase for add/shift/mul reassoc:
76
77int bar(int x, int y) {
78 return x*x*x+y+x*x*x*x*x*y*y*y*y;
79}
80int foo(int z, int n) {
81 return bar(z, n) + bar(2*z, 2*n);
82}
83
Chris Lattner398ffba2010-01-01 01:29:26 +000084This is blocked on not handling X*X*X -> powi(X, 3) (see note above). The issue
85is that we end up getting t = 2*X s = t*t and don't turn this into 4*X*X,
86which is the same number of multiplies and is canonical, because the 2*X has
87multiple uses. Here's a simple example:
88
89define i32 @test15(i32 %X1) {
90 %B = mul i32 %X1, 47 ; X1*47
91 %C = mul i32 %B, %B
92 ret i32 %C
93}
94
95
96//===---------------------------------------------------------------------===//
97
98Reassociate should handle the example in GCC PR16157:
99
100extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;
101void f () { /* this can be optimized to four additions... */
102 b4 = a4 + a3 + a2 + a1 + a0;
103 b3 = a3 + a2 + a1 + a0;
104 b2 = a2 + a1 + a0;
105 b1 = a1 + a0;
106}
107
108This requires reassociating to forms of expressions that are already available,
109something that reassoc doesn't think about yet.
Chris Lattner5e14b0d2007-05-05 22:29:06 +0000110
Chris Lattner10c42452010-01-24 20:01:41 +0000111
112//===---------------------------------------------------------------------===//
113
114This function: (derived from GCC PR19988)
115double foo(double x, double y) {
116 return ((x + 0.1234 * y) * (x + -0.1234 * y));
117}
118
119compiles to:
120_foo:
121 movapd %xmm1, %xmm2
122 mulsd LCPI1_1(%rip), %xmm1
123 mulsd LCPI1_0(%rip), %xmm2
124 addsd %xmm0, %xmm1
125 addsd %xmm0, %xmm2
126 movapd %xmm1, %xmm0
127 mulsd %xmm2, %xmm0
128 ret
129
Chris Lattner43dc2e62010-01-24 20:17:09 +0000130Reassociate should be able to turn it into:
Chris Lattner10c42452010-01-24 20:01:41 +0000131
132double foo(double x, double y) {
133 return ((x + 0.1234 * y) * (x - 0.1234 * y));
134}
135
136Which allows the multiply by constant to be CSE'd, producing:
137
138_foo:
139 mulsd LCPI1_0(%rip), %xmm1
140 movapd %xmm1, %xmm2
141 addsd %xmm0, %xmm2
142 subsd %xmm1, %xmm0
143 mulsd %xmm2, %xmm0
144 ret
145
146This doesn't need -ffast-math support at all. This is particularly bad because
147the llvm-gcc frontend is canonicalizing the later into the former, but clang
148doesn't have this problem.
149
Chris Lattner74cfb7d2006-03-11 20:20:40 +0000150//===---------------------------------------------------------------------===//
151
Chris Lattner82c78b22006-03-09 20:13:21 +0000152These two functions should generate the same code on big-endian systems:
153
154int g(int *j,int *l) { return memcmp(j,l,4); }
155int h(int *j, int *l) { return *j - *l; }
156
157this could be done in SelectionDAGISel.cpp, along with other special cases,
158for 1,2,4,8 bytes.
159
160//===---------------------------------------------------------------------===//
161
Chris Lattnerc04b4232006-03-22 07:33:46 +0000162It would be nice to revert this patch:
163http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html
164
165And teach the dag combiner enough to simplify the code expanded before
166legalize. It seems plausible that this knowledge would let it simplify other
167stuff too.
168
Chris Lattnere6cd96d2006-03-24 19:59:17 +0000169//===---------------------------------------------------------------------===//
170
Reid Spencerac9dcb92007-02-15 03:39:18 +0000171For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
Evan Cheng67d3d4c2006-03-31 22:35:14 +0000172to the type size. It works but can be overly conservative as the alignment of
Reid Spencerac9dcb92007-02-15 03:39:18 +0000173specific vector types are target dependent.
Chris Lattnereaa7c062006-04-01 04:08:29 +0000174
175//===---------------------------------------------------------------------===//
176
Dan Gohman1f3be1a2009-05-11 18:51:16 +0000177We should produce an unaligned load from code like this:
Chris Lattnereaa7c062006-04-01 04:08:29 +0000178
179v4sf example(float *P) {
180 return (v4sf){P[0], P[1], P[2], P[3] };
181}
182
183//===---------------------------------------------------------------------===//
184
Chris Lattner16abfdf2006-05-18 18:26:13 +0000185Add support for conditional increments, and other related patterns. Instead
186of:
187
188 movl 136(%esp), %eax
189 cmpl $0, %eax
190 je LBB16_2 #cond_next
191LBB16_1: #cond_true
192 incl _foo
193LBB16_2: #cond_next
194
195emit:
196 movl _foo, %eax
197 cmpl $1, %edi
198 sbbl $-1, %eax
199 movl %eax, _foo
200
201//===---------------------------------------------------------------------===//
Chris Lattner870cf1b2006-05-19 20:45:08 +0000202
203Combine: a = sin(x), b = cos(x) into a,b = sincos(x).
204
205Expand these to calls of sin/cos and stores:
206 double sincos(double x, double *sin, double *cos);
207 float sincosf(float x, float *sin, float *cos);
208 long double sincosl(long double x, long double *sin, long double *cos);
209
210Doing so could allow SROA of the destination pointers. See also:
211http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687
212
Chris Lattner2dae65d2008-12-10 01:30:48 +0000213This is now easily doable with MRVs. We could even make an intrinsic for this
214if anyone cared enough about sincos.
215
Chris Lattner870cf1b2006-05-19 20:45:08 +0000216//===---------------------------------------------------------------------===//
Chris Lattnerf00f68a2006-05-19 21:01:38 +0000217
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000218quantum_sigma_x in 462.libquantum contains the following loop:
219
220 for(i=0; i<reg->size; i++)
221 {
222 /* Flip the target bit of each basis state */
223 reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
224 }
225
226Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be just
227so cool to turn it into something like:
228
Chris Lattnerb33a42a2006-09-18 04:54:35 +0000229 long long Res = ((MAX_UNSIGNED) 1 << target);
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000230 if (target < 32) {
231 for(i=0; i<reg->size; i++)
Chris Lattnerb33a42a2006-09-18 04:54:35 +0000232 reg->node[i].state ^= Res & 0xFFFFFFFFULL;
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000233 } else {
234 for(i=0; i<reg->size; i++)
Chris Lattnerb33a42a2006-09-18 04:54:35 +0000235 reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000236 }
237
238... which would only do one 32-bit XOR per loop iteration instead of two.
239
240It would also be nice to recognize the reg->size doesn't alias reg->node[i], but
Chris Lattner9c6a0dc2009-11-26 01:51:18 +0000241this requires TBAA.
Chris Lattnerfaa6adf2009-09-21 06:04:07 +0000242
243//===---------------------------------------------------------------------===//
244
Chris Lattnerb1ac7692008-10-05 02:16:12 +0000245This isn't recognized as bswap by instcombine (yes, it really is bswap):
Chris Lattnerf9bae432006-12-08 02:01:32 +0000246
247unsigned long reverse(unsigned v) {
248 unsigned t;
249 t = v ^ ((v << 16) | (v >> 16));
250 t &= ~0xff0000;
251 v = (v << 24) | (v >> 8);
252 return v ^ (t >> 8);
253}
254
Chris Lattnerfb981f32006-09-25 17:12:14 +0000255//===---------------------------------------------------------------------===//
256
Chris Lattner19310fc2011-02-21 02:13:39 +0000257[LOOP DELETION]
258
259We don't delete this output free loop, because trip count analysis doesn't
260realize that it is finite (if it were infinite, it would be undefined). Not
261having this blocks Loop Idiom from matching strlen and friends.
262
263void foo(char *C) {
264 int x = 0;
265 while (*C)
266 ++x,++C;
267}
268
269//===---------------------------------------------------------------------===//
270
Chris Lattner818ff342010-01-23 18:49:30 +0000271[LOOP RECOGNITION]
272
Chris Lattnerf4fee2a2008-10-15 16:02:15 +0000273These idioms should be recognized as popcount (see PR1488):
274
275unsigned countbits_slow(unsigned v) {
276 unsigned c;
277 for (c = 0; v; v >>= 1)
278 c += v & 1;
279 return c;
280}
281unsigned countbits_fast(unsigned v){
282 unsigned c;
283 for (c = 0; v; c++)
284 v &= v - 1; // clear the least significant bit set
285 return c;
286}
287
288BITBOARD = unsigned long long
289int PopCnt(register BITBOARD a) {
290 register int c=0;
291 while(a) {
292 c++;
293 a &= a - 1;
294 }
295 return c;
296}
297unsigned int popcount(unsigned int input) {
298 unsigned int count = 0;
299 for (unsigned int i = 0; i < 4 * 8; i++)
300 count += (input >> i) & i;
301 return count;
302}
303
Chris Lattner477a9882011-02-21 01:33:38 +0000304This should be recognized as CLZ: rdar://8459039
305
306unsigned clz_a(unsigned a) {
307 int i;
308 for (i=0;i<32;i++)
309 if (a & (1<<(31-i)))
310 return i;
311 return 32;
312}
313
Chris Lattner527b47d2011-01-02 18:31:38 +0000314This sort of thing should be added to the loop idiom pass.
Chris Lattner9c6a0dc2009-11-26 01:51:18 +0000315
Chris Lattnerf4fee2a2008-10-15 16:02:15 +0000316//===---------------------------------------------------------------------===//
317
Chris Lattnerfb981f32006-09-25 17:12:14 +0000318These should turn into single 16-bit (unaligned?) loads on little/big endian
319processors.
320
321unsigned short read_16_le(const unsigned char *adr) {
322 return adr[0] | (adr[1] << 8);
323}
324unsigned short read_16_be(const unsigned char *adr) {
325 return (adr[0] << 8) | adr[1];
326}
327
328//===---------------------------------------------------------------------===//
Chris Lattnercf103912006-10-24 16:12:47 +0000329
Reid Spencer1628cec2006-10-26 06:15:43 +0000330-instcombine should handle this transform:
Reid Spencere4d87aa2006-12-23 06:05:41 +0000331 icmp pred (sdiv X / C1 ), C2
Reid Spencer1628cec2006-10-26 06:15:43 +0000332when X, C1, and C2 are unsigned. Similarly for udiv and signed operands.
333
334Currently InstCombine avoids this transform but will do it when the signs of
335the operands and the sign of the divide match. See the FIXME in
336InstructionCombining.cpp in the visitSetCondInst method after the switch case
337for Instruction::UDiv (around line 4447) for more details.
338
339The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
340this construct.
Chris Lattnerd7c628d2006-11-03 22:27:39 +0000341
342//===---------------------------------------------------------------------===//
343
Chris Lattneraa306c22010-01-23 17:59:23 +0000344[LOOP OPTIMIZATION]
345
346SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization
347opportunities in its double_array_divs_variable function: it needs loop
348interchange, memory promotion (which LICM already does), vectorization and
349variable trip count loop unrolling (since it has a constant trip count). ICC
350apparently produces this very nice code with -ffast-math:
351
352..B1.70: # Preds ..B1.70 ..B1.69
353 mulpd %xmm0, %xmm1 #108.2
354 mulpd %xmm0, %xmm1 #108.2
355 mulpd %xmm0, %xmm1 #108.2
356 mulpd %xmm0, %xmm1 #108.2
357 addl $8, %edx #
358 cmpl $131072, %edx #108.2
359 jb ..B1.70 # Prob 99% #108.2
360
361It would be better to count down to zero, but this is a lot better than what we
362do.
363
364//===---------------------------------------------------------------------===//
365
Chris Lattner03a6d962007-01-16 06:39:48 +0000366Consider:
367
368typedef unsigned U32;
369typedef unsigned long long U64;
370int test (U32 *inst, U64 *regs) {
371 U64 effective_addr2;
372 U32 temp = *inst;
373 int r1 = (temp >> 20) & 0xf;
374 int b2 = (temp >> 16) & 0xf;
375 effective_addr2 = temp & 0xfff;
376 if (b2) effective_addr2 += regs[b2];
377 b2 = (temp >> 12) & 0xf;
378 if (b2) effective_addr2 += regs[b2];
379 effective_addr2 &= regs[4];
380 if ((effective_addr2 & 3) == 0)
381 return 1;
382 return 0;
383}
384
385Note that only the low 2 bits of effective_addr2 are used. On 32-bit systems,
386we don't eliminate the computation of the top half of effective_addr2 because
387we don't have whole-function selection dags. On x86, this means we use one
388extra register for the function when effective_addr2 is declared as U64 than
389when it is declared U32.
390
Chris Lattner17424982009-11-10 23:47:45 +0000391PHI Slicing could be extended to do this.
392
Chris Lattner03a6d962007-01-16 06:39:48 +0000393//===---------------------------------------------------------------------===//
394
Chris Lattner9c6a0dc2009-11-26 01:51:18 +0000395LSR should know what GPR types a target has from TargetData. This code:
Chris Lattner1a77a552007-03-24 06:01:32 +0000396
397volatile short X, Y; // globals
398
399void foo(int N) {
400 int i;
401 for (i = 0; i < N; i++) { X = i; Y = i*4; }
402}
403
Chris Lattnerc1491f32009-09-20 17:37:38 +0000404produces two near identical IV's (after promotion) on PPC/ARM:
Chris Lattner1a77a552007-03-24 06:01:32 +0000405
Chris Lattnerc1491f32009-09-20 17:37:38 +0000406LBB1_2:
407 ldr r3, LCPI1_0
408 ldr r3, [r3]
409 strh r2, [r3]
410 ldr r3, LCPI1_1
411 ldr r3, [r3]
412 strh r1, [r3]
413 add r1, r1, #4
414 add r2, r2, #1 <- [0,+,1]
415 sub r0, r0, #1 <- [0,-,1]
416 cmp r0, #0
417 bne LBB1_2
418
419LSR should reuse the "+" IV for the exit test.
Chris Lattner1a77a552007-03-24 06:01:32 +0000420
Chris Lattner1a77a552007-03-24 06:01:32 +0000421//===---------------------------------------------------------------------===//
422
Chris Lattner5e14b0d2007-05-05 22:29:06 +0000423Tail call elim should be more aggressive, checking to see if the call is
424followed by an uncond branch to an exit block.
425
426; This testcase is due to tail-duplication not wanting to copy the return
427; instruction into the terminating blocks because there was other code
428; optimized out of the function after the taildup happened.
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000429; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call
Chris Lattner5e14b0d2007-05-05 22:29:06 +0000430
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000431define i32 @t4(i32 %a) {
Chris Lattner5e14b0d2007-05-05 22:29:06 +0000432entry:
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000433 %tmp.1 = and i32 %a, 1 ; <i32> [#uses=1]
434 %tmp.2 = icmp ne i32 %tmp.1, 0 ; <i1> [#uses=1]
435 br i1 %tmp.2, label %then.0, label %else.0
Chris Lattner5e14b0d2007-05-05 22:29:06 +0000436
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000437then.0: ; preds = %entry
438 %tmp.5 = add i32 %a, -1 ; <i32> [#uses=1]
439 %tmp.3 = call i32 @t4( i32 %tmp.5 ) ; <i32> [#uses=1]
440 br label %return
Chris Lattner5e14b0d2007-05-05 22:29:06 +0000441
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000442else.0: ; preds = %entry
443 %tmp.7 = icmp ne i32 %a, 0 ; <i1> [#uses=1]
444 br i1 %tmp.7, label %then.1, label %return
Chris Lattner5e14b0d2007-05-05 22:29:06 +0000445
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000446then.1: ; preds = %else.0
447 %tmp.11 = add i32 %a, -2 ; <i32> [#uses=1]
448 %tmp.9 = call i32 @t4( i32 %tmp.11 ) ; <i32> [#uses=1]
449 br label %return
Chris Lattner5e14b0d2007-05-05 22:29:06 +0000450
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000451return: ; preds = %then.1, %else.0, %then.0
452 %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
Chris Lattner5e14b0d2007-05-05 22:29:06 +0000453 [ %tmp.9, %then.1 ]
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000454 ret i32 %result.0
Chris Lattner5e14b0d2007-05-05 22:29:06 +0000455}
Chris Lattnerf110a2b2007-05-05 22:44:08 +0000456
457//===---------------------------------------------------------------------===//
458
Chris Lattnerc90b8662008-08-10 00:47:21 +0000459Tail recursion elimination should handle:
460
461int pow2m1(int n) {
462 if (n == 0)
463 return 0;
464 return 2 * pow2m1 (n - 1) + 1;
465}
466
467Also, multiplies can be turned into SHL's, so they should be handled as if
468they were associative. "return foo() << 1" can be tail recursion eliminated.
469
470//===---------------------------------------------------------------------===//
471
Chris Lattnerf110a2b2007-05-05 22:44:08 +0000472Argument promotion should promote arguments for recursive functions, like
473this:
474
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000475; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val
Chris Lattnerf110a2b2007-05-05 22:44:08 +0000476
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000477define internal i32 @foo(i32* %x) {
Chris Lattnerf110a2b2007-05-05 22:44:08 +0000478entry:
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000479 %tmp = load i32* %x ; <i32> [#uses=0]
480 %tmp.foo = call i32 @foo( i32* %x ) ; <i32> [#uses=1]
481 ret i32 %tmp.foo
Chris Lattnerf110a2b2007-05-05 22:44:08 +0000482}
483
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000484define i32 @bar(i32* %x) {
Chris Lattnerf110a2b2007-05-05 22:44:08 +0000485entry:
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000486 %tmp3 = call i32 @foo( i32* %x ) ; <i32> [#uses=1]
487 ret i32 %tmp3
Chris Lattnerf110a2b2007-05-05 22:44:08 +0000488}
489
Chris Lattner81f2d712007-12-05 23:05:06 +0000490//===---------------------------------------------------------------------===//
Chris Lattner166a2682007-12-28 04:42:05 +0000491
Chris Lattnera1643ba2007-12-28 22:30:05 +0000492We should investigate an instruction sinking pass. Consider this silly
493example in pic mode:
494
495#include <assert.h>
496void foo(int x) {
497 assert(x);
498 //...
499}
500
501we compile this to:
502_foo:
503 subl $28, %esp
504 call "L1$pb"
505"L1$pb":
506 popl %eax
507 cmpl $0, 32(%esp)
508 je LBB1_2 # cond_true
509LBB1_1: # return
510 # ...
511 addl $28, %esp
512 ret
513LBB1_2: # cond_true
514...
515
516The PIC base computation (call+popl) is only used on one path through the
517code, but is currently always computed in the entry block. It would be
518better to sink the picbase computation down into the block for the
519assertion, as it is the only one that uses it. This happens for a lot of
520code with early outs.
521
Chris Lattner92c06a02007-12-29 01:05:01 +0000522Another example is loads of arguments, which are usually emitted into the
523entry block on targets like x86. If not used in all paths through a
524function, they should be sunk into the ones that do.
525
Chris Lattnera1643ba2007-12-28 22:30:05 +0000526In this case, whole-function-isel would also handle this.
Chris Lattner166a2682007-12-28 04:42:05 +0000527
528//===---------------------------------------------------------------------===//
Chris Lattnerb3041942008-01-07 21:38:14 +0000529
530Investigate lowering of sparse switch statements into perfect hash tables:
531http://burtleburtle.net/bob/hash/perfect.html
532
533//===---------------------------------------------------------------------===//
Chris Lattnerf61b63e2008-01-09 00:17:57 +0000534
535We should turn things like "load+fabs+store" and "load+fneg+store" into the
536corresponding integer operations. On a yonah, this loop:
537
538double a[256];
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000539void foo() {
540 int i, b;
541 for (b = 0; b < 10000000; b++)
542 for (i = 0; i < 256; i++)
543 a[i] = -a[i];
544}
Chris Lattnerf61b63e2008-01-09 00:17:57 +0000545
546is twice as slow as this loop:
547
548long long a[256];
Chris Lattner7c4e9a42008-02-18 18:46:39 +0000549void foo() {
550 int i, b;
551 for (b = 0; b < 10000000; b++)
552 for (i = 0; i < 256; i++)
553 a[i] ^= (1ULL << 63);
554}
Chris Lattnerf61b63e2008-01-09 00:17:57 +0000555
556and I suspect other processors are similar. On X86 in particular this is a
557big win because doing this with integers allows the use of read/modify/write
558instructions.
559
560//===---------------------------------------------------------------------===//
Chris Lattner83726012008-01-10 18:25:41 +0000561
562DAG Combiner should try to combine small loads into larger loads when
563profitable. For example, we compile this C++ example:
564
565struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
566extern THotKey m_HotKey;
567THotKey GetHotKey () { return m_HotKey; }
568
Chris Lattner527b47d2011-01-02 18:31:38 +0000569into (-m64 -O3 -fno-exceptions -static -fomit-frame-pointer):
Chris Lattner83726012008-01-10 18:25:41 +0000570
Chris Lattner527b47d2011-01-02 18:31:38 +0000571__Z9GetHotKeyv: ## @_Z9GetHotKeyv
572 movq _m_HotKey@GOTPCREL(%rip), %rax
573 movzwl (%rax), %ecx
574 movzbl 2(%rax), %edx
575 shlq $16, %rdx
576 orq %rcx, %rdx
577 movzbl 3(%rax), %ecx
578 shlq $24, %rcx
579 orq %rdx, %rcx
580 movzbl 4(%rax), %eax
581 shlq $32, %rax
582 orq %rcx, %rax
583 ret
Chris Lattner83726012008-01-10 18:25:41 +0000584
585//===---------------------------------------------------------------------===//
Chris Lattner497b7e92008-01-11 06:17:47 +0000586
Nate Begemane9fe65c2008-02-18 18:39:23 +0000587We should add an FRINT node to the DAG to model targets that have legal
588implementations of ceil/floor/rint.
Chris Lattner48840f82008-02-28 05:34:27 +0000589
590//===---------------------------------------------------------------------===//
591
592Consider:
593
594int test() {
Benjamin Kramer9d071cb2010-12-23 15:32:07 +0000595 long long input[8] = {1,0,1,0,1,0,1,0};
Chris Lattner48840f82008-02-28 05:34:27 +0000596 foo(input);
597}
598
Chris Lattner9c8fb9e2011-01-01 22:52:11 +0000599Clang compiles this into:
Chris Lattner48840f82008-02-28 05:34:27 +0000600
Chris Lattner9c8fb9e2011-01-01 22:52:11 +0000601 call void @llvm.memset.p0i8.i64(i8* %tmp, i8 0, i64 64, i32 16, i1 false)
602 %0 = getelementptr [8 x i64]* %input, i64 0, i64 0
603 store i64 1, i64* %0, align 16
604 %1 = getelementptr [8 x i64]* %input, i64 0, i64 2
605 store i64 1, i64* %1, align 16
606 %2 = getelementptr [8 x i64]* %input, i64 0, i64 4
607 store i64 1, i64* %2, align 16
608 %3 = getelementptr [8 x i64]* %input, i64 0, i64 6
609 store i64 1, i64* %3, align 16
Chris Lattner48840f82008-02-28 05:34:27 +0000610
Chris Lattner9c8fb9e2011-01-01 22:52:11 +0000611Which gets codegen'd into:
612
613 pxor %xmm0, %xmm0
614 movaps %xmm0, -16(%rbp)
615 movaps %xmm0, -32(%rbp)
616 movaps %xmm0, -48(%rbp)
617 movaps %xmm0, -64(%rbp)
618 movq $1, -64(%rbp)
619 movq $1, -48(%rbp)
620 movq $1, -32(%rbp)
621 movq $1, -16(%rbp)
622
623It would be better to have 4 movq's of 0 instead of the movaps's.
Chris Lattner48840f82008-02-28 05:34:27 +0000624
625//===---------------------------------------------------------------------===//
Chris Lattnera11deb02008-03-02 02:51:40 +0000626
627http://llvm.org/PR717:
628
629The following code should compile into "ret int undef". Instead, LLVM
630produces "ret int 0":
631
632int f() {
633 int x = 4;
634 int y;
635 if (x == 3) y = 0;
636 return y;
637}
638
639//===---------------------------------------------------------------------===//
Chris Lattner53b72772008-03-02 19:29:42 +0000640
641The loop unroller should partially unroll loops (instead of peeling them)
642when code growth isn't too bad and when an unroll count allows simplification
643of some code within the loop. One trivial example is:
644
645#include <stdio.h>
646int main() {
647 int nRet = 17;
648 int nLoop;
649 for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
650 if ( nLoop & 1 )
651 nRet += 2;
652 else
653 nRet -= 1;
654 }
655 return nRet;
656}
657
658Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
659reduction in code size. The resultant code would then also be suitable for
660exit value computation.
661
662//===---------------------------------------------------------------------===//
Chris Lattner349155b2008-03-17 01:47:51 +0000663
664We miss a bunch of rotate opportunities on various targets, including ppc, x86,
665etc. On X86, we miss a bunch of 'rotate by variable' cases because the rotate
666matching code in dag combine doesn't look through truncates aggressively
667enough. Here are some testcases reduces from GCC PR17886:
668
Chris Lattner349155b2008-03-17 01:47:51 +0000669unsigned long long f5(unsigned long long x, unsigned long long y) {
670 return (x << 8) | ((y >> 48) & 0xffull);
671}
672unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
673 switch(z) {
674 case 1:
675 return (x << 8) | ((y >> 48) & 0xffull);
676 case 2:
677 return (x << 16) | ((y >> 40) & 0xffffull);
678 case 3:
679 return (x << 24) | ((y >> 32) & 0xffffffull);
680 case 4:
681 return (x << 32) | ((y >> 24) & 0xffffffffull);
682 default:
683 return (x << 40) | ((y >> 16) & 0xffffffffffull);
684 }
685}
686
Chris Lattner349155b2008-03-17 01:47:51 +0000687//===---------------------------------------------------------------------===//
Chris Lattnerf70107f2008-03-20 04:46:13 +0000688
Chris Lattneref17f082010-12-15 07:10:43 +0000689This (and similar related idioms):
690
691unsigned int foo(unsigned char i) {
692 return i | (i<<8) | (i<<16) | (i<<24);
693}
694
695compiles into:
696
697define i32 @foo(i8 zeroext %i) nounwind readnone ssp noredzone {
698entry:
699 %conv = zext i8 %i to i32
700 %shl = shl i32 %conv, 8
701 %shl5 = shl i32 %conv, 16
702 %shl9 = shl i32 %conv, 24
703 %or = or i32 %shl9, %conv
704 %or6 = or i32 %or, %shl5
705 %or10 = or i32 %or6, %shl
706 ret i32 %or10
707}
708
709it would be better as:
710
711unsigned int bar(unsigned char i) {
712 unsigned int j=i | (i << 8);
713 return j | (j<<16);
714}
715
716aka:
717
718define i32 @bar(i8 zeroext %i) nounwind readnone ssp noredzone {
719entry:
720 %conv = zext i8 %i to i32
721 %shl = shl i32 %conv, 8
722 %or = or i32 %shl, %conv
723 %shl5 = shl i32 %or, 16
724 %or6 = or i32 %shl5, %or
725 ret i32 %or6
726}
727
728or even i*0x01010101, depending on the speed of the multiplier. The best way to
729handle this is to canonicalize it to a multiply in IR and have codegen handle
730lowering multiplies to shifts on cpus where shifts are faster.
731
732//===---------------------------------------------------------------------===//
733
Chris Lattnerf70107f2008-03-20 04:46:13 +0000734We do a number of simplifications in simplify libcalls to strength reduce
735standard library functions, but we don't currently merge them together. For
736example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy. This can only
737be done safely if "b" isn't modified between the strlen and memcpy of course.
738
739//===---------------------------------------------------------------------===//
740
Chris Lattner26e150f2008-08-10 01:14:08 +0000741We compile this program: (from GCC PR11680)
742http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487
743
744Into code that runs the same speed in fast/slow modes, but both modes run 2x
745slower than when compile with GCC (either 4.0 or 4.2):
746
747$ llvm-g++ perf.cpp -O3 -fno-exceptions
748$ time ./a.out fast
7491.821u 0.003s 0:01.82 100.0% 0+0k 0+0io 0pf+0w
750
751$ g++ perf.cpp -O3 -fno-exceptions
752$ time ./a.out fast
7530.821u 0.001s 0:00.82 100.0% 0+0k 0+0io 0pf+0w
754
755It looks like we are making the same inlining decisions, so this may be raw
756codegen badness or something else (haven't investigated).
757
758//===---------------------------------------------------------------------===//
759
Chris Lattner26e150f2008-08-10 01:14:08 +0000760Divisibility by constant can be simplified (according to GCC PR12849) from
761being a mulhi to being a mul lo (cheaper). Testcase:
762
763void bar(unsigned n) {
764 if (n % 3 == 0)
765 true();
766}
767
Eli Friedmanbcae2052009-12-12 23:23:43 +0000768This is equivalent to the following, where 2863311531 is the multiplicative
769inverse of 3, and 1431655766 is ((2^32)-1)/3+1:
770void bar(unsigned n) {
771 if (n * 2863311531U < 1431655766U)
772 true();
773}
774
775The same transformation can work with an even modulo with the addition of a
776rotate: rotate the result of the multiply to the right by the number of bits
777which need to be zero for the condition to be true, and shrink the compare RHS
778by the same amount. Unless the target supports rotates, though, that
779transformation probably isn't worthwhile.
780
781The transformation can also easily be made to work with non-zero equality
782comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0".
Chris Lattner26e150f2008-08-10 01:14:08 +0000783
784//===---------------------------------------------------------------------===//
Chris Lattner23f35bc2008-08-19 06:22:16 +0000785
Chris Lattnerdb039832008-10-15 16:06:03 +0000786Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
787bunch of other stuff from this example (see PR1604):
788
789#include <cstdio>
790struct test {
791 int val;
792 virtual ~test() {}
793};
794
795int main() {
796 test t;
797 std::scanf("%d", &t.val);
798 std::printf("%d\n", t.val);
799}
800
801//===---------------------------------------------------------------------===//
802
Nick Lewyckyd2f0db12008-11-27 22:41:45 +0000803These functions perform the same computation, but produce different assembly.
Nick Lewyckydf563ca2008-11-27 22:12:22 +0000804
805define i8 @select(i8 %x) readnone nounwind {
806 %A = icmp ult i8 %x, 250
807 %B = select i1 %A, i8 0, i8 1
808 ret i8 %B
809}
810
811define i8 @addshr(i8 %x) readnone nounwind {
812 %A = zext i8 %x to i9
813 %B = add i9 %A, 6 ;; 256 - 250 == 6
814 %C = lshr i9 %B, 8
815 %D = trunc i9 %C to i8
816 ret i8 %D
817}
818
819//===---------------------------------------------------------------------===//
Eli Friedman4e16b292008-11-30 07:36:04 +0000820
821From gcc bug 24696:
822int
823f (unsigned long a, unsigned long b, unsigned long c)
824{
825 return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
826}
827int
828f (unsigned long a, unsigned long b, unsigned long c)
829{
830 return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
831}
832Both should combine to ((a|b) & (c-1)) != 0. Currently not optimized with
833"clang -emit-llvm-bc | opt -std-compile-opts".
834
835//===---------------------------------------------------------------------===//
836
837From GCC Bug 20192:
838#define PMD_MASK (~((1UL << 23) - 1))
839void clear_pmd_range(unsigned long start, unsigned long end)
840{
841 if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
842 f();
843}
844The expression should optimize to something like
845"!((start|end)&~PMD_MASK). Currently not optimized with "clang
846-emit-llvm-bc | opt -std-compile-opts".
847
848//===---------------------------------------------------------------------===//
849
Eli Friedman4e16b292008-11-30 07:36:04 +0000850unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return
851i;}
852unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
853These should combine to the same thing. Currently, the first function
854produces better code on X86.
855
856//===---------------------------------------------------------------------===//
857
Eli Friedman4e16b292008-11-30 07:36:04 +0000858From GCC Bug 15784:
859#define abs(x) x>0?x:-x
860int f(int x, int y)
861{
862 return (abs(x)) >= 0;
863}
864This should optimize to x == INT_MIN. (With -fwrapv.) Currently not
865optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
866
867//===---------------------------------------------------------------------===//
868
869From GCC Bug 14753:
870void
871rotate_cst (unsigned int a)
872{
873 a = (a << 10) | (a >> 22);
874 if (a == 123)
875 bar ();
876}
877void
878minus_cst (unsigned int a)
879{
880 unsigned int tem;
881
882 tem = 20 - a;
883 if (tem == 5)
884 bar ();
885}
886void
887mask_gt (unsigned int a)
888{
889 /* This is equivalent to a > 15. */
890 if ((a & ~7) > 8)
891 bar ();
892}
893void
894rshift_gt (unsigned int a)
895{
896 /* This is equivalent to a > 23. */
897 if ((a >> 2) > 5)
898 bar ();
899}
Chris Lattnercb401952011-02-17 01:43:46 +0000900
901void neg_eq_cst(unsigned int a) {
902if (-a == 123)
903bar();
904}
905
Eli Friedman4e16b292008-11-30 07:36:04 +0000906All should simplify to a single comparison. All of these are
907currently not optimized with "clang -emit-llvm-bc | opt
908-std-compile-opts".
909
910//===---------------------------------------------------------------------===//
911
912From GCC Bug 32605:
913int c(int* x) {return (char*)x+2 == (char*)x;}
914Should combine to 0. Currently not optimized with "clang
915-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).
916
917//===---------------------------------------------------------------------===//
918
Eli Friedman4e16b292008-11-30 07:36:04 +0000919int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
920Should be combined to "((b >> 1) | b) & 1". Currently not optimized
921with "clang -emit-llvm-bc | opt -std-compile-opts".
922
923//===---------------------------------------------------------------------===//
924
925unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
926Should combine to "x | (y & 3)". Currently not optimized with "clang
927-emit-llvm-bc | opt -std-compile-opts".
928
929//===---------------------------------------------------------------------===//
930
Eli Friedman4e16b292008-11-30 07:36:04 +0000931int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
932Should fold to "(~a & c) | (a & b)". Currently not optimized with
933"clang -emit-llvm-bc | opt -std-compile-opts".
934
935//===---------------------------------------------------------------------===//
936
937int a(int a,int b) {return (~(a|b))|a;}
938Should fold to "a|~b". Currently not optimized with "clang
939-emit-llvm-bc | opt -std-compile-opts".
940
941//===---------------------------------------------------------------------===//
942
943int a(int a, int b) {return (a&&b) || (a&&!b);}
944Should fold to "a". Currently not optimized with "clang -emit-llvm-bc
945| opt -std-compile-opts".
946
947//===---------------------------------------------------------------------===//
948
949int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
950Should fold to "a ? b : c", or at least something sane. Currently not
951optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
952
953//===---------------------------------------------------------------------===//
954
955int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
956Should fold to a && (b || c). Currently not optimized with "clang
957-emit-llvm-bc | opt -std-compile-opts".
958
959//===---------------------------------------------------------------------===//
960
961int a(int x) {return x | ((x & 8) ^ 8);}
962Should combine to x | 8. Currently not optimized with "clang
963-emit-llvm-bc | opt -std-compile-opts".
964
965//===---------------------------------------------------------------------===//
966
967int a(int x) {return x ^ ((x & 8) ^ 8);}
968Should also combine to x | 8. Currently not optimized with "clang
969-emit-llvm-bc | opt -std-compile-opts".
970
971//===---------------------------------------------------------------------===//
972
Eli Friedman4e16b292008-11-30 07:36:04 +0000973int a(int x) {return ((x | -9) ^ 8) & x;}
974Should combine to x & -9. Currently not optimized with "clang
975-emit-llvm-bc | opt -std-compile-opts".
976
977//===---------------------------------------------------------------------===//
978
979unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
980Should combine to "a * 0x88888888 >> 31". Currently not optimized
981with "clang -emit-llvm-bc | opt -std-compile-opts".
982
983//===---------------------------------------------------------------------===//
984
985unsigned a(char* x) {if ((*x & 32) == 0) return b();}
986There's an unnecessary zext in the generated code with "clang
987-emit-llvm-bc | opt -std-compile-opts".
988
989//===---------------------------------------------------------------------===//
990
991unsigned a(unsigned long long x) {return 40 * (x >> 1);}
992Should combine to "20 * (((unsigned)x) & -2)". Currently not
993optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
994
995//===---------------------------------------------------------------------===//
Bill Wendling3bdcda82008-12-02 05:12:47 +0000996
Chris Lattner88d84b22008-12-02 06:32:34 +0000997This was noticed in the entryblock for grokdeclarator in 403.gcc:
998
999 %tmp = icmp eq i32 %decl_context, 4
1000 %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
1001 %tmp1 = icmp eq i32 %decl_context_addr.0, 1
1002 %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0
1003
1004tmp1 should be simplified to something like:
1005 (!tmp || decl_context == 1)
1006
1007This allows recursive simplifications, tmp1 is used all over the place in
1008the function, e.g. by:
1009
1010 %tmp23 = icmp eq i32 %decl_context_addr.1, 0 ; <i1> [#uses=1]
1011 %tmp24 = xor i1 %tmp1, true ; <i1> [#uses=1]
1012 %or.cond8 = and i1 %tmp23, %tmp24 ; <i1> [#uses=1]
1013
1014later.
1015
Chris Lattner78a7e7c2008-12-06 19:28:22 +00001016//===---------------------------------------------------------------------===//
1017
Chris Lattnerd4137f42009-11-29 02:19:52 +00001018[STORE SINKING]
1019
Chris Lattner78a7e7c2008-12-06 19:28:22 +00001020Store sinking: This code:
1021
1022void f (int n, int *cond, int *res) {
1023 int i;
1024 *res = 0;
1025 for (i = 0; i < n; i++)
1026 if (*cond)
1027 *res ^= 234; /* (*) */
1028}
1029
1030On this function GVN hoists the fully redundant value of *res, but nothing
1031moves the store out. This gives us this code:
1032
1033bb: ; preds = %bb2, %entry
1034 %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
1035 %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
1036 %1 = load i32* %cond, align 4
1037 %2 = icmp eq i32 %1, 0
1038 br i1 %2, label %bb2, label %bb1
1039
1040bb1: ; preds = %bb
1041 %3 = xor i32 %.rle, 234
1042 store i32 %3, i32* %res, align 4
1043 br label %bb2
1044
1045bb2: ; preds = %bb, %bb1
1046 %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
1047 %indvar.next = add i32 %i.05, 1
1048 %exitcond = icmp eq i32 %indvar.next, %n
1049 br i1 %exitcond, label %return, label %bb
1050
1051DSE should sink partially dead stores to get the store out of the loop.
1052
Chris Lattner6a09a742008-12-06 22:52:12 +00001053Here's another partial dead case:
1054http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395
1055
Chris Lattner78a7e7c2008-12-06 19:28:22 +00001056//===---------------------------------------------------------------------===//
1057
1058Scalar PRE hoists the mul in the common block up to the else:
1059
1060int test (int a, int b, int c, int g) {
1061 int d, e;
1062 if (a)
1063 d = b * c;
1064 else
1065 d = b - c;
1066 e = b * c + g;
1067 return d + e;
1068}
1069
1070It would be better to do the mul once to reduce codesize above the if.
1071This is GCC PR38204.
1072
Chris Lattnercce240d2011-01-06 07:41:22 +00001073
1074//===---------------------------------------------------------------------===//
1075This simple function from 179.art:
1076
1077int winner, numf2s;
1078struct { double y; int reset; } *Y;
1079
1080void find_match() {
1081 int i;
1082 winner = 0;
1083 for (i=0;i<numf2s;i++)
1084 if (Y[i].y > Y[winner].y)
1085 winner =i;
1086}
1087
1088Compiles into (with clang TBAA):
1089
1090for.body: ; preds = %for.inc, %bb.nph
1091 %indvar = phi i64 [ 0, %bb.nph ], [ %indvar.next, %for.inc ]
1092 %i.01718 = phi i32 [ 0, %bb.nph ], [ %i.01719, %for.inc ]
1093 %tmp4 = getelementptr inbounds %struct.anon* %tmp3, i64 %indvar, i32 0
1094 %tmp5 = load double* %tmp4, align 8, !tbaa !4
1095 %idxprom7 = sext i32 %i.01718 to i64
1096 %tmp10 = getelementptr inbounds %struct.anon* %tmp3, i64 %idxprom7, i32 0
1097 %tmp11 = load double* %tmp10, align 8, !tbaa !4
1098 %cmp12 = fcmp ogt double %tmp5, %tmp11
1099 br i1 %cmp12, label %if.then, label %for.inc
1100
1101if.then: ; preds = %for.body
1102 %i.017 = trunc i64 %indvar to i32
1103 br label %for.inc
1104
1105for.inc: ; preds = %for.body, %if.then
1106 %i.01719 = phi i32 [ %i.01718, %for.body ], [ %i.017, %if.then ]
1107 %indvar.next = add i64 %indvar, 1
1108 %exitcond = icmp eq i64 %indvar.next, %tmp22
1109 br i1 %exitcond, label %for.cond.for.end_crit_edge, label %for.body
1110
1111
1112It is good that we hoisted the reloads of numf2's, and Y out of the loop and
1113sunk the store to winner out.
1114
1115However, this is awful on several levels: the conditional truncate in the loop
1116(-indvars at fault? why can't we completely promote the IV to i64?).
1117
1118Beyond that, we have a partially redundant load in the loop: if "winner" (aka
1119%i.01718) isn't updated, we reload Y[winner].y the next time through the loop.
1120Similarly, the addressing that feeds it (including the sext) is redundant. In
1121the end we get this generated assembly:
1122
1123LBB0_2: ## %for.body
1124 ## =>This Inner Loop Header: Depth=1
1125 movsd (%rdi), %xmm0
1126 movslq %edx, %r8
1127 shlq $4, %r8
1128 ucomisd (%rcx,%r8), %xmm0
1129 jbe LBB0_4
1130 movl %esi, %edx
1131LBB0_4: ## %for.inc
1132 addq $16, %rdi
1133 incq %rsi
1134 cmpq %rsi, %rax
1135 jne LBB0_2
1136
1137All things considered this isn't too bad, but we shouldn't need the movslq or
1138the shlq instruction, or the load folded into ucomisd every time through the
1139loop.
1140
1141On an x86-specific topic, if the loop can't be restructure, the movl should be a
1142cmov.
1143
Chris Lattner78a7e7c2008-12-06 19:28:22 +00001144//===---------------------------------------------------------------------===//
1145
Chris Lattnerd4137f42009-11-29 02:19:52 +00001146[STORE SINKING]
1147
Chris Lattner78a7e7c2008-12-06 19:28:22 +00001148GCC PR37810 is an interesting case where we should sink load/store reload
1149into the if block and outside the loop, so we don't reload/store it on the
1150non-call path.
1151
1152for () {
1153 *P += 1;
1154 if ()
1155 call();
1156 else
1157 ...
1158->
1159tmp = *P
1160for () {
1161 tmp += 1;
1162 if () {
1163 *P = tmp;
1164 call();
1165 tmp = *P;
1166 } else ...
1167}
1168*P = tmp;
1169
Chris Lattner8f416f32008-12-15 07:49:24 +00001170We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
1171we don't sink the store. We need partially dead store sinking.
1172
Chris Lattner78a7e7c2008-12-06 19:28:22 +00001173//===---------------------------------------------------------------------===//
1174
Chris Lattnerd4137f42009-11-29 02:19:52 +00001175[LOAD PRE CRIT EDGE SPLITTING]
Chris Lattner8f416f32008-12-15 07:49:24 +00001176
Chris Lattner78a7e7c2008-12-06 19:28:22 +00001177GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
1178leading to excess stack traffic. This could be handled by GVN with some crazy
1179symbolic phi translation. The code we get looks like (g is on the stack):
1180
1181bb2: ; preds = %bb1
1182..
1183 %9 = getelementptr %struct.f* %g, i32 0, i32 0
1184 store i32 %8, i32* %9, align bel %bb3
1185
1186bb3: ; preds = %bb1, %bb2, %bb
1187 %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
1188 %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
1189 %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
1190 %11 = load i32* %10, align 4
1191
Chris Lattner6d949262009-11-27 16:53:57 +00001192%11 is partially redundant, an in BB2 it should have the value %8.
Chris Lattner78a7e7c2008-12-06 19:28:22 +00001193
Chris Lattnerd4137f42009-11-29 02:19:52 +00001194GCC PR33344 and PR35287 are similar cases.
Chris Lattner6a09a742008-12-06 22:52:12 +00001195
Chris Lattner6c9fab72009-11-05 18:19:19 +00001196
1197//===---------------------------------------------------------------------===//
1198
Chris Lattnerd4137f42009-11-29 02:19:52 +00001199[LOAD PRE]
1200
Chris Lattner6a09a742008-12-06 22:52:12 +00001201There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
Chris Lattnerd4137f42009-11-29 02:19:52 +00001202GCC testsuite, ones we don't get yet are (checked through loadpre25):
1203
1204[CRIT EDGE BREAKING]
1205loadpre3.c predcom-4.c
1206
1207[PRE OF READONLY CALL]
1208loadpre5.c
1209
1210[TURN SELECT INTO BRANCH]
1211loadpre14.c loadpre15.c
1212
1213actually a conditional increment: loadpre18.c loadpre19.c
1214
Chris Lattner2fc36e12010-12-15 06:38:24 +00001215//===---------------------------------------------------------------------===//
1216
1217[LOAD PRE / STORE SINKING / SPEC HACK]
1218
1219This is a chunk of code from 456.hmmer:
1220
1221int f(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int *dpp,
1222 int *tpdm, int xmb, int *bp, int *ms) {
1223 int k, sc;
1224 for (k = 1; k <= M; k++) {
1225 mc[k] = mpp[k-1] + tpmm[k-1];
1226 if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
1227 if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
1228 if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
1229 mc[k] += ms[k];
1230 }
1231}
1232
1233It is very profitable for this benchmark to turn the conditional stores to mc[k]
1234into a conditional move (select instr in IR) and allow the final store to do the
1235store. See GCC PR27313 for more details. Note that this is valid to xform even
1236with the new C++ memory model, since mc[k] is previously loaded and later
1237stored.
Chris Lattnerd4137f42009-11-29 02:19:52 +00001238
1239//===---------------------------------------------------------------------===//
1240
1241[SCALAR PRE]
1242There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the
1243GCC testsuite.
Chris Lattner6a09a742008-12-06 22:52:12 +00001244
Chris Lattner582048d2008-12-15 08:32:28 +00001245//===---------------------------------------------------------------------===//
1246
1247There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
Chris Lattnerd4137f42009-11-29 02:19:52 +00001248GCC testsuite. For example, we get the first example in predcom-1.c, but
1249miss the second one:
Chris Lattner582048d2008-12-15 08:32:28 +00001250
Chris Lattnerd4137f42009-11-29 02:19:52 +00001251unsigned fib[1000];
1252unsigned avg[1000];
Chris Lattner582048d2008-12-15 08:32:28 +00001253
Chris Lattnerd4137f42009-11-29 02:19:52 +00001254__attribute__ ((noinline))
1255void count_averages(int n) {
1256 int i;
1257 for (i = 1; i < n; i++)
1258 avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;
1259}
Chris Lattner582048d2008-12-15 08:32:28 +00001260
Chris Lattnerd4137f42009-11-29 02:19:52 +00001261which compiles into two loads instead of one in the loop.
Chris Lattner582048d2008-12-15 08:32:28 +00001262
Chris Lattnerd4137f42009-11-29 02:19:52 +00001263predcom-2.c is the same as predcom-1.c
Chris Lattner582048d2008-12-15 08:32:28 +00001264
Chris Lattner582048d2008-12-15 08:32:28 +00001265predcom-3.c is very similar but needs loads feeding each other instead of
1266store->load.
Chris Lattner582048d2008-12-15 08:32:28 +00001267
1268
1269//===---------------------------------------------------------------------===//
1270
Chris Lattneraa306c22010-01-23 17:59:23 +00001271[ALIAS ANALYSIS]
1272
Chris Lattner582048d2008-12-15 08:32:28 +00001273Type based alias analysis:
Chris Lattner6a09a742008-12-06 22:52:12 +00001274http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705
1275
Chris Lattneraa306c22010-01-23 17:59:23 +00001276We should do better analysis of posix_memalign. At the least it should
1277no-capture its pointer argument, at best, we should know that the out-value
1278result doesn't point to anything (like malloc). One example of this is in
1279SingleSource/Benchmarks/Misc/dt.c
1280
Chris Lattner6a09a742008-12-06 22:52:12 +00001281//===---------------------------------------------------------------------===//
1282
Chris Lattner6a09a742008-12-06 22:52:12 +00001283Interesting missed case because of control flow flattening (should be 2 loads):
1284http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
Chris Lattner582048d2008-12-15 08:32:28 +00001285With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
1286 opt -mem2reg -gvn -instcombine | llvm-dis
Chris Lattnerd4137f42009-11-29 02:19:52 +00001287we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT
Chris Lattner582048d2008-12-15 08:32:28 +00001288VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS
Chris Lattner6a09a742008-12-06 22:52:12 +00001289
1290//===---------------------------------------------------------------------===//
1291
1292http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
1293We could eliminate the branch condition here, loading from null is undefined:
1294
1295struct S { int w, x, y, z; };
1296struct T { int r; struct S s; };
1297void bar (struct S, int);
1298void foo (int a, struct T b)
1299{
1300 struct S *c = 0;
1301 if (a)
1302 c = &b.s;
1303 bar (*c, a);
1304}
1305
1306//===---------------------------------------------------------------------===//
Chris Lattner88d84b22008-12-02 06:32:34 +00001307
Chris Lattner9cf8ef62008-12-23 20:52:52 +00001308simplifylibcalls should do several optimizations for strspn/strcspn:
1309
Chris Lattner9cf8ef62008-12-23 20:52:52 +00001310strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):
1311
1312size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
1313 int __reject3) {
1314 register size_t __result = 0;
1315 while (__s[__result] != '\0' && __s[__result] != __reject1 &&
1316 __s[__result] != __reject2 && __s[__result] != __reject3)
1317 ++__result;
1318 return __result;
1319}
1320
1321This should turn into a switch on the character. See PR3253 for some notes on
1322codegen.
1323
1324456.hmmer apparently uses strcspn and strspn a lot. 471.omnetpp uses strspn.
1325
1326//===---------------------------------------------------------------------===//
Chris Lattnerd23b7992008-12-31 00:54:13 +00001327
Chris Lattnerd3e768e2011-03-01 00:24:51 +00001328simplifylibcalls should turn these snprintf idioms into memcpy (GCC PR47917)
1329
1330char buf1[6], buf2[6], buf3[4], buf4[4];
1331int i;
1332
1333int foo (void) {
1334 int ret = snprintf (buf1, sizeof buf1, "abcde");
1335 ret += snprintf (buf2, sizeof buf2, "abcdef") * 16;
1336 ret += snprintf (buf3, sizeof buf3, "%s", i++ < 6 ? "abc" : "def") * 256;
1337 ret += snprintf (buf4, sizeof buf4, "%s", i++ > 10 ? "abcde" : "defgh")*4096;
1338 return ret;
1339}
1340
1341//===---------------------------------------------------------------------===//
1342
Chris Lattnerd23b7992008-12-31 00:54:13 +00001343"gas" uses this idiom:
1344 else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
1345..
1346 else if (strchr ("<>", *intel_parser.op_string)
1347
1348Those should be turned into a switch.
1349
1350//===---------------------------------------------------------------------===//
Chris Lattnerffb08f52009-01-08 06:52:57 +00001351
1352252.eon contains this interesting code:
1353
1354 %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
1355 %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
1356 %strlen = call i32 @strlen(i8* %3072) ; uses = 1
1357 %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
1358 call void @llvm.memcpy.i32(i8* %endptr,
1359 i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
1360 %3074 = call i32 @strlen(i8* %endptr) nounwind readonly
1361
1362This is interesting for a couple reasons. First, in this:
1363
Benjamin Kramer9d071cb2010-12-23 15:32:07 +00001364The memcpy+strlen strlen can be replaced with:
Chris Lattnerffb08f52009-01-08 06:52:57 +00001365
1366 %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly
1367
1368Because the destination was just copied into the specified memory buffer. This,
1369in turn, can be constant folded to "4".
1370
1371In other code, it contains:
1372
1373 %endptr6978 = bitcast i8* %endptr69 to i32*
1374 store i32 7107374, i32* %endptr6978, align 1
1375 %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly
1376
1377Which could also be constant folded. Whatever is producing this should probably
1378be fixed to leave this as a memcpy from a string.
1379
1380Further, eon also has an interesting partially redundant strlen call:
1381
1382bb8: ; preds = %_ZN18eonImageCalculatorC1Ev.exit
1383 %682 = getelementptr i8** %argv, i32 6 ; <i8**> [#uses=2]
1384 %683 = load i8** %682, align 4 ; <i8*> [#uses=4]
1385 %684 = load i8* %683, align 1 ; <i8> [#uses=1]
1386 %685 = icmp eq i8 %684, 0 ; <i1> [#uses=1]
1387 br i1 %685, label %bb10, label %bb9
1388
1389bb9: ; preds = %bb8
1390 %686 = call i32 @strlen(i8* %683) nounwind readonly
1391 %687 = icmp ugt i32 %686, 254 ; <i1> [#uses=1]
1392 br i1 %687, label %bb10, label %bb11
1393
1394bb10: ; preds = %bb9, %bb8
1395 %688 = call i32 @strlen(i8* %683) nounwind readonly
1396
1397This could be eliminated by doing the strlen once in bb8, saving code size and
1398improving perf on the bb8->9->10 path.
1399
1400//===---------------------------------------------------------------------===//
Chris Lattner9fee08f2009-01-08 07:34:55 +00001401
1402I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
1403which looks like:
1404 %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0
1405
1406
1407bb62: ; preds = %bb55, %bb53
1408 %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
1409 %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
1410 %172 = add i32 %171, -1 ; <i32> [#uses=1]
1411 %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172
1412
1413... no stores ...
1414 br i1 %or.cond, label %bb65, label %bb72
1415
1416bb65: ; preds = %bb62
1417 store i8 0, i8* %173, align 1
1418 br label %bb72
1419
1420bb72: ; preds = %bb65, %bb62
1421 %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
1422 %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
1423
1424Note that on the bb62->bb72 path, that the %177 strlen call is partially
1425redundant with the %171 call. At worst, we could shove the %177 strlen call
1426up into the bb65 block moving it out of the bb62->bb72 path. However, note
1427that bb65 stores to the string, zeroing out the last byte. This means that on
1428that path the value of %177 is actually just %171-1. A sub is cheaper than a
1429strlen!
1430
1431This pattern repeats several times, basically doing:
1432
1433 A = strlen(P);
1434 P[A-1] = 0;
1435 B = strlen(P);
1436 where it is "obvious" that B = A-1.
1437
1438//===---------------------------------------------------------------------===//
1439
Chris Lattner9fee08f2009-01-08 07:34:55 +00001440186.crafty has this interesting pattern with the "out.4543" variable:
1441
1442call void @llvm.memcpy.i32(
1443 i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
1444 i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
1445%101 = call@printf(i8* ... @out.4543, i32 0, i32 0)) nounwind
1446
1447It is basically doing:
1448
1449 memcpy(globalarray, "string");
1450 printf(..., globalarray);
1451
1452Anyway, by knowing that printf just reads the memory and forward substituting
1453the string directly into the printf, this eliminates reads from globalarray.
1454Since this pattern occurs frequently in crafty (due to the "DisplayTime" and
1455other similar functions) there are many stores to "out". Once all the printfs
1456stop using "out", all that is left is the memcpy's into it. This should allow
1457globalopt to remove the "stored only" global.
1458
1459//===---------------------------------------------------------------------===//
1460
Dan Gohman8289b052009-01-20 01:07:33 +00001461This code:
1462
1463define inreg i32 @foo(i8* inreg %p) nounwind {
1464 %tmp0 = load i8* %p
1465 %tmp1 = ashr i8 %tmp0, 5
1466 %tmp2 = sext i8 %tmp1 to i32
1467 ret i32 %tmp2
1468}
1469
1470could be dagcombine'd to a sign-extending load with a shift.
1471For example, on x86 this currently gets this:
1472
1473 movb (%eax), %al
1474 sarb $5, %al
1475 movsbl %al, %eax
1476
1477while it could get this:
1478
1479 movsbl (%eax), %eax
1480 sarl $5, %eax
1481
1482//===---------------------------------------------------------------------===//
Chris Lattner256baa42009-01-22 07:16:03 +00001483
1484GCC PR31029:
1485
1486int test(int x) { return 1-x == x; } // --> return false
1487int test2(int x) { return 2-x == x; } // --> return x == 1 ?
1488
1489Always foldable for odd constants, what is the rule for even?
1490
1491//===---------------------------------------------------------------------===//
1492
Torok Edwine46a6862009-01-24 19:30:25 +00001493PR 3381: GEP to field of size 0 inside a struct could be turned into GEP
1494for next field in struct (which is at same address).
1495
1496For example: store of float into { {{}}, float } could be turned into a store to
1497the float directly.
1498
Torok Edwin474479f2009-02-20 18:42:06 +00001499//===---------------------------------------------------------------------===//
Nick Lewycky20babb12009-02-25 06:52:48 +00001500
Chris Lattner32c5f172009-05-11 17:41:40 +00001501The arg promotion pass should make use of nocapture to make its alias analysis
1502stuff much more precise.
1503
1504//===---------------------------------------------------------------------===//
1505
1506The following functions should be optimized to use a select instead of a
1507branch (from gcc PR40072):
1508
1509char char_int(int m) {if(m>7) return 0; return m;}
1510int int_char(char m) {if(m>7) return 0; return m;}
1511
1512//===---------------------------------------------------------------------===//
1513
Bill Wendling5a569272009-10-27 22:48:31 +00001514int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }
1515
1516Generates this:
1517
1518define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
1519entry:
1520 %0 = and i32 %a, 128 ; <i32> [#uses=1]
1521 %1 = icmp eq i32 %0, 0 ; <i1> [#uses=1]
1522 %2 = or i32 %b, 128 ; <i32> [#uses=1]
1523 %3 = and i32 %b, -129 ; <i32> [#uses=1]
1524 %b_addr.0 = select i1 %1, i32 %3, i32 %2 ; <i32> [#uses=1]
1525 ret i32 %b_addr.0
1526}
1527
1528However, it's functionally equivalent to:
1529
1530 b = (b & ~0x80) | (a & 0x80);
1531
1532Which generates this:
1533
1534define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
1535entry:
1536 %0 = and i32 %b, -129 ; <i32> [#uses=1]
1537 %1 = and i32 %a, 128 ; <i32> [#uses=1]
1538 %2 = or i32 %0, %1 ; <i32> [#uses=1]
1539 ret i32 %2
1540}
1541
1542This can be generalized for other forms:
1543
1544 b = (b & ~0x80) | (a & 0x40) << 1;
1545
1546//===---------------------------------------------------------------------===//
Bill Wendlingc872e9c2009-10-27 23:30:07 +00001547
1548These two functions produce different code. They shouldn't:
1549
1550#include <stdint.h>
1551
1552uint8_t p1(uint8_t b, uint8_t a) {
1553 b = (b & ~0xc0) | (a & 0xc0);
1554 return (b);
1555}
1556
1557uint8_t p2(uint8_t b, uint8_t a) {
1558 b = (b & ~0x40) | (a & 0x40);
1559 b = (b & ~0x80) | (a & 0x80);
1560 return (b);
1561}
1562
1563define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
1564entry:
1565 %0 = and i8 %b, 63 ; <i8> [#uses=1]
1566 %1 = and i8 %a, -64 ; <i8> [#uses=1]
1567 %2 = or i8 %1, %0 ; <i8> [#uses=1]
1568 ret i8 %2
1569}
1570
1571define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
1572entry:
1573 %0 = and i8 %b, 63 ; <i8> [#uses=1]
1574 %.masked = and i8 %a, 64 ; <i8> [#uses=1]
1575 %1 = and i8 %a, -128 ; <i8> [#uses=1]
1576 %2 = or i8 %1, %0 ; <i8> [#uses=1]
1577 %3 = or i8 %2, %.masked ; <i8> [#uses=1]
1578 ret i8 %3
1579}
1580
1581//===---------------------------------------------------------------------===//
Chris Lattner6fdfc9c2009-11-11 17:51:27 +00001582
1583IPSCCP does not currently propagate argument dependent constants through
1584functions where it does not not all of the callers. This includes functions
1585with normal external linkage as well as templates, C99 inline functions etc.
1586Specifically, it does nothing to:
1587
1588define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
1589entry:
1590 %0 = add nsw i32 %y, %z
1591 %1 = mul i32 %0, %x
1592 %2 = mul i32 %y, %z
1593 %3 = add nsw i32 %1, %2
1594 ret i32 %3
1595}
1596
1597define i32 @test2() nounwind {
1598entry:
1599 %0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
1600 ret i32 %0
1601}
1602
1603It would be interesting extend IPSCCP to be able to handle simple cases like
1604this, where all of the arguments to a call are constant. Because IPSCCP runs
1605before inlining, trivial templates and inline functions are not yet inlined.
1606The results for a function + set of constant arguments should be memoized in a
1607map.
1608
1609//===---------------------------------------------------------------------===//
Chris Lattnerfc926c22009-11-11 17:54:02 +00001610
1611The libcall constant folding stuff should be moved out of SimplifyLibcalls into
1612libanalysis' constantfolding logic. This would allow IPSCCP to be able to
1613handle simple things like this:
1614
1615static int foo(const char *X) { return strlen(X); }
1616int bar() { return foo("abcd"); }
1617
1618//===---------------------------------------------------------------------===//
Nick Lewycky93f9f7a2009-11-15 17:51:23 +00001619
Duncan Sandse10920d2010-01-06 15:37:47 +00001620functionattrs doesn't know much about memcpy/memset. This function should be
Duncan Sands7c422ac2010-01-06 08:45:52 +00001621marked readnone rather than readonly, since it only twiddles local memory, but
1622functionattrs doesn't handle memset/memcpy/memmove aggressively:
Chris Lattner89742c22009-12-03 07:43:46 +00001623
1624struct X { int *p; int *q; };
1625int foo() {
1626 int i = 0, j = 1;
1627 struct X x, y;
1628 int **p;
1629 y.p = &i;
1630 x.q = &j;
1631 p = __builtin_memcpy (&x, &y, sizeof (int *));
1632 return **p;
1633}
1634
Chris Lattner9c8fb9e2011-01-01 22:52:11 +00001635This can be seen at:
1636$ clang t.c -S -o - -mkernel -O0 -emit-llvm | opt -functionattrs -S
1637
1638
Chris Lattner05332172009-12-03 07:41:54 +00001639//===---------------------------------------------------------------------===//
1640
Eli Friedman9cfb3ad2010-01-18 22:36:59 +00001641Missed instcombine transformation:
1642define i1 @a(i32 %x) nounwind readnone {
1643entry:
1644 %cmp = icmp eq i32 %x, 30
1645 %sub = add i32 %x, -30
1646 %cmp2 = icmp ugt i32 %sub, 9
1647 %or = or i1 %cmp, %cmp2
1648 ret i1 %or
1649}
1650This should be optimized to a single compare. Testcase derived from gcc.
1651
1652//===---------------------------------------------------------------------===//
1653
Eli Friedman9cfb3ad2010-01-18 22:36:59 +00001654Missed instcombine or reassociate transformation:
1655int a(int a, int b) { return (a==12)&(b>47)&(b<58); }
1656
1657The sgt and slt should be combined into a single comparison. Testcase derived
1658from gcc.
1659
1660//===---------------------------------------------------------------------===//
1661
1662Missed instcombine transformation:
Chris Lattner3e411062010-11-21 07:05:31 +00001663
1664 %382 = srem i32 %tmp14.i, 64 ; [#uses=1]
1665 %383 = zext i32 %382 to i64 ; [#uses=1]
1666 %384 = shl i64 %381, %383 ; [#uses=1]
1667 %385 = icmp slt i32 %tmp14.i, 64 ; [#uses=1]
1668
Benjamin Kramerc21a8212010-11-23 20:33:57 +00001669The srem can be transformed to an and because if %tmp14.i is negative, the
1670shift is undefined. Testcase derived from 403.gcc.
Chris Lattner3e411062010-11-21 07:05:31 +00001671
1672//===---------------------------------------------------------------------===//
1673
1674This is a range comparison on a divided result (from 403.gcc):
1675
1676 %1337 = sdiv i32 %1336, 8 ; [#uses=1]
1677 %.off.i208 = add i32 %1336, 7 ; [#uses=1]
1678 %1338 = icmp ult i32 %.off.i208, 15 ; [#uses=1]
1679
1680We already catch this (removing the sdiv) if there isn't an add, we should
1681handle the 'add' as well. This is a common idiom with it's builtin_alloca code.
1682C testcase:
1683
1684int a(int x) { return (unsigned)(x/16+7) < 15; }
1685
1686Another similar case involves truncations on 64-bit targets:
1687
1688 %361 = sdiv i64 %.046, 8 ; [#uses=1]
1689 %362 = trunc i64 %361 to i32 ; [#uses=2]
1690...
1691 %367 = icmp eq i32 %362, 0 ; [#uses=1]
1692
Eli Friedman1144d7e2010-01-31 04:55:32 +00001693//===---------------------------------------------------------------------===//
1694
1695Missed instcombine/dagcombine transformation:
1696define void @lshift_lt(i8 zeroext %a) nounwind {
1697entry:
1698 %conv = zext i8 %a to i32
1699 %shl = shl i32 %conv, 3
1700 %cmp = icmp ult i32 %shl, 33
1701 br i1 %cmp, label %if.then, label %if.end
1702
1703if.then:
1704 tail call void @bar() nounwind
1705 ret void
1706
1707if.end:
1708 ret void
1709}
1710declare void @bar() nounwind
1711
1712The shift should be eliminated. Testcase derived from gcc.
Eli Friedman9cfb3ad2010-01-18 22:36:59 +00001713
1714//===---------------------------------------------------------------------===//
Chris Lattnercf031f62010-02-09 00:11:10 +00001715
1716These compile into different code, one gets recognized as a switch and the
1717other doesn't due to phase ordering issues (PR6212):
1718
1719int test1(int mainType, int subType) {
1720 if (mainType == 7)
1721 subType = 4;
1722 else if (mainType == 9)
1723 subType = 6;
1724 else if (mainType == 11)
1725 subType = 9;
1726 return subType;
1727}
1728
1729int test2(int mainType, int subType) {
1730 if (mainType == 7)
1731 subType = 4;
1732 if (mainType == 9)
1733 subType = 6;
1734 if (mainType == 11)
1735 subType = 9;
1736 return subType;
1737}
1738
1739//===---------------------------------------------------------------------===//
Chris Lattner66636702010-03-10 21:42:42 +00001740
1741The following test case (from PR6576):
1742
1743define i32 @mul(i32 %a, i32 %b) nounwind readnone {
1744entry:
1745 %cond1 = icmp eq i32 %b, 0 ; <i1> [#uses=1]
1746 br i1 %cond1, label %exit, label %bb.nph
1747bb.nph: ; preds = %entry
1748 %tmp = mul i32 %b, %a ; <i32> [#uses=1]
1749 ret i32 %tmp
1750exit: ; preds = %entry
1751 ret i32 0
1752}
1753
1754could be reduced to:
1755
1756define i32 @mul(i32 %a, i32 %b) nounwind readnone {
1757entry:
1758 %tmp = mul i32 %b, %a
1759 ret i32 %tmp
1760}
1761
1762//===---------------------------------------------------------------------===//
1763
Chris Lattner94846892010-04-16 23:52:30 +00001764We should use DSE + llvm.lifetime.end to delete dead vtable pointer updates.
1765See GCC PR34949
1766
Chris Lattnerc2685a92010-05-21 23:16:21 +00001767Another interesting case is that something related could be used for variables
1768that go const after their ctor has finished. In these cases, globalopt (which
1769can statically run the constructor) could mark the global const (so it gets put
1770in the readonly section). A testcase would be:
1771
1772#include <complex>
1773using namespace std;
1774const complex<char> should_be_in_rodata (42,-42);
1775complex<char> should_be_in_data (42,-42);
1776complex<char> should_be_in_bss;
1777
1778Where we currently evaluate the ctors but the globals don't become const because
1779the optimizer doesn't know they "become const" after the ctor is done. See
1780GCC PR4131 for more examples.
1781
Chris Lattner94846892010-04-16 23:52:30 +00001782//===---------------------------------------------------------------------===//
1783
Dan Gohman3a2a4842010-05-03 14:31:00 +00001784In this code:
1785
1786long foo(long x) {
1787 return x > 1 ? x : 1;
1788}
1789
1790LLVM emits a comparison with 1 instead of 0. 0 would be equivalent
1791and cheaper on most targets.
1792
1793LLVM prefers comparisons with zero over non-zero in general, but in this
1794case it choses instead to keep the max operation obvious.
1795
1796//===---------------------------------------------------------------------===//
Eli Friedman8c47d3b2010-06-12 05:54:27 +00001797
Eli Friedmanb4a74c12010-07-03 07:38:12 +00001798Switch lowering generates less than ideal code for the following switch:
1799define void @a(i32 %x) nounwind {
1800entry:
1801 switch i32 %x, label %if.end [
1802 i32 0, label %if.then
1803 i32 1, label %if.then
1804 i32 2, label %if.then
1805 i32 3, label %if.then
1806 i32 5, label %if.then
1807 ]
1808if.then:
1809 tail call void @foo() nounwind
1810 ret void
1811if.end:
1812 ret void
1813}
1814declare void @foo()
1815
1816Generated code on x86-64 (other platforms give similar results):
1817a:
1818 cmpl $5, %edi
1819 ja .LBB0_2
1820 movl %edi, %eax
1821 movl $47, %ecx
1822 btq %rax, %rcx
1823 jb .LBB0_3
1824.LBB0_2:
1825 ret
1826.LBB0_3:
Eli Friedmanb4828292010-07-03 08:43:32 +00001827 jmp foo # TAILCALL
Eli Friedmanb4a74c12010-07-03 07:38:12 +00001828
1829The movl+movl+btq+jb could be simplified to a cmpl+jne.
1830
Eli Friedmanb4828292010-07-03 08:43:32 +00001831Or, if we wanted to be really clever, we could simplify the whole thing to
1832something like the following, which eliminates a branch:
1833 xorl $1, %edi
1834 cmpl $4, %edi
1835 ja .LBB0_2
1836 ret
1837.LBB0_2:
1838 jmp foo # TAILCALL
1839
Eli Friedmanb4a74c12010-07-03 07:38:12 +00001840//===---------------------------------------------------------------------===//
Chris Lattner274191f2010-11-09 19:37:28 +00001841
Chris Lattneraf510f12010-11-11 18:23:57 +00001842We compile this:
1843
1844int foo(int a) { return (a & (~15)) / 16; }
1845
1846Into:
1847
1848define i32 @foo(i32 %a) nounwind readnone ssp {
1849entry:
1850 %and = and i32 %a, -16
1851 %div = sdiv i32 %and, 16
1852 ret i32 %div
1853}
1854
1855but this code (X & -A)/A is X >> log2(A) when A is a power of 2, so this case
1856should be instcombined into just "a >> 4".
1857
1858We do get this at the codegen level, so something knows about it, but
1859instcombine should catch it earlier:
1860
1861_foo: ## @foo
1862## BB#0: ## %entry
1863 movl %edi, %eax
1864 sarl $4, %eax
1865 ret
1866
1867//===---------------------------------------------------------------------===//
1868
Chris Lattnera97c91f2010-12-13 00:15:25 +00001869This code (from GCC PR28685):
1870
1871int test(int a, int b) {
1872 int lt = a < b;
1873 int eq = a == b;
1874 if (lt)
1875 return 1;
1876 return eq;
1877}
1878
1879Is compiled to:
1880
1881define i32 @test(i32 %a, i32 %b) nounwind readnone ssp {
1882entry:
1883 %cmp = icmp slt i32 %a, %b
1884 br i1 %cmp, label %return, label %if.end
1885
1886if.end: ; preds = %entry
1887 %cmp5 = icmp eq i32 %a, %b
1888 %conv6 = zext i1 %cmp5 to i32
1889 ret i32 %conv6
1890
1891return: ; preds = %entry
1892 ret i32 1
1893}
1894
1895it could be:
1896
1897define i32 @test__(i32 %a, i32 %b) nounwind readnone ssp {
1898entry:
1899 %0 = icmp sle i32 %a, %b
1900 %retval = zext i1 %0 to i32
1901 ret i32 %retval
1902}
1903
1904//===---------------------------------------------------------------------===//
Duncan Sands124708d2011-01-01 20:08:02 +00001905
Benjamin Kramereaff66a2011-01-07 20:42:20 +00001906This code can be seen in viterbi:
1907
1908 %64 = call noalias i8* @malloc(i64 %62) nounwind
1909...
1910 %67 = call i64 @llvm.objectsize.i64(i8* %64, i1 false) nounwind
1911 %68 = call i8* @__memset_chk(i8* %64, i32 0, i64 %62, i64 %67) nounwind
1912
1913llvm.objectsize.i64 should be taught about malloc/calloc, allowing it to
1914fold to %62. This is a security win (overflows of malloc will get caught)
1915and also a performance win by exposing more memsets to the optimizer.
1916
1917This occurs several times in viterbi.
1918
1919Note that this would change the semantics of @llvm.objectsize which by its
1920current definition always folds to a constant. We also should make sure that
1921we remove checking in code like
1922
1923 char *p = malloc(strlen(s)+1);
1924 __strcpy_chk(p, s, __builtin_objectsize(p, 0));
1925
1926//===---------------------------------------------------------------------===//
1927
Chris Lattnerc1853e42011-01-06 07:09:23 +00001928This code (from Benchmarks/Dhrystone/dry.c):
1929
1930define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
1931entry:
1932 %sext = shl i32 %0, 24
1933 %conv = ashr i32 %sext, 24
1934 %sext6 = shl i32 %1, 24
1935 %conv4 = ashr i32 %sext6, 24
1936 %cmp = icmp eq i32 %conv, %conv4
1937 %. = select i1 %cmp, i32 10000, i32 0
1938 ret i32 %.
1939}
1940
1941Should be simplified into something like:
1942
1943define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
1944entry:
1945 %sext = shl i32 %0, 24
1946 %conv = and i32 %sext, 0xFF000000
1947 %sext6 = shl i32 %1, 24
1948 %conv4 = and i32 %sext6, 0xFF000000
1949 %cmp = icmp eq i32 %conv, %conv4
1950 %. = select i1 %cmp, i32 10000, i32 0
1951 ret i32 %.
1952}
1953
1954and then to:
1955
1956define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
1957entry:
1958 %conv = and i32 %0, 0xFF
1959 %conv4 = and i32 %1, 0xFF
1960 %cmp = icmp eq i32 %conv, %conv4
1961 %. = select i1 %cmp, i32 10000, i32 0
1962 ret i32 %.
1963}
1964//===---------------------------------------------------------------------===//
Chris Lattner15df0442011-01-01 22:57:31 +00001965
Benjamin Kramerfa366802011-01-06 17:35:50 +00001966clang -O3 currently compiles this code
1967
1968int g(unsigned int a) {
1969 unsigned int c[100];
1970 c[10] = a;
1971 c[11] = a;
1972 unsigned int b = c[10] + c[11];
1973 if(b > a*2) a = 4;
1974 else a = 8;
1975 return a + 7;
1976}
1977
1978into
1979
1980define i32 @g(i32 a) nounwind readnone {
1981 %add = shl i32 %a, 1
1982 %mul = shl i32 %a, 1
1983 %cmp = icmp ugt i32 %add, %mul
1984 %a.addr.0 = select i1 %cmp, i32 11, i32 15
1985 ret i32 %a.addr.0
1986}
1987
1988The icmp should fold to false. This CSE opportunity is only available
1989after GVN and InstCombine have run.
1990
1991//===---------------------------------------------------------------------===//
Chris Lattner01cdc202011-01-06 22:25:00 +00001992
1993memcpyopt should turn this:
1994
1995define i8* @test10(i32 %x) {
1996 %alloc = call noalias i8* @malloc(i32 %x) nounwind
1997 call void @llvm.memset.p0i8.i32(i8* %alloc, i8 0, i32 %x, i32 1, i1 false)
1998 ret i8* %alloc
1999}
2000
2001into a call to calloc. We should make sure that we analyze calloc as
2002aggressively as malloc though.
2003
2004//===---------------------------------------------------------------------===//
Chandler Carruth75fbd372011-01-09 01:32:55 +00002005
Chris Lattner4a6fb942011-01-10 21:01:17 +00002006clang -O3 doesn't optimize this:
Chandler Carruth75fbd372011-01-09 01:32:55 +00002007
2008void f1(int* begin, int* end) {
2009 std::fill(begin, end, 0);
2010}
2011
Chris Lattner66d7a572011-01-09 23:48:41 +00002012into a memset. This is PR8942.
Chandler Carruth75fbd372011-01-09 01:32:55 +00002013
2014//===---------------------------------------------------------------------===//
Chandler Carruthd8723a92011-01-09 09:58:33 +00002015
2016clang -O3 -fno-exceptions currently compiles this code:
2017
2018void f(int N) {
2019 std::vector<int> v(N);
Chandler Carruth27a2a132011-01-09 10:10:59 +00002020
2021 extern void sink(void*); sink(&v);
Chandler Carruthd8723a92011-01-09 09:58:33 +00002022}
2023
2024into
2025
2026define void @_Z1fi(i32 %N) nounwind {
2027entry:
2028 %v2 = alloca [3 x i32*], align 8
2029 %v2.sub = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 0
2030 %tmpcast = bitcast [3 x i32*]* %v2 to %"class.std::vector"*
2031 %conv = sext i32 %N to i64
2032 store i32* null, i32** %v2.sub, align 8, !tbaa !0
2033 %tmp3.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 1
2034 store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
2035 %tmp4.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 2
2036 store i32* null, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
2037 %cmp.i.i.i.i = icmp eq i32 %N, 0
2038 br i1 %cmp.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i, label %cond.true.i.i.i.i
2039
2040_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i: ; preds = %entry
2041 store i32* null, i32** %v2.sub, align 8, !tbaa !0
2042 store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
2043 %add.ptr.i5.i.i = getelementptr inbounds i32* null, i64 %conv
2044 store i32* %add.ptr.i5.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
2045 br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit
2046
2047cond.true.i.i.i.i: ; preds = %entry
2048 %cmp.i.i.i.i.i = icmp slt i32 %N, 0
2049 br i1 %cmp.i.i.i.i.i, label %if.then.i.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i
2050
2051if.then.i.i.i.i.i: ; preds = %cond.true.i.i.i.i
2052 call void @_ZSt17__throw_bad_allocv() noreturn nounwind
2053 unreachable
2054
2055_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i: ; preds = %cond.true.i.i.i.i
2056 %mul.i.i.i.i.i = shl i64 %conv, 2
2057 %call3.i.i.i.i.i = call noalias i8* @_Znwm(i64 %mul.i.i.i.i.i) nounwind
2058 %0 = bitcast i8* %call3.i.i.i.i.i to i32*
2059 store i32* %0, i32** %v2.sub, align 8, !tbaa !0
2060 store i32* %0, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
2061 %add.ptr.i.i.i = getelementptr inbounds i32* %0, i64 %conv
2062 store i32* %add.ptr.i.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
2063 call void @llvm.memset.p0i8.i64(i8* %call3.i.i.i.i.i, i8 0, i64 %mul.i.i.i.i.i, i32 4, i1 false)
2064 br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit
2065
2066This is just the handling the construction of the vector. Most surprising here
Chris Lattnerb0daffc2011-01-16 06:39:44 +00002067is the fact that all three null stores in %entry are dead (because we do no
2068cross-block DSE).
2069
Chandler Carruthd8723a92011-01-09 09:58:33 +00002070Also surprising is that %conv isn't simplified to 0 in %....exit.thread.i.i.
Chris Lattnerb0daffc2011-01-16 06:39:44 +00002071This is a because the client of LazyValueInfo doesn't simplify all instruction
2072operands, just selected ones.
Chandler Carruthd8723a92011-01-09 09:58:33 +00002073
2074//===---------------------------------------------------------------------===//
Chandler Carruthe5ca4942011-01-09 09:58:36 +00002075
2076clang -O3 -fno-exceptions currently compiles this code:
2077
Chandler Carruthcad33c62011-01-16 01:40:23 +00002078void f(char* a, int n) {
2079 __builtin_memset(a, 0, n);
2080 for (int i = 0; i < n; ++i)
2081 a[i] = 0;
Chandler Carruthe5ca4942011-01-09 09:58:36 +00002082}
2083
Chandler Carruthcad33c62011-01-16 01:40:23 +00002084into:
Chandler Carruthe5ca4942011-01-09 09:58:36 +00002085
Chandler Carruthcad33c62011-01-16 01:40:23 +00002086define void @_Z1fPci(i8* nocapture %a, i32 %n) nounwind {
2087entry:
2088 %conv = sext i32 %n to i64
2089 tail call void @llvm.memset.p0i8.i64(i8* %a, i8 0, i64 %conv, i32 1, i1 false)
2090 %cmp8 = icmp sgt i32 %n, 0
2091 br i1 %cmp8, label %for.body.lr.ph, label %for.end
Chandler Carruthe5ca4942011-01-09 09:58:36 +00002092
Chandler Carruthcad33c62011-01-16 01:40:23 +00002093for.body.lr.ph: ; preds = %entry
2094 %tmp10 = add i32 %n, -1
2095 %tmp11 = zext i32 %tmp10 to i64
2096 %tmp12 = add i64 %tmp11, 1
2097 call void @llvm.memset.p0i8.i64(i8* %a, i8 0, i64 %tmp12, i32 1, i1 false)
2098 ret void
Chandler Carruthe5ca4942011-01-09 09:58:36 +00002099
Chandler Carruthcad33c62011-01-16 01:40:23 +00002100for.end: ; preds = %entry
2101 ret void
2102}
2103
2104This shouldn't need the ((zext (%n - 1)) + 1) game, and it should ideally fold
2105the two memset's together. The issue with %n seems to stem from poor handling
2106of the original loop.
Chandler Carruthe5ca4942011-01-09 09:58:36 +00002107
Chris Lattnerb0daffc2011-01-16 06:39:44 +00002108To simplify this, we need SCEV to know that "n != 0" because of the dominating
2109conditional. That would turn the second memset into a simple memset of 'n'.
2110
Chandler Carruthe5ca4942011-01-09 09:58:36 +00002111//===---------------------------------------------------------------------===//
Chandler Carruth694d7532011-01-09 11:29:57 +00002112
2113clang -O3 -fno-exceptions currently compiles this code:
2114
2115struct S {
2116 unsigned short m1, m2;
2117 unsigned char m3, m4;
2118};
2119
2120void f(int N) {
2121 std::vector<S> v(N);
2122 extern void sink(void*); sink(&v);
2123}
2124
2125into poor code for zero-initializing 'v' when N is >0. The problem is that
2126S is only 6 bytes, but each element is 8 byte-aligned. We generate a loop and
21274 stores on each iteration. If the struct were 8 bytes, this gets turned into
2128a memset.
2129
Chris Lattnerb0daffc2011-01-16 06:39:44 +00002130In order to handle this we have to:
2131 A) Teach clang to generate metadata for memsets of structs that have holes in
2132 them.
2133 B) Teach clang to use such a memset for zero init of this struct (since it has
2134 a hole), instead of doing elementwise zeroing.
2135
Chandler Carruth694d7532011-01-09 11:29:57 +00002136//===---------------------------------------------------------------------===//
Chandler Carruth96b1b6c2011-01-09 21:00:19 +00002137
2138clang -O3 currently compiles this code:
2139
2140extern const int magic;
2141double f() { return 0.0 * magic; }
2142
2143into
2144
2145@magic = external constant i32
2146
2147define double @_Z1fv() nounwind readnone {
2148entry:
2149 %tmp = load i32* @magic, align 4, !tbaa !0
2150 %conv = sitofp i32 %tmp to double
2151 %mul = fmul double %conv, 0.000000e+00
2152 ret double %mul
2153}
2154
Chris Lattner00a35d02011-01-10 00:33:01 +00002155We should be able to fold away this fmul to 0.0. More generally, fmul(x,0.0)
2156can be folded to 0.0 if we can prove that the LHS is not -0.0, not a NaN, and
2157not an INF. The CannotBeNegativeZero predicate in value tracking should be
2158extended to support general "fpclassify" operations that can return
2159yes/no/unknown for each of these predicates.
2160
Chris Lattner4a6fb942011-01-10 21:01:17 +00002161In this predicate, we know that uitofp is trivially never NaN or -0.0, and
Chris Lattner00a35d02011-01-10 00:33:01 +00002162we know that it isn't +/-Inf if the floating point type has enough exponent bits
2163to represent the largest integer value as < inf.
Chandler Carruth96b1b6c2011-01-09 21:00:19 +00002164
2165//===---------------------------------------------------------------------===//
Chandler Carruthfb00e272011-01-09 22:36:18 +00002166
Chris Lattner4a6fb942011-01-10 21:01:17 +00002167When optimizing a transformation that can change the sign of 0.0 (such as the
21680.0*val -> 0.0 transformation above), it might be provable that the sign of the
2169expression doesn't matter. For example, by the above rules, we can't transform
2170fmul(sitofp(x), 0.0) into 0.0, because x might be -1 and the result of the
2171expression is defined to be -0.0.
2172
2173If we look at the uses of the fmul for example, we might be able to prove that
2174all uses don't care about the sign of zero. For example, if we have:
2175
2176 fadd(fmul(sitofp(x), 0.0), 2.0)
2177
2178Since we know that x+2.0 doesn't care about the sign of any zeros in X, we can
2179transform the fmul to 0.0, and then the fadd to 2.0.
2180
2181//===---------------------------------------------------------------------===//
Chris Lattner4cd18f92011-01-13 22:08:15 +00002182
2183We should enhance memcpy/memcpy/memset to allow a metadata node on them
2184indicating that some bytes of the transfer are undefined. This is useful for
Chris Lattner4c5456a2011-01-13 22:11:56 +00002185frontends like clang when lowering struct copies, when some elements of the
Chris Lattner4cd18f92011-01-13 22:08:15 +00002186struct are undefined. Consider something like this:
2187
2188struct x {
2189 char a;
2190 int b[4];
2191};
2192void foo(struct x*P);
2193struct x testfunc() {
2194 struct x V1, V2;
2195 foo(&V1);
2196 V2 = V1;
2197
2198 return V2;
2199}
2200
2201We currently compile this to:
2202$ clang t.c -S -o - -O0 -emit-llvm | opt -scalarrepl -S
2203
2204
2205%struct.x = type { i8, [4 x i32] }
2206
2207define void @testfunc(%struct.x* sret %agg.result) nounwind ssp {
2208entry:
2209 %V1 = alloca %struct.x, align 4
2210 call void @foo(%struct.x* %V1)
2211 %tmp1 = bitcast %struct.x* %V1 to i8*
2212 %0 = bitcast %struct.x* %V1 to i160*
2213 %srcval1 = load i160* %0, align 4
2214 %tmp2 = bitcast %struct.x* %agg.result to i8*
2215 %1 = bitcast %struct.x* %agg.result to i160*
2216 store i160 %srcval1, i160* %1, align 4
2217 ret void
2218}
2219
2220This happens because SRoA sees that the temp alloca has is being memcpy'd into
2221and out of and it has holes and it has to be conservative. If we knew about the
2222holes, then this could be much much better.
2223
2224Having information about these holes would also improve memcpy (etc) lowering at
2225llc time when it gets inlined, because we can use smaller transfers. This also
2226avoids partial register stalls in some important cases.
2227
2228//===---------------------------------------------------------------------===//
Chris Lattner5653f1f2011-02-16 19:16:34 +00002229
Chris Lattnercb401952011-02-17 01:43:46 +00002230We don't fold (icmp (add) (add)) unless the two adds only have a single use.
2231There are a lot of cases that we're refusing to fold in (e.g.) 256.bzip2, for
2232example:
2233
2234 %indvar.next90 = add i64 %indvar89, 1 ;; Has 2 uses
2235 %tmp96 = add i64 %tmp95, 1 ;; Has 1 use
2236 %exitcond97 = icmp eq i64 %indvar.next90, %tmp96
2237
2238We don't fold this because we don't want to introduce an overlapped live range
2239of the ivar. However if we can make this more aggressive without causing
2240performance issues in two ways:
2241
22421. If *either* the LHS or RHS has a single use, we can definitely do the
2243 transformation. In the overlapping liverange case we're trading one register
2244 use for one fewer operation, which is a reasonable trade. Before doing this
2245 we should verify that the llc output actually shrinks for some benchmarks.
22462. If both ops have multiple uses, we can still fold it if the operations are
2247 both sinkable to *after* the icmp (e.g. in a subsequent block) which doesn't
2248 increase register pressure.
2249
2250There are a ton of icmp's we aren't simplifying because of the reg pressure
2251concern. Care is warranted here though because many of these are induction
2252variables and other cases that matter a lot to performance, like the above.
2253Here's a blob of code that you can drop into the bottom of visitICmp to see some
2254missed cases:
2255
2256 { Value *A, *B, *C, *D;
2257 if (match(Op0, m_Add(m_Value(A), m_Value(B))) &&
2258 match(Op1, m_Add(m_Value(C), m_Value(D))) &&
2259 (A == C || A == D || B == C || B == D)) {
2260 errs() << "OP0 = " << *Op0 << " U=" << Op0->getNumUses() << "\n";
2261 errs() << "OP1 = " << *Op1 << " U=" << Op1->getNumUses() << "\n";
2262 errs() << "CMP = " << I << "\n\n";
2263 }
2264 }
2265
2266//===---------------------------------------------------------------------===//
2267
2268