//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

We should add support for the "movbe" instruction, which does a byte-swapping
copy (3-addr bswap + memory support?) This is available on Atom processors.
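
As a hedged illustration (a made-up helper, not code from the backend), the
kind of source pattern that could become a single movbe load is:

unsigned int load32_be(const unsigned int *p) {
  return __builtin_bswap32(*p);   /* byte-swapping load: a movbe candidate */
}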

//===---------------------------------------------------------------------===//

CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move. The X86
backend knows how to three-addressify this shift, but it appears the register
allocator isn't even asking it to do so in this case. We should investigate
why this isn't happening, it could have significant impact on other important
cases for X86 as well.

//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
 return X/Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224

//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
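
For reference, a hedged example of the kind of strength reduction involved
(the exact replacement sequences are whatever the algorithm chooses):

int mul9(int x) {
  return x * 9;   /* can become leal (%eax,%eax,8), %eax instead of an imull */
}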

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.

One better solution for 1LL << x is:
 xorl %eax, %eax
 xorl %edx, %edx
 testb $32, %cl
 sete %al
 setne %dl
 sall %cl, %eax
 sall %cl, %edx

But that requires good 8-bit subreg support.

Also, this might be better. It's an extra shift, but it's one instruction
shorter, and doesn't stress 8-bit subreg support.
(From http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html,
but without the unnecessary and.)
 movl %ecx, %eax
 shrl $5, %eax
 movl %eax, %edx
 xorl $1, %edx
 sall %cl, %eax
 sall %cl, %edx

64-bit shifts (in general) expand to really bad code. Instead of using
cmovs, we should expand to a conditional branch like GCC produces.

//===---------------------------------------------------------------------===//

Compile this:
_Bool f(_Bool a) { return a!=1; }

into:
 movzbl %dil, %eax
 xorl $1, %eax
 ret

(Although note that this isn't a legal way to express the code that llvm-gcc
currently generates for that function.)

//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
 issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
 Sequencing of Instructions".
4. Scheduling for reduced register pressure. E.g. "Minimum Register
 Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
 and other related papers.
 http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?
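
A minimal made-up illustration of the concern: a 16-bit operation only writes
the low half of the register, creating a partial-register dependency on some
implementations, whereas promoting to i32 writes the whole register.

unsigned short add16(unsigned short a, unsigned short b) {
  return a + b;   /* addw writes only %ax; a 32-bit addl plus truncate avoids
                     the partial register update */
}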

//===---------------------------------------------------------------------===//

Leave any_extend as pseudo instruction and hint to register
allocator. Delay codegen until post register allocation.
Note. any_extend is now turned into an INSERT_SUBREG. We still need to teach
the coalescer how to deal with it though.

//===---------------------------------------------------------------------===//

It appears icc uses push for parameter passing. Need to investigate.
Chris Lattner1db4b4f2006-01-16 17:53:00 +0000109
110//===---------------------------------------------------------------------===//
111
112Only use inc/neg/not instructions on processors where they are faster than
113add/sub/xor. They are slower on the P4 due to only updating some processor
114flags.
115
116//===---------------------------------------------------------------------===//
117
Chris Lattnerb638cd82006-01-29 09:08:15 +0000118The instruction selector sometimes misses folding a load into a compare. The
119pattern is written as (cmp reg, (load p)). Because the compare isn't
120commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
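
A hedged, hypothetical example of the pattern: here the load naturally lands
on the LHS of the compare, so matching (cmp reg, (load p)) requires swapping
the operands and inverting the condition.

int cmp_load(int *p, int x) {
  return *p < x;   /* equivalent to x > *p, which puts the load on the RHS */
}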

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
 movl 8(%esp), %eax
 cmpl %eax, 4(%esp)
 setl %al
 movzbl %al, %eax
 ret

on some processors (which ones?), it is more efficient to do this:

_test:
 movl 8(%esp), %ebx
 xor %eax, %eax
 cmpl %ebx, 4(%esp)
 setl %al
 ret

Doing this correctly is tricky though, as the xor clobbers the flags.

//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:

void setbit(int *target, int bit) {
 *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
 *target &= ~(1 << bit);
}

//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

 movl $16843009, 4(%edx)
 movl $16843009, (%edx)
 movw $257, 8(%edx)

It might be better to generate

 movl $16843009, %eax
 movl %eax, 4(%edx)
 movl %eax, (%edx)
 movw %ax, 8(%edx)

when we can spare a register. It reduces code size.

//===---------------------------------------------------------------------===//

Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
get this:

define i32 @test1(i32 %X) {
 %Y = sdiv i32 %X, 8
 ret i32 %Y
}

_test1:
 movl 4(%esp), %eax
 movl %eax, %ecx
 sarl $31, %ecx
 shrl $29, %ecx
 addl %ecx, %eax
 sarl $3, %eax
 ret

GCC knows several different ways to codegen it, one of which is this:

_test1:
 movl 4(%esp), %eax
 cmpl $-1, %eax
 leal 7(%eax), %ecx
 cmovle %ecx, %eax
 sarl $3, %eax
 ret

which is probably slower, but it's interesting at least :)

//===---------------------------------------------------------------------===//

We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and rep/movsl.
We should leave these as libcalls for everything over a much lower threshold,
since libc is hand tuned for medium and large mem ops (avoiding RFO for large
stores, TLB preheating, etc.).
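
A hedged example of the kind of call in question (the size and name are made
up): a large fixed-size copy that currently lowers to rep/movsl but would
likely be better left as a call to libc's memcpy.

#include <string.h>

void copy_1mb(char *dst, const char *src) {
  memcpy(dst, src, 1 << 20);   /* 1MB fixed-size copy: prefer the libcall */
}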

//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
 x * copysign(1.0, y) * copysign(1.0, z)
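
One plausible lowering, sketched here under the assumption of IEEE-754 doubles
(not necessarily what the backend should literally emit): the two copysigns
only contribute a sign, so the expression reduces to flipping x's sign bit by
sign(y) ^ sign(z) using integer operations.

#include <stdint.h>
#include <string.h>

double mul_by_signs(double x, double y, double z) {
  uint64_t xb, yb, zb;
  memcpy(&xb, &x, sizeof xb);
  memcpy(&yb, &y, sizeof yb);
  memcpy(&zb, &z, sizeof zb);
  xb ^= (yb ^ zb) & 0x8000000000000000ULL;  /* flip sign(x) by sign(y)^sign(z) */
  memcpy(&x, &xb, sizeof x);
  return x;
}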

//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.

//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.

//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
 if (_mm_comige_ss(*A, *B))
 return 3;
 else
 return 4;
}

_test:
 movl 8(%esp), %eax
 movaps (%eax), %xmm0
 movl 4(%esp), %eax
 movaps (%eax), %xmm1
 comiss %xmm0, %xmm1
 setae %al
 movzbl %al, %ecx
 movl $3, %eax
 movl $4, %edx
 cmpl $0, %ecx
 cmove %edx, %eax
 ret

Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There
are a number of issues. 1) We are introducing a setcc between the result of the
intrinsic call and the select. 2) The intrinsic is expected to produce an i32
value, so an any_extend (which becomes a zero extend) is added.

We probably need some kind of target DAG combine hook to fix this.

//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case we do worse on PPC.

//===---------------------------------------------------------------------===//

For this:

int test(int a)
{
 return a * 3;
}

We currently emit:
 imull $3, 4(%esp), %eax

Perhaps this is what we really should generate? Is imull three or four
cycles? Note: ICC generates this:
 movl 4(%esp), %eax
 leal (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity. The former is
more "complex" because it folds a load so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
should always try to match LEA first since the LEA matching code does some
estimate to determine whether the match is profitable.

However, if we care more about code size, then imull is better. It's two bytes
shorter than movl + leal.

On a Pentium M, both variants have the same characteristics with regard
to throughput; however, the multiplication has a latency of four cycles, as
opposed to two cycles for the movl+lea variant.

//===---------------------------------------------------------------------===//

__builtin_ffs codegen is messy.

int ffs_(unsigned X) { return __builtin_ffs(X); }

llvm produces:
ffs_:
 movl 4(%esp), %ecx
 bsfl %ecx, %eax
 movl $32, %edx
 cmove %edx, %eax
 incl %eax
 xorl %edx, %edx
 testl %ecx, %ecx
 cmove %edx, %eax
 ret

vs gcc:

_ffs_:
 movl $-1, %edx
 bsfl 4(%esp), %eax
 cmove %edx, %eax
 addl $1, %eax
 ret

Another example of __builtin_ffs (use predsimplify to eliminate a select):

int foo (unsigned long j) {
 if (j)
 return __builtin_ffs (j) - 1;
 else
 return 0;
}

//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h, there are other Darwin assembler directives that we
do not make use of.

//===---------------------------------------------------------------------===//

define i32 @foo(i32* %a, i32 %t) {
entry:
 br label %cond_true

cond_true: ; preds = %cond_true, %entry
 %x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ] ; <i32> [#uses=3]
 %t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ] ; <i32> [#uses=1]
 %tmp2 = getelementptr i32* %a, i32 %x.0.0 ; <i32*> [#uses=1]
 %tmp3 = load i32* %tmp2 ; <i32> [#uses=1]
 %tmp5 = add i32 %t_addr.0.0, %x.0.0 ; <i32> [#uses=1]
 %tmp7 = add i32 %tmp5, %tmp3 ; <i32> [#uses=2]
 %tmp9 = add i32 %x.0.0, 1 ; <i32> [#uses=2]
 %tmp = icmp sgt i32 %tmp9, 39 ; <i1> [#uses=1]
 br i1 %tmp, label %bb12, label %cond_true

bb12: ; preds = %cond_true
 ret i32 %tmp7
}
is pessimized by -loop-reduce and -indvars

//===---------------------------------------------------------------------===//

u32 to float conversion improvement:

float uint32_2_float( unsigned u ) {
 float fl = (int) (u & 0xffff);
 float fh = (int) (u >> 16);
 fh *= 0x1.0p16f;
 return fh + fl;
}

00000000 subl $0x04,%esp
00000003 movl 0x08(%esp,1),%eax
00000007 movl %eax,%ecx
00000009 shrl $0x10,%ecx
0000000c cvtsi2ss %ecx,%xmm0
00000010 andl $0x0000ffff,%eax
00000015 cvtsi2ss %eax,%xmm1
00000019 mulss 0x00000078,%xmm0
00000021 addss %xmm1,%xmm0
00000025 movss %xmm0,(%esp,1)
0000002a flds (%esp,1)
0000002d addl $0x04,%esp
00000030 ret

//===---------------------------------------------------------------------===//

When using fastcc abi, align stack slot of argument of type double on 8 byte
boundary to improve performance.

//===---------------------------------------------------------------------===//

Codegen:

int f(int a, int b) {
 if (a == 4 || a == 6)
 b++;
 return b;
}


as:

or eax, 2
cmp eax, 6
jz label

//===---------------------------------------------------------------------===//

GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
simplifications for integer "x cmp y ? a : b". For example, instead of:

int G;
void f(int X, int Y) {
 G = X < 0 ? 14 : 13;
}

compiling to:

_f:
 movl $14, %eax
 movl $13, %ecx
 movl 4(%esp), %edx
 testl %edx, %edx
 cmovl %eax, %ecx
 movl %ecx, _G
 ret

it could be:
_f:
 movl 4(%esp), %eax
 sarl $31, %eax
 notl %eax
 addl $14, %eax
 movl %eax, _G
 ret

etc.

Another is:
int usesbb(unsigned int a, unsigned int b) {
 return (a < b ? -1 : 0);
}
to:
_usesbb:
 movl 8(%esp), %eax
 cmpl %eax, 4(%esp)
 sbbl %eax, %eax
 ret

instead of:
_usesbb:
 xorl %eax, %eax
 movl 8(%esp), %ecx
 cmpl %ecx, 4(%esp)
 movl $4294967295, %ecx
 cmovb %ecx, %eax
 ret

//===---------------------------------------------------------------------===//

Consider the expansion of:

define i32 @test3(i32 %X) {
 %tmp1 = urem i32 %X, 255
 ret i32 %tmp1
}

Currently it compiles to:

...
 movl $2155905153, %ecx
 movl 8(%esp), %esi
 movl %esi, %eax
 mull %ecx
...

This could be "reassociated" into:

 movl $2155905153, %eax
 movl 8(%esp), %ecx
 mull %ecx

to avoid the copy. In fact, the existing two-address stuff would do this
except that mul isn't a commutative 2-addr instruction. I guess this has
to be done at isel time based on the #uses to mul?

//===---------------------------------------------------------------------===//

Make sure the instruction which starts a loop does not cross a cacheline
boundary. This requires knowing the exact length of each machine instruction.
That is somewhat complicated, but doable. Example 256.bzip2:

In the new trace, the hot loop has an instruction which crosses a cacheline
boundary. In addition to potential cache misses, this can't help decoding as I
imagine there has to be some kind of complicated decoder reset and realignment
to grab the bytes from the next cacheline.

532 532 0x3cfc movb (1809(%esp, %esi), %bl <<<--- spans 2 64 byte lines
942 942 0x3d03 movl %dh, (1809(%esp, %esi)
937 937 0x3d0a incl %esi
3 3 0x3d0b cmpb %bl, %dl
27 27 0x3d0d jnz 0x000062db <main+11707>

//===---------------------------------------------------------------------===//

In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.

//===---------------------------------------------------------------------===//

This could be a single 16-bit load.

int f(char *p) {
 if ((p[0] == 1) & (p[1] == 2)) return 1;
 return 0;
}

//===---------------------------------------------------------------------===//

We should inline lrintf and probably other libc functions.
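
For instance (a hedged sketch): both lrintf and cvtss2si honor the current
rounding mode, so this should inline to a single conversion instead of a
libcall.

#include <math.h>

long round_to_long(float x) {
  return lrintf(x);   /* ideally a single cvtss2si, not a call to lrintf */
}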

//===---------------------------------------------------------------------===//

Use the FLAGS values from arithmetic instructions more. For example, compile:

int add_zf(int *x, int y, int a, int b) {
 if ((*x += y) == 0)
 return a;
 else
 return b;
}

to:
 addl %esi, (%rdi)
 movl %edx, %eax
 cmovne %ecx, %eax
 ret
instead of:

_add_zf:
 addl (%rdi), %esi
 movl %esi, (%rdi)
 testl %esi, %esi
 cmove %edx, %ecx
 movl %ecx, %eax
 ret

As another example, compile function f2 in test/CodeGen/X86/cmp-test.ll
without a test instruction.

//===---------------------------------------------------------------------===//

These two functions have identical effects:

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}

We currently compile them to:

_f:
 movl 4(%esp), %eax
 movl %eax, %ecx
 incl %ecx
 movl 8(%esp), %edx
 cmpl %edx, %ecx
 jne LBB1_2 #UnifiedReturnBlock
LBB1_1: #cond_true
 addl $2, %eax
 ret
LBB1_2: #UnifiedReturnBlock
 movl %ecx, %eax
 ret
_f2:
 movl 4(%esp), %eax
 movl %eax, %ecx
 incl %ecx
 cmpl 8(%esp), %ecx
 sete %cl
 movzbl %cl, %ecx
 leal 1(%ecx,%eax), %eax
 ret

both of which are inferior to GCC's:

_f:
 movl 4(%esp), %edx
 leal 1(%edx), %eax
 addl $2, %edx
 cmpl 8(%esp), %eax
 cmove %edx, %eax
 ret
_f2:
 movl 4(%esp), %eax
 addl $1, %eax
 xorl %edx, %edx
 cmpl 8(%esp), %eax
 sete %dl
 addl %edx, %eax
 ret

//===---------------------------------------------------------------------===//

This code:

void test(int X) {
 if (X) abort();
}

is currently compiled to:

_test:
 subl $12, %esp
 cmpl $0, 16(%esp)
 jne LBB1_1
 addl $12, %esp
 ret
LBB1_1:
 call L_abort$stub

It would be better to produce:

_test:
 subl $12, %esp
 cmpl $0, 16(%esp)
 jne L_abort$stub
 addl $12, %esp
 ret

This can be applied to any no-return function call that takes no arguments etc.
Alternatively, the stack save/restore logic could be shrink-wrapped, producing
something like this:

_test:
 cmpl $0, 4(%esp)
 jne LBB1_1
 ret
LBB1_1:
 subl $12, %esp
 call L_abort$stub

Both are useful in different situations. Finally, it could be shrink-wrapped
and tail called, like this:

_test:
 cmpl $0, 4(%esp)
 jne LBB1_1
 ret
LBB1_1:
 pop %eax # realign stack.
 call L_abort$stub

Though this probably isn't worth it.

//===---------------------------------------------------------------------===//

Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
a neg instead of a sub instruction. Consider:

int test(char X) { return 7-X; }

we currently produce:
_test:
 movl $7, %eax
 movsbl 4(%esp), %ecx
 subl %ecx, %eax
 ret

We would use one fewer register if codegen'd as:

 movsbl 4(%esp), %eax
 neg %eax
 add $7, %eax
 ret

Note that this isn't beneficial if the load can be folded into the sub. In
this case, we want a sub:

int test(int X) { return 7-X; }
_test:
 movl $7, %eax
 subl 4(%esp), %eax
 ret

//===---------------------------------------------------------------------===//

Leaf functions that require one 4-byte spill slot have a prolog like this:

_foo:
 pushl %esi
 subl $4, %esp
...
and an epilog like this:
 addl $4, %esp
 popl %esi
 ret

It would be smaller, and potentially faster, to push eax on entry and to
pop into a dummy register instead of using addl/subl of esp. Just don't pop
into any return registers :)

//===---------------------------------------------------------------------===//

The X86 backend should fold (branch (or (setcc, setcc))) into multiple
branches. We generate really poor code for:

double testf(double a) {
 return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

For example, the entry BB is:

_testf:
 subl $20, %esp
 pxor %xmm0, %xmm0
 movsd 24(%esp), %xmm1
 ucomisd %xmm0, %xmm1
 setnp %al
 sete %cl
 testb %cl, %al
 jne LBB1_5 # UnifiedReturnBlock
LBB1_1: # cond_true


it would be better to replace the last four instructions with:

 jp LBB1_1
 je LBB1_5
LBB1_1:

We also codegen the inner ?: into a diamond:

 cvtss2sd LCPI1_0(%rip), %xmm2
 cvtss2sd LCPI1_1(%rip), %xmm3
 ucomisd %xmm1, %xmm0
 ja LBB1_3 # cond_true
LBB1_2: # cond_true
 movapd %xmm3, %xmm2
LBB1_3: # cond_true
 movapd %xmm2, %xmm0
 ret

We should sink the load into xmm3 into the LBB1_2 block. This should
be pretty easy, and will nuke all the copies.

//===---------------------------------------------------------------------===//

This:
 #include <algorithm>
 inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
 { return std::make_pair(a + b, a + b < a); }
 bool no_overflow(unsigned a, unsigned b)
 { return !full_add(a, b).second; }

Should compile to:


 _Z11no_overflowjj:
 addl %edi, %esi
 setae %al
 ret

FIXME: That code looks wrong; bool return is normally defined as zext.

on x86-64, not:

__Z11no_overflowjj:
 addl %edi, %esi
 cmpl %edi, %esi
 setae %al
 movzbl %al, %eax
 ret


//===---------------------------------------------------------------------===//

The following code:

bb114.preheader: ; preds = %cond_next94
 %tmp231232 = sext i16 %tmp62 to i32 ; <i32> [#uses=1]
 %tmp233 = sub i32 32, %tmp231232 ; <i32> [#uses=1]
 %tmp245246 = sext i16 %tmp65 to i32 ; <i32> [#uses=1]
 %tmp252253 = sext i16 %tmp68 to i32 ; <i32> [#uses=1]
 %tmp254 = sub i32 32, %tmp252253 ; <i32> [#uses=1]
 %tmp553554 = bitcast i16* %tmp37 to i8* ; <i8*> [#uses=2]
 %tmp583584 = sext i16 %tmp98 to i32 ; <i32> [#uses=1]
 %tmp585 = sub i32 32, %tmp583584 ; <i32> [#uses=1]
 %tmp614615 = sext i16 %tmp101 to i32 ; <i32> [#uses=1]
 %tmp621622 = sext i16 %tmp104 to i32 ; <i32> [#uses=1]
 %tmp623 = sub i32 32, %tmp621622 ; <i32> [#uses=1]
 br label %bb114

produces:

LBB3_5: # bb114.preheader
 movswl -68(%ebp), %eax
 movl $32, %ecx
 movl %ecx, -80(%ebp)
 subl %eax, -80(%ebp)
 movswl -52(%ebp), %eax
 movl %ecx, -84(%ebp)
 subl %eax, -84(%ebp)
 movswl -70(%ebp), %eax
 movl %ecx, -88(%ebp)
 subl %eax, -88(%ebp)
 movswl -50(%ebp), %eax
 subl %eax, %ecx
 movl %ecx, -76(%ebp)
 movswl -42(%ebp), %eax
 movl %eax, -92(%ebp)
 movswl -66(%ebp), %eax
 movl %eax, -96(%ebp)
 movw $0, -98(%ebp)

This appears to be bad because the RA is not folding the store to the stack
slot into the movl. The above instructions could be:
 movl $32, -80(%ebp)
...
 movl $32, -84(%ebp)
...
This seems like a cross between remat and spill folding.

This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't
change, so we could simply subtract %eax from %ecx first and then use %ecx (or
vice-versa).

//===---------------------------------------------------------------------===//

This code:

 %tmp659 = icmp slt i16 %tmp654, 0 ; <i1> [#uses=1]
 br i1 %tmp659, label %cond_true662, label %cond_next715

produces this:

 testw %cx, %cx
 movswl %cx, %esi
 jns LBB4_109 # cond_next715

Shark tells us that using %cx in the testw instruction is sub-optimal. It
suggests using the 32-bit register (which is what ICC uses).

//===---------------------------------------------------------------------===//

We compile this:

void compare (long long foo) {
 if (foo < 4294967297LL)
 abort();
}

to:

compare:
 subl $4, %esp
 cmpl $0, 8(%esp)
 setne %al
 movzbw %al, %ax
 cmpl $1, 12(%esp)
 setg %cl
 movzbw %cl, %cx
 cmove %ax, %cx
 testb $1, %cl
 jne .LBB1_2 # UnifiedReturnBlock
.LBB1_1: # ifthen
 call abort
.LBB1_2: # UnifiedReturnBlock
 addl $4, %esp
 ret

(also really horrible code on ppc). This is due to the expand code for 64-bit
compares. GCC produces multiple branches, which is much nicer:

compare:
 subl $12, %esp
 movl 20(%esp), %edx
 movl 16(%esp), %eax
 decl %edx
 jle .L7
.L5:
 addl $12, %esp
 ret
 .p2align 4,,7
.L7:
 jl .L4
 cmpl $0, %eax
 .p2align 4,,8
 ja .L5
.L4:
 .p2align 4,,9
 call abort

//===---------------------------------------------------------------------===//

Tail call optimization improvements: Tail call optimization currently
pushes all arguments on the top of the stack (their normal place for
non-tail-call-optimized calls) that source from the caller's arguments
or that source from a virtual register (also possibly sourcing from
the caller's arguments).
This is done to prevent overwriting of parameters (see example
below) that might be used later.

example:

int callee(int32, int64);
int caller(int32 arg1, int32 arg2) {
 int64 local = arg2 * 2;
 return callee(arg2, (int64)local);
}

[arg1] [!arg2 no longer valid since we moved local onto it]
[arg2] -> [(int64)
[RETADDR] local ]

Moving arg1 onto the stack slot of callee function would overwrite
arg2 of the caller.

Possible optimizations:


 - Analyse the actual parameters of the callee to see which would
 overwrite a caller parameter which is used by the callee and only
 push them onto the top of the stack.

 int callee (int32 arg1, int32 arg2);
 int caller (int32 arg1, int32 arg2) {
 return callee(arg1,arg2);
 }

 Here we don't need to write any variables to the top of the stack
 since they don't overwrite each other.

 int callee (int32 arg1, int32 arg2);
 int caller (int32 arg1, int32 arg2) {
 return callee(arg2,arg1);
 }

 Here we need to push the arguments because they overwrite each
 other.

//===---------------------------------------------------------------------===//

main ()
{
 int i = 0;
 unsigned long int z = 0;

 do {
 z -= 0x00004000;
 i++;
 if (i > 0x00040000)
 abort ();
 } while (z > 0);
 exit (0);
}

gcc compiles this to:

_main:
 subl $28, %esp
 xorl %eax, %eax
 jmp L2
L3:
 cmpl $262144, %eax
 je L10
L2:
 addl $1, %eax
 cmpl $262145, %eax
 jne L3
 call L_abort$stub
L10:
 movl $0, (%esp)
 call L_exit$stub

llvm:

_main:
 subl $12, %esp
 movl $1, %eax
 movl $16384, %ecx
LBB1_1: # bb
 cmpl $262145, %eax
 jge LBB1_4 # cond_true
LBB1_2: # cond_next
 incl %eax
 addl $4294950912, %ecx
 cmpl $16384, %ecx
 jne LBB1_1 # bb
LBB1_3: # bb11
 xorl %eax, %eax
 addl $12, %esp
 ret
LBB1_4: # cond_true
 call L_abort$stub

1. LSR should rewrite the first cmp with induction variable %ecx.
2. DAG combiner should fold
 leal 1(%eax), %edx
 cmpl $262145, %edx
 =>
 cmpl $262144, %eax

//===---------------------------------------------------------------------===//

define i64 @test(double %X) {
 %Y = fptosi double %X to i64
 ret i64 %Y
}

compiles to:

_test:
 subl $20, %esp
 movsd 24(%esp), %xmm0
 movsd %xmm0, 8(%esp)
 fldl 8(%esp)
 fisttpll (%esp)
 movl 4(%esp), %edx
 movl (%esp), %eax
 addl $20, %esp
 #FP_REG_KILL
 ret

This should just fldl directly from the input stack slot.

//===---------------------------------------------------------------------===//

This code:
int foo (int x) { return (x & 65535) | 255; }

Should compile into:

_foo:
 movzwl 4(%esp), %eax
 orl $255, %eax
 ret

instead of:
_foo:
 movl $255, %eax
 orl 4(%esp), %eax
 andl $65535, %eax
 ret

//===---------------------------------------------------------------------===//

We're codegen'ing multiply of long longs inefficiently:

unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) {
 return arg1 * arg2;
}

We compile to (fomit-frame-pointer):

_LLM:
 pushl %esi
 movl 8(%esp), %ecx
 movl 16(%esp), %esi
 movl %esi, %eax
 mull %ecx
 imull 12(%esp), %esi
 addl %edx, %esi
 imull 20(%esp), %ecx
 movl %esi, %edx
 addl %ecx, %edx
 popl %esi
 ret

This looks like a scheduling deficiency and lack of remat of the load from
the argument area. ICC apparently produces:

 movl 8(%esp), %ecx
 imull 12(%esp), %ecx
 movl 16(%esp), %eax
 imull 4(%esp), %eax
 addl %eax, %ecx
 movl 4(%esp), %eax
 mull 12(%esp)
 addl %ecx, %edx
 ret

Note that it remat'd loads from 4(esp) and 12(esp). See this GCC PR:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236

//===---------------------------------------------------------------------===//

We can fold a store into "zeroing a reg". Instead of:

xorl %eax, %eax
movl %eax, 124(%esp)

we should get:

movl $0, 124(%esp)

if the flags of the xor are dead.

Likewise, we isel "x<<1" into "add reg,reg". If reg is spilled, this should
be folded into: shl [mem], 1

//===---------------------------------------------------------------------===//

In SSE mode, we turn abs and neg into a load from the constant pool plus a xor
or and instruction, for example:

 xorpd LCPI1_0, %xmm2

However, if xmm2 gets spilled, we end up with really ugly code like this:

 movsd (%esp), %xmm0
 xorpd LCPI1_0, %xmm0
 movsd %xmm0, (%esp)

Since we 'know' that this is a 'neg', we can actually "fold" the spill into
the neg/abs instruction, turning it into an *integer* operation, like this:

 xorl 2147483648, [mem+4] ## 2147483648 = (1 << 31)

you could also use xorb, but xorl is less likely to lead to a partial register
stall. Here is a contrived testcase:

double a, b, c;
void test(double *P) {
 double X = *P;
 a = X;
 bar();
 X = -X;
 b = X;
 bar();
 c = X;
}

//===---------------------------------------------------------------------===//

The generated code on x86 for checking for signed overflow on a multiply the
obvious way is much longer than it needs to be.

int x(int a, int b) {
 long long prod = (long long)a*b;
 return prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1);
}

See PR2053 for more details.

//===---------------------------------------------------------------------===//

We should investigate using cdq/cltd (effect: edx = sar eax, 31)
more aggressively; it should cost the same as a move+shift on any modern
processor, but it's a lot shorter. Downside is that it puts more
pressure on register allocation because it has fixed operands.

Example:
int abs(int x) {return x < 0 ? -x : x;}

gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.:
abs:
 movl 4(%esp), %eax
 cltd
 xorl %edx, %eax
 subl %edx, %eax
 ret

//===---------------------------------------------------------------------===//

Consider:
int test(unsigned long a, unsigned long b) { return -(a < b); }

We currently compile this to:

define i32 @test(i32 %a, i32 %b) nounwind {
 %tmp3 = icmp ult i32 %a, %b ; <i1> [#uses=1]
 %tmp34 = zext i1 %tmp3 to i32 ; <i32> [#uses=1]
 %tmp5 = sub i32 0, %tmp34 ; <i32> [#uses=1]
 ret i32 %tmp5
}

and

_test:
 movl 8(%esp), %eax
 cmpl %eax, 4(%esp)
 setb %al
 movzbl %al, %eax
 negl %eax
 ret

Several deficiencies here. First, we should instcombine zext+neg into sext:

define i32 @test2(i32 %a, i32 %b) nounwind {
 %tmp3 = icmp ult i32 %a, %b ; <i1> [#uses=1]
 %tmp34 = sext i1 %tmp3 to i32 ; <i32> [#uses=1]
 ret i32 %tmp34
}

However, before we can do that, we have to fix the bad codegen that we get for
sext from bool:

_test2:
 movl 8(%esp), %eax
 cmpl %eax, 4(%esp)
 setb %al
 movzbl %al, %eax
 shll $31, %eax
 sarl $31, %eax
 ret

This code should be at least as good as the code above. Once this is fixed, we
can optimize this specific case even more to:

 movl 8(%esp), %eax
 xorl %ecx, %ecx
 cmpl %eax, 4(%esp)
 sbbl %ecx, %ecx

//===---------------------------------------------------------------------===//

Take the following code (from
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541):

extern unsigned char first_one[65536];
int FirstOnet(unsigned long long arg1)
{
 if (arg1 >> 48)
 return (first_one[arg1 >> 48]);
 return 0;
}


The following code is currently generated:
FirstOnet:
 movl 8(%esp), %eax
 cmpl $65536, %eax
 movl 4(%esp), %ecx
 jb .LBB1_2 # UnifiedReturnBlock
.LBB1_1: # ifthen
 shrl $16, %eax
 movzbl first_one(%eax), %eax
 ret
.LBB1_2: # UnifiedReturnBlock
 xorl %eax, %eax
 ret

We could change the "movl 8(%esp), %eax" into "movzwl 10(%esp), %eax"; this
lets us change the cmpl into a testl, which is shorter, and eliminate the shift.

//===---------------------------------------------------------------------===//

We compile this function:

define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext %d) nounwind {
entry:
 %tmp2 = icmp eq i8 %d, 0 ; <i1> [#uses=1]
 br i1 %tmp2, label %bb7, label %bb

bb: ; preds = %entry
 %tmp6 = add i32 %b, %a ; <i32> [#uses=1]
 ret i32 %tmp6

bb7: ; preds = %entry
 %tmp10 = sub i32 %a, %c ; <i32> [#uses=1]
 ret i32 %tmp10
}

to:

foo: # @foo
# BB#0: # %entry
 movl 4(%esp), %ecx
 cmpb $0, 16(%esp)
 je .LBB0_2
# BB#1: # %bb
 movl 8(%esp), %eax
 addl %ecx, %eax
 ret
.LBB0_2: # %bb7
 movl 12(%esp), %edx
 movl %ecx, %eax
 subl %edx, %eax
 ret

There's an obviously unnecessary movl in .LBB0_2, and we could eliminate a
couple more movls by putting 4(%esp) into %eax instead of %ecx.

//===---------------------------------------------------------------------===//

See rdar://4653682.

From flops:

LBB1_15: # bb310
 cvtss2sd LCPI1_0, %xmm1
 addsd %xmm1, %xmm0
 movsd 176(%esp), %xmm2
 mulsd %xmm0, %xmm2
 movapd %xmm2, %xmm3
 mulsd %xmm3, %xmm3
 movapd %xmm3, %xmm4
 mulsd LCPI1_23, %xmm4
 addsd LCPI1_24, %xmm4
 mulsd %xmm3, %xmm4
 addsd LCPI1_25, %xmm4
 mulsd %xmm3, %xmm4
 addsd LCPI1_26, %xmm4
 mulsd %xmm3, %xmm4
 addsd LCPI1_27, %xmm4
 mulsd %xmm3, %xmm4
 addsd LCPI1_28, %xmm4
 mulsd %xmm3, %xmm4
 addsd %xmm1, %xmm4
 mulsd %xmm2, %xmm4
 movsd 152(%esp), %xmm1
 addsd %xmm4, %xmm1
 movsd %xmm1, 152(%esp)
 incl %eax
 cmpl %eax, %esi
 jge LBB1_15 # bb310
LBB1_16: # bb358.loopexit
 movsd 152(%esp), %xmm0
 addsd %xmm0, %xmm0
 addsd LCPI1_22, %xmm0
 movsd %xmm0, 152(%esp)

Rather than spilling the result of the last addsd in the loop, we should
insert a copy to split the interval (one for the duration of the loop, one
extending to the fall through). The register pressure in the loop isn't high
enough to warrant the spill.

Also check why xmm7 is not used at all in the function.

//===---------------------------------------------------------------------===//

Take the following:

target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
target triple = "i386-apple-darwin8"
@in_exit.4870.b = internal global i1 false ; <i1*> [#uses=2]
define fastcc void @abort_gzip() noreturn nounwind {
entry:
 %tmp.b.i = load i1* @in_exit.4870.b ; <i1> [#uses=1]
 br i1 %tmp.b.i, label %bb.i, label %bb4.i
bb.i: ; preds = %entry
 tail call void @exit( i32 1 ) noreturn nounwind
 unreachable
bb4.i: ; preds = %entry
 store i1 true, i1* @in_exit.4870.b
 tail call void @exit( i32 1 ) noreturn nounwind
 unreachable
}
declare void @exit(i32) noreturn nounwind

This compiles into:
_abort_gzip: ## @abort_gzip
## BB#0: ## %entry
 subl $12, %esp
 movb _in_exit.4870.b, %al
 cmpb $1, %al
 jne LBB0_2

We somehow miss folding the movb into the cmpb.

//===---------------------------------------------------------------------===//

We compile:

int test(int x, int y) {
 return x-y-1;
}

into (-m64):

_test:
 decl %edi
 movl %edi, %eax
 subl %esi, %eax
 ret

it would be better to codegen as: x+~y (notl+addl)

//===---------------------------------------------------------------------===//

This code:

int foo(const char *str,...)
{
 __builtin_va_list a; int x;
 __builtin_va_start(a,str); x = __builtin_va_arg(a,int); __builtin_va_end(a);
 return x;
}

gets compiled into this on x86-64:
 subq $200, %rsp
 movaps %xmm7, 160(%rsp)
 movaps %xmm6, 144(%rsp)
 movaps %xmm5, 128(%rsp)
 movaps %xmm4, 112(%rsp)
 movaps %xmm3, 96(%rsp)
 movaps %xmm2, 80(%rsp)
 movaps %xmm1, 64(%rsp)
 movaps %xmm0, 48(%rsp)
 movq %r9, 40(%rsp)
 movq %r8, 32(%rsp)
 movq %rcx, 24(%rsp)
 movq %rdx, 16(%rsp)
 movq %rsi, 8(%rsp)
 leaq (%rsp), %rax
 movq %rax, 192(%rsp)
 leaq 208(%rsp), %rax
 movq %rax, 184(%rsp)
 movl $48, 180(%rsp)
 movl $8, 176(%rsp)
 movl 176(%rsp), %eax
 cmpl $47, %eax
 jbe .LBB1_3 # bb
.LBB1_1: # bb3
 movq 184(%rsp), %rcx
 leaq 8(%rcx), %rax
 movq %rax, 184(%rsp)
.LBB1_2: # bb4
 movl (%rcx), %eax
 addq $200, %rsp
 ret
.LBB1_3: # bb
 movl %eax, %ecx
 addl $8, %eax
 addq 192(%rsp), %rcx
 movl %eax, 176(%rsp)
 jmp .LBB1_2 # bb4

gcc 4.3 generates:
 subq $96, %rsp
.LCFI0:
 leaq 104(%rsp), %rax
 movq %rsi, -80(%rsp)
 movl $8, -120(%rsp)
 movq %rax, -112(%rsp)
 leaq -88(%rsp), %rax
 movq %rax, -104(%rsp)
 movl $8, %eax
 cmpl $48, %eax
 jb .L6
 movq -112(%rsp), %rdx
 movl (%rdx), %eax
 addq $96, %rsp
 ret
 .p2align 4,,10
 .p2align 3
.L6:
 mov %eax, %edx
 addq -104(%rsp), %rdx
 addl $8, %eax
 movl %eax, -120(%rsp)
 movl (%rdx), %eax
 addq $96, %rsp
 ret

and it gets compiled into this on x86:
 pushl %ebp
 movl %esp, %ebp
 subl $4, %esp
 leal 12(%ebp), %eax
 movl %eax, -4(%ebp)
 leal 16(%ebp), %eax
 movl %eax, -4(%ebp)
 movl 12(%ebp), %eax
 addl $4, %esp
 popl %ebp
 ret

gcc 4.3 generates:
 pushl %ebp
 movl %esp, %ebp
 movl 12(%ebp), %eax
 popl %ebp
 ret

//===---------------------------------------------------------------------===//

Teach tblgen not to check bitconvert source type in some cases. This allows us
to consolidate the following patterns in X86InstrMMX.td:

def : Pat<(v2i32 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
 (iPTR 0))))),
 (v2i32 (MMX_MOVDQ2Qrr VR128:$src))>;
def : Pat<(v4i16 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
 (iPTR 0))))),
 (v4i16 (MMX_MOVDQ2Qrr VR128:$src))>;
def : Pat<(v8i8 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
 (iPTR 0))))),
 (v8i8 (MMX_MOVDQ2Qrr VR128:$src))>;

There are other cases in various td files.

//===---------------------------------------------------------------------===//

Take something like the following on x86-32:
unsigned a(unsigned long long x, unsigned y) {return x % y;}

We currently generate a libcall, but we really shouldn't: the expansion is
shorter and likely faster than the libcall. The expected code is something
like the following:

 movl 12(%ebp), %eax
 movl 16(%ebp), %ecx
 xorl %edx, %edx
 divl %ecx
 movl 8(%ebp), %eax
 divl %ecx
 movl %edx, %eax
 ret

A similar code sequence works for division.

//===---------------------------------------------------------------------===//

1518These should compile to the same code, but the later codegen's to useless
instructions on X86. This may be a trivial dag combine (GCC PR7061):

struct s1 { unsigned char a, b; };
unsigned long f1(struct s1 x) {
 return x.a + x.b;
}
struct s2 { unsigned a: 8, b: 8; };
unsigned long f2(struct s2 x) {
 return x.a + x.b;
}

//===---------------------------------------------------------------------===//

We currently compile this:

define i32 @func1(i32 %v1, i32 %v2) nounwind {
entry:
 %t = call {i32, i1} @llvm.sadd.with.overflow.i32(i32 %v1, i32 %v2)
 %sum = extractvalue {i32, i1} %t, 0
 %obit = extractvalue {i32, i1} %t, 1
 br i1 %obit, label %overflow, label %normal
normal:
 ret i32 %sum
overflow:
 call void @llvm.trap()
 unreachable
}
declare {i32, i1} @llvm.sadd.with.overflow.i32(i32, i32)
declare void @llvm.trap()

to:

_func1:
 movl 4(%esp), %eax
 addl 8(%esp), %eax
 jo LBB1_2 ## overflow
LBB1_1: ## normal
 ret
LBB1_2: ## overflow
 ud2

it would be nice to produce "into" someday.

//===---------------------------------------------------------------------===//

This code:

void vec_mpys1(int y[], const int x[], int scaler) {
int i;
for (i = 0; i < 150; i++)
 y[i] += (((long long)scaler * (long long)x[i]) >> 31);
}

Compiles to this loop with GCC 3.x:

.L5:
 movl %ebx, %eax
 imull (%edi,%ecx,4)
 shrdl $31, %edx, %eax
 addl %eax, (%esi,%ecx,4)
 incl %ecx
 cmpl $149, %ecx
 jle .L5

llvm-gcc compiles it to the much uglier:

LBB1_1: ## bb1
 movl 24(%esp), %eax
 movl (%eax,%edi,4), %ebx
 movl %ebx, %ebp
 imull %esi, %ebp
 movl %ebx, %eax
 mull %ecx
 addl %ebp, %edx
 sarl $31, %ebx
 imull %ecx, %ebx
 addl %edx, %ebx
 shldl $1, %eax, %ebx
 movl 20(%esp), %eax
 addl %ebx, (%eax,%edi,4)
 incl %edi
 cmpl $150, %edi
 jne LBB1_1 ## bb1

The issue is that we hoist the cast of "scaler" to long long outside of the
loop, the value comes into the loop as two values, and
RegsForValue::getCopyFromRegs doesn't know how to put an AssertSext on the
constructed BUILD_PAIR which represents the cast value.

//===---------------------------------------------------------------------===//

Test instructions can be eliminated by using EFLAGS values from arithmetic
instructions. This is currently not done for mul, and, or, xor, neg, shl,
sra, srl, shld, shrd, atomic ops, and others. It is also currently not done
for read-modify-write instructions. It is also currently not done if the
OF or CF flags are needed.

The shift operators have the complication that when the shift count is
zero, EFLAGS is not set, so they can only subsume a test instruction if
the shift count is known to be non-zero. Also, using the EFLAGS value
from a shift is apparently very slow on some x86 implementations.

In read-modify-write instructions, the root node in the isel match is
the store, and isel has no way for the use of the EFLAGS result of the
arithmetic to be remapped to the new node.

Add and subtract instructions set OF on signed overflow and CF on unsigned
overflow, while test instructions always clear OF and CF. In order to
replace a test with an add or subtract in a situation where OF or CF is
needed, codegen must be able to prove that the operation cannot see
signed or unsigned overflow, respectively.

//===---------------------------------------------------------------------===//

memcpy/memmove do not lower to SSE copies when possible. A silly example is:
define <16 x float> @foo(<16 x float> %A) nounwind {
 %tmp = alloca <16 x float>, align 16
 %tmp2 = alloca <16 x float>, align 16
 store <16 x float> %A, <16 x float>* %tmp
 %s = bitcast <16 x float>* %tmp to i8*
 %s2 = bitcast <16 x float>* %tmp2 to i8*
 call void @llvm.memcpy.i64(i8* %s, i8* %s2, i64 64, i32 16)
 %R = load <16 x float>* %tmp2
 ret <16 x float> %R
}

declare void @llvm.memcpy.i64(i8* nocapture, i8* nocapture, i64, i32) nounwind

which compiles to:

_foo:
 subl $140, %esp
 movaps %xmm3, 112(%esp)
 movaps %xmm2, 96(%esp)
 movaps %xmm1, 80(%esp)
 movaps %xmm0, 64(%esp)
 movl 60(%esp), %eax
 movl %eax, 124(%esp)
 movl 56(%esp), %eax
 movl %eax, 120(%esp)
 movl 52(%esp), %eax
 <many many more 32-bit copies>
 movaps (%esp), %xmm0
 movaps 16(%esp), %xmm1
 movaps 32(%esp), %xmm2
 movaps 48(%esp), %xmm3
 addl $140, %esp
 ret

On Nehalem, it may even be cheaper to just use movups when unaligned than to
fall back to lower-granularity chunks.

//===---------------------------------------------------------------------===//

Implement processor-specific optimizations for parity with GCC on these
processors. GCC does two optimizations:

1. ix86_pad_returns inserts a noop before ret instructions if immediately
 preceded by a conditional branch or is the target of a jump.
16782. ix86_avoid_jump_misspredicts inserts noops in cases where a 16-byte block of
1679 code contains more than 3 branches.
1680
1681The first one is done for all AMDs, Core2, and "Generic"
1682The second one is done for: Atom, Pentium Pro, all AMDs, Pentium 4, Nocona,
1683 Core 2, and "Generic"
1684
//===---------------------------------------------------------------------===//

Testcase:
int a(int x) { return (x & 127) > 31; }

Current output:
 movl 4(%esp), %eax
 andl $127, %eax
 cmpl $31, %eax
 seta %al
 movzbl %al, %eax
 ret

Ideal output:
 xorl %eax, %eax
 testl $96, 4(%esp)
 setne %al
 ret

This should definitely be done in instcombine, canonicalizing the range
condition into a != condition: (x & 127) > 31 holds exactly when bit 5 or
bit 6 of x is set, i.e. when (x & 96) != 0. We get this IR:

define i32 @a(i32 %x) nounwind readnone {
entry:
 %0 = and i32 %x, 127            ; <i32> [#uses=1]
 %1 = icmp ugt i32 %0, 31        ; <i1> [#uses=1]
 %2 = zext i1 %1 to i32          ; <i32> [#uses=1]
 ret i32 %2
}

Instcombine prefers to strength-reduce relational comparisons to equality
comparisons when possible; this should be another case of that. It could be
handled pretty easily in InstCombiner::visitICmpInstWithInstAndIntCst, but it
looks like that function really ought to be redesigned to use
ComputeMaskedBits and friends first.

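As a standalone sanity check of the suggested testl $96 form (not part of the
testcase above, just a brute-force verification):

#include <assert.h>

int main(void) {
  /* The two conditions agree for every value of the masked bits. */
  for (int x = 0; x < 256; ++x)
    assert(((x & 127) > 31) == ((x & 96) != 0));
  return 0;
}
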
//===---------------------------------------------------------------------===//

Testcase:
int x(int a) { return (a&0xf0)>>4; }

Current output:
 movl 4(%esp), %eax
 shrl $4, %eax
 andl $15, %eax
 ret

Ideal output:
 movzbl 4(%esp), %eax
 shrl $4, %eax
 ret

//===---------------------------------------------------------------------===//

Testcase:
int x(int a) { return (a & 0x80) ? 0x100 : 0; }
int y(int a) { return (a & 0x80) * 2; }

Current:
 testl $128, 4(%esp)
 setne %al
 movzbl %al, %eax
 shll $8, %eax
 ret

Better:
 movl 4(%esp), %eax
 addl %eax, %eax
 andl $256, %eax
 ret

This is another general instcombine transformation that is profitable on all
targets. In LLVM IR, these functions look like this:

define i32 @x(i32 %a) nounwind readnone {
entry:
 %0 = and i32 %a, 128
 %1 = icmp eq i32 %0, 0
 %iftmp.0.0 = select i1 %1, i32 0, i32 256
 ret i32 %iftmp.0.0
}

define i32 @y(i32 %a) nounwind readnone {
entry:
 %0 = shl i32 %a, 1
 %1 = and i32 %0, 256
 ret i32 %1
}

Replacing an icmp+select with a shift should always be considered profitable in
instcombine.

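Likewise, a standalone brute-force check (not part of the testcase) that the
select form and the shift form compute the same value:

#include <assert.h>

int main(void) {
  for (int a = 0; a < 512; ++a)
    assert(((a & 0x80) ? 0x100 : 0) == ((a << 1) & 0x100));
  return 0;
}
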
//===---------------------------------------------------------------------===//

Re-implement atomic builtins __sync_add_and_fetch() and __sync_sub_and_fetch()
properly.

When the return value is not used (i.e. only the value in memory matters), x86
does not have to use an instruction that produces the result (e.g. xadd) to
implement these. Instead, it can use plain add, sub, inc, and dec instructions
with the "lock" prefix.

This is currently implemented using a bit of an instruction selection trick.
The issue is that the target-independent pattern produces one output and a
chain, and we want to map it into one that just outputs a chain. The current
trick is to select it into a MERGE_VALUES with the first definition being an
implicit_def. The proper solution is to add new ISD opcodes for the no-output
variant. The DAG combiner can then transform the node before it gets to target
node selection.

Problem #2 is that we are adding a whole bunch of x86 atomic instructions when
in fact these instructions are identical to the non-lock versions. We need a
way to add target-specific information to target nodes and have this
information carried over to machine instructions. The asm printer (or JIT) can
then use this information to add the "lock" prefix.

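A minimal C reproducer for the unused-result case (function names are
arbitrary; the comments describe the desired lowering, and whether the backend
currently produces it is exactly the issue above):

void add_flags(volatile int *p, int v) {
  /* The fetched result is discarded, so a single lock-prefixed add to memory
     would be enough; no xadd or compare-and-swap loop is needed. */
  __sync_add_and_fetch(p, v);
}

void drop_ref(volatile int *p) {
  /* Likewise, this could become a single lock-prefixed dec of the memory
     operand. */
  __sync_sub_and_fetch(p, 1);
}
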
//===---------------------------------------------------------------------===//

_Bool bar(int *x) { return *x & 1; }

define zeroext i1 @bar(i32* nocapture %x) nounwind readonly {
entry:
 %tmp1 = load i32* %x            ; <i32> [#uses=1]
 %and = and i32 %tmp1, 1         ; <i32> [#uses=1]
 %tobool = icmp ne i32 %and, 0   ; <i1> [#uses=1]
 ret i1 %tobool
}

bar:                                # @bar
# BB#0:                             # %entry
 movl 4(%esp), %eax
 movb (%eax), %al
 andb $1, %al
 movzbl %al, %eax
 ret

Missed optimization: should be movl+andl.

//===---------------------------------------------------------------------===//

Consider the following two functions compiled with clang:
_Bool foo(int *x) { return !(*x & 4); }
unsigned bar(int *x) { return !(*x & 4); }

foo:
 movl 4(%esp), %eax
 testb $4, (%eax)
 sete %al
 movzbl %al, %eax
 ret

bar:
 movl 4(%esp), %eax
 movl (%eax), %eax
 shrl $2, %eax
 andl $1, %eax
 xorl $1, %eax
 ret

The second function generates more code even though the two functions are
functionally identical.

//===---------------------------------------------------------------------===//

Take the following C code:
int x(int y) { return (y & 63) << 14; }

Code produced by gcc:
 andl $63, %edi
 sall $14, %edi
 movl %edi, %eax
 ret

Code produced by clang:
 shll $14, %edi
 movl %edi, %eax
 andl $1032192, %eax
 ret

The code produced by gcc is 3 bytes shorter, largely because the and with 63
fits in a sign-extended 8-bit immediate while the and with 1032192 needs a
full 32-bit immediate. This sort of construct often shows up with bitfields.

//===---------------------------------------------------------------------===//

Take the following C code:
int f(int a, int b) { return (unsigned char)a == (unsigned char)b; }

We generate the following IR with clang:
define i32 @f(i32 %a, i32 %b) nounwind readnone {
entry:
 %tmp = xor i32 %b, %a           ; <i32> [#uses=1]
 %tmp6 = and i32 %tmp, 255       ; <i32> [#uses=1]
 %cmp = icmp eq i32 %tmp6, 0     ; <i1> [#uses=1]
 %conv5 = zext i1 %cmp to i32    ; <i32> [#uses=1]
 ret i32 %conv5
}

And the following x86 code:
 xorl %esi, %edi
 testb $-1, %dil
 sete %al
 movzbl %al, %eax
 ret

A cmpb instead of the xorl+testb would be one instruction shorter.

//===---------------------------------------------------------------------===//

Given the following C code:
int f(int a, int b) { return (signed char)a == (signed char)b; }

We generate the following IR with clang:
define i32 @f(i32 %a, i32 %b) nounwind readnone {
entry:
 %sext = shl i32 %a, 24              ; <i32> [#uses=1]
 %conv1 = ashr i32 %sext, 24         ; <i32> [#uses=1]
 %sext6 = shl i32 %b, 24             ; <i32> [#uses=1]
 %conv4 = ashr i32 %sext6, 24        ; <i32> [#uses=1]
 %cmp = icmp eq i32 %conv1, %conv4   ; <i1> [#uses=1]
 %conv5 = zext i1 %cmp to i32        ; <i32> [#uses=1]
 ret i32 %conv5
}

And the following x86 code:
 movsbl %sil, %eax
 movsbl %dil, %ecx
 cmpl %eax, %ecx
 sete %al
 movzbl %al, %eax
 ret

It should be possible to eliminate the sign extensions: sign extension is
injective, so the extended values are equal exactly when the original low
bytes are equal, and a cmpb of %dil and %sil would suffice here as well.

//===---------------------------------------------------------------------===//

LLVM misses a load+store narrowing opportunity in this code:

%struct.bf = type { i64, i16, i16, i32 }

@bfi = external global %struct.bf* ; <%struct.bf**> [#uses=2]

define void @t1() nounwind ssp {
entry:
 %0 = load %struct.bf** @bfi, align 8              ; <%struct.bf*> [#uses=1]
 %1 = getelementptr %struct.bf* %0, i64 0, i32 1   ; <i16*> [#uses=1]
 %2 = bitcast i16* %1 to i32*                      ; <i32*> [#uses=2]
 %3 = load i32* %2, align 1                        ; <i32> [#uses=1]
 %4 = and i32 %3, -65537                           ; <i32> [#uses=1]
 store i32 %4, i32* %2, align 1
 %5 = load %struct.bf** @bfi, align 8              ; <%struct.bf*> [#uses=1]
 %6 = getelementptr %struct.bf* %5, i64 0, i32 1   ; <i16*> [#uses=1]
 %7 = bitcast i16* %6 to i32*                      ; <i32*> [#uses=2]
 %8 = load i32* %7, align 1                        ; <i32> [#uses=1]
 %9 = and i32 %8, -131073                          ; <i32> [#uses=1]
 store i32 %9, i32* %7, align 1
 ret void
}

LLVM currently emits this:

 movq bfi(%rip), %rax
 andl $-65537, 8(%rax)
 movq bfi(%rip), %rax
 andl $-131073, 8(%rax)
 ret

It could narrow the loads and stores to emit this (each of the two and
operations only clears bit 16 or bit 17 of the i32 at offset 8, i.e. bit 0 or
bit 1 of the byte at offset 10):

 movq bfi(%rip), %rax
 andb $-2, 10(%rax)
 movq bfi(%rip), %rax
 andb $-3, 10(%rax)
 ret

The trouble is that there is a TokenFactor between the store and the
load, making it non-trivial to determine if there's anything between
the load and the store which would prohibit narrowing.

//===---------------------------------------------------------------------===//