Dan Gohmanf17a25c2007-07-18 16:29:46 +00001//===---------------------------------------------------------------------===//
2// Random ideas for the X86 backend.
3//===---------------------------------------------------------------------===//
4
We should add support for the "movbe" instruction, which does a byte-swapping
copy (3-addr bswap + memory support?). This is available on Atom processors.
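
For example (a sketch; the function name is illustrative and assumes
__builtin_bswap32 as the source form):

unsigned load_be32(unsigned *p) {
  return __builtin_bswap32(*p);   /* could be a single: movbe (%ecx), %eax */
}                                 /* instead of a movl followed by bswapl  */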
Dan Gohmanf17a25c2007-07-18 16:29:46 +00007
8//===---------------------------------------------------------------------===//
9
Dan Gohmanf17a25c2007-07-18 16:29:46 +000010CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move. The X86
11backend knows how to three-addressify this shift, but it appears the register
12allocator isn't even asking it to do so in this case. We should investigate
why this isn't happening; it could have significant impact on other important
14cases for X86 as well.
15
16//===---------------------------------------------------------------------===//
17
18This should be one DIV/IDIV instruction, not a libcall:
19
20unsigned test(unsigned long long X, unsigned Y) {
21 return X/Y;
22}
23
24This can be done trivially with a custom legalizer. What about overflow
25though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
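
A sketch of the desired code, assuming the quotient fits in 32 bits (if it
does not, divl faults, which is exactly the overflow question above):

_test:
	movl 4(%esp), %eax      # low 32 bits of X
	movl 8(%esp), %edx      # high 32 bits of X
	divl 12(%esp)           # edx:eax / Y, quotient -> eax
	ret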
26
27//===---------------------------------------------------------------------===//
28
29Improvements to the multiply -> shift/add algorithm:
30http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
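
Illustrative examples of the kind of decompositions involved (using LEA's
scaled addressing; the exact sequences chosen should be cost driven):

int mul9(int x)  { return x * 9;  }   /* leal (%eax,%eax,8), %eax */
int mul45(int x) { return x * 45; }   /* leal (%eax,%eax,8), %eax */
                                      /* leal (%eax,%eax,4), %eax */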
31
32//===---------------------------------------------------------------------===//
33
34Improve code like this (occurs fairly frequently, e.g. in LLVM):
35long long foo(int x) { return 1LL << x; }
36
37http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
38http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
39http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html
40
41Another useful one would be ~0ULL >> X and ~0ULL << X.
42
43One better solution for 1LL << x is:
44 xorl %eax, %eax
45 xorl %edx, %edx
46 testb $32, %cl
47 sete %al
48 setne %dl
49 sall %cl, %eax
50 sall %cl, %edx
51
52But that requires good 8-bit subreg support.
53
Eli Friedman577c7492008-02-21 21:16:49 +000054Also, this might be better. It's an extra shift, but it's one instruction
55shorter, and doesn't stress 8-bit subreg support.
56(From http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html,
57but without the unnecessary and.)
58 movl %ecx, %eax
59 shrl $5, %eax
60 movl %eax, %edx
61 xorl $1, %edx
62 sall %cl, %eax
 sall %cl, %edx
64
Dan Gohmanf17a25c2007-07-18 16:29:46 +00006564-bit shifts (in general) expand to really bad code. Instead of using
66cmovs, we should expand to a conditional branch like GCC produces.
67
68//===---------------------------------------------------------------------===//
69
70Compile this:
71_Bool f(_Bool a) { return a!=1; }
72
73into:
74 movzbl %dil, %eax
75 xorl $1, %eax
76 ret
77
Eli Friedman577c7492008-02-21 21:16:49 +000078(Although note that this isn't a legal way to express the code that llvm-gcc
79currently generates for that function.)
80
Dan Gohmanf17a25c2007-07-18 16:29:46 +000081//===---------------------------------------------------------------------===//
82
83Some isel ideas:
84
1. Dynamic programming based approach when compile time is not an
   issue.
872. Code duplication (addressing mode) during isel.
883. Other ideas from "Register-Sensitive Selection, Duplication, and
89 Sequencing of Instructions".
904. Scheduling for reduced register pressure. E.g. "Minimum Register
91 Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
92 and other related papers.
93 http://citeseer.ist.psu.edu/govindarajan01minimum.html
94
95//===---------------------------------------------------------------------===//
96
97Should we promote i16 to i32 to avoid partial register update stalls?
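
For example, for an i16 increment (a sketch; the exact stall behavior is
microarchitecture dependent):

unsigned short inc16(unsigned short x) { return x + 1; }

	# i16 arithmetic writes only %ax, a partial update of %eax that
	# depends on the old %eax value:
	movw    4(%esp), %ax
	incw    %ax
	# promoted to i32: full-register writes, no partial-register stall:
	movzwl  4(%esp), %eax
	incl    %eax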
98
99//===---------------------------------------------------------------------===//
100
101Leave any_extend as pseudo instruction and hint to register
102allocator. Delay codegen until post register allocation.
Note: any_extend is now turned into an INSERT_SUBREG. We still need to teach
104the coalescer how to deal with it though.
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000105
106//===---------------------------------------------------------------------===//
107
It appears icc uses push for parameter passing. Need to investigate.
109
110//===---------------------------------------------------------------------===//
111
112Only use inc/neg/not instructions on processors where they are faster than
113add/sub/xor. They are slower on the P4 due to only updating some processor
114flags.
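
For example, on the P4 the first form below is slower because inc leaves CF
untouched and so merges with the previous flags value instead of overwriting
all flags:

	incl    %eax            # partial flag update (CF preserved)
	addl    $1, %eax        # writes all flags: preferred on P4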
115
116//===---------------------------------------------------------------------===//
117
The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
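
For example (a sketch in the typed-pointer IR used elsewhere in this file):

  %v = load i32* %p
  %c = icmp slt i32 %v, %r      ; load on the LHS: does not match (cmp reg, (load p))

could be turned into the equivalent of

  %v = load i32* %p
  %c = icmp sgt i32 %r, %v      ; load on the RHS: folds into the compare

by swapping the operands and using the swapped predicate.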
123
124//===---------------------------------------------------------------------===//
125
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000126In many cases, LLVM generates code like this:
127
128_test:
129 movl 8(%esp), %eax
130 cmpl %eax, 4(%esp)
131 setl %al
132 movzbl %al, %eax
133 ret
134
On some processors (which ones?), it is more efficient to do this:
136
137_test:
138 movl 8(%esp), %ebx
139 xor %eax, %eax
140 cmpl %ebx, 4(%esp)
141 setl %al
142 ret
143
144Doing this correctly is tricky though, as the xor clobbers the flags.
145
146//===---------------------------------------------------------------------===//
147
148We should generate bts/btr/etc instructions on targets where they are cheap or
149when codesize is important. e.g., for:
150
151void setbit(int *target, int bit) {
152 *target |= (1 << bit);
153}
154void clearbit(int *target, int bit) {
155 *target &= ~(1 << bit);
156}
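
For example, we would like something along these lines (a sketch; it assumes
the bit index is in range, so the lack of masking in the bt family doesn't
matter):

_setbit:
	movl    4(%esp), %eax
	movl    8(%esp), %ecx
	btsl    %ecx, (%eax)
	ret
_clearbit:
	movl    4(%esp), %eax
	movl    8(%esp), %ecx
	btrl    %ecx, (%eax)
	ret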
157
158//===---------------------------------------------------------------------===//
159
160Instead of the following for memset char*, 1, 10:
161
162 movl $16843009, 4(%edx)
163 movl $16843009, (%edx)
164 movw $257, 8(%edx)
165
166It might be better to generate
167
168 movl $16843009, %eax
169 movl %eax, 4(%edx)
170 movl %eax, (%edx)
 movw %ax, 8(%edx)
172
173when we can spare a register. It reduces code size.
174
175//===---------------------------------------------------------------------===//
176
177Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
178get this:
179
Eli Friedman1aa1f2c2008-02-28 00:21:43 +0000180define i32 @test1(i32 %X) {
181 %Y = sdiv i32 %X, 8
182 ret i32 %Y
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000183}
184
185_test1:
186 movl 4(%esp), %eax
187 movl %eax, %ecx
188 sarl $31, %ecx
189 shrl $29, %ecx
190 addl %ecx, %eax
191 sarl $3, %eax
192 ret
193
194GCC knows several different ways to codegen it, one of which is this:
195
196_test1:
197 movl 4(%esp), %eax
198 cmpl $-1, %eax
199 leal 7(%eax), %ecx
200 cmovle %ecx, %eax
201 sarl $3, %eax
202 ret
203
204which is probably slower, but it's interesting at least :)
205
206//===---------------------------------------------------------------------===//
207
We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl. We should leave these as libcalls for everything over a much lower
threshold, since libc is hand tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc.).
212
213//===---------------------------------------------------------------------===//
214
215Optimize this into something reasonable:
216 x * copysign(1.0, y) * copysign(1.0, z)
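
One candidate lowering (a sketch; the name is illustrative, and it ignores
that the multiplies would quiet signaling NaNs and can raise flags) is to xor
the sign bits of y and z into x:

#include <string.h>
double fold_copysigns(double x, double y, double z) {
  unsigned long long xi, yi, zi;
  memcpy(&xi, &x, 8);
  memcpy(&yi, &y, 8);
  memcpy(&zi, &z, 8);
  xi ^= (yi ^ zi) & 0x8000000000000000ULL;  /* flip sign(x) by sign(y)^sign(z) */
  memcpy(&x, &xi, 8);
  return x;
}

In SSE this is roughly an xorpd plus an andpd with a sign-mask constant plus a
final xorpd.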
217
218//===---------------------------------------------------------------------===//
219
220Optimize copysign(x, *y) to use an integer load from y.
221
222//===---------------------------------------------------------------------===//
223
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000224The following tests perform worse with LSR:
225
226lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.
227
228//===---------------------------------------------------------------------===//
229
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000230Adding to the list of cmp / test poor codegen issues:
231
232int test(__m128 *A, __m128 *B) {
233 if (_mm_comige_ss(*A, *B))
234 return 3;
235 else
236 return 4;
237}
238
239_test:
240 movl 8(%esp), %eax
241 movaps (%eax), %xmm0
242 movl 4(%esp), %eax
243 movaps (%eax), %xmm1
244 comiss %xmm0, %xmm1
245 setae %al
246 movzbl %al, %ecx
247 movl $3, %eax
248 movl $4, %edx
249 cmpl $0, %ecx
250 cmove %edx, %eax
251 ret
252
Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There
are a number of issues. 1) We are introducing a setcc between the result of the
intrinsic call and select. 2) The intrinsic is expected to produce an i32 value
so an any_extend (which becomes a zero extend) is added.
257
258We probably need some kind of target DAG combine hook to fix this.
259
260//===---------------------------------------------------------------------===//
261
262We generate significantly worse code for this than GCC:
263http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
264http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701
265
There is also one case where we do worse on PPC.
267
268//===---------------------------------------------------------------------===//
269
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000270For this:
271
272int test(int a)
273{
274 return a * 3;
275}
276
We currently emit:
278 imull $3, 4(%esp), %eax
279
Perhaps this is what we really should generate? Is imull three or four
cycles? Note: ICC generates this:
282 movl 4(%esp), %eax
283 leal (%eax,%eax,2), %eax
284
285The current instruction priority is based on pattern complexity. The former is
286more "complex" because it folds a load so the latter will not be emitted.
287
288Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
289should always try to match LEA first since the LEA matching code does some
290estimate to determine whether the match is profitable.
291
292However, if we care more about code size, then imull is better. It's two bytes
293shorter than movl + leal.
294
Eli Friedman9ab1db02008-11-30 07:52:27 +0000295On a Pentium M, both variants have the same characteristics with regard
296to throughput; however, the multiplication has a latency of four cycles, as
297opposed to two cycles for the movl+lea variant.
298
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000299//===---------------------------------------------------------------------===//
300
Eli Friedman577c7492008-02-21 21:16:49 +0000301__builtin_ffs codegen is messy.
Chris Lattnera86af9a2007-08-11 18:19:07 +0000302
Chris Lattnera86af9a2007-08-11 18:19:07 +0000303int ffs_(unsigned X) { return __builtin_ffs(X); }
304
Eli Friedman577c7492008-02-21 21:16:49 +0000305llvm produces:
306ffs_:
307 movl 4(%esp), %ecx
308 bsfl %ecx, %eax
309 movl $32, %edx
310 cmove %edx, %eax
311 incl %eax
312 xorl %edx, %edx
313 testl %ecx, %ecx
314 cmove %edx, %eax
Chris Lattnera86af9a2007-08-11 18:19:07 +0000315 ret
Eli Friedman577c7492008-02-21 21:16:49 +0000316
317vs gcc:
318
Chris Lattnera86af9a2007-08-11 18:19:07 +0000319_ffs_:
320 movl $-1, %edx
321 bsfl 4(%esp), %eax
322 cmove %edx, %eax
323 addl $1, %eax
324 ret
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000325
Eli Friedman577c7492008-02-21 21:16:49 +0000326Another example of __builtin_ffs (use predsimplify to eliminate a select):
327
328int foo (unsigned long j) {
329 if (j)
330 return __builtin_ffs (j) - 1;
331 else
332 return 0;
333}
334
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000335//===---------------------------------------------------------------------===//
336
It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.
342
343//===---------------------------------------------------------------------===//
344
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000345define i32 @foo(i32* %a, i32 %t) {
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000346entry:
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000347 br label %cond_true
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000348
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000349cond_true: ; preds = %cond_true, %entry
350 %x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ] ; <i32> [#uses=3]
351 %t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ] ; <i32> [#uses=1]
352 %tmp2 = getelementptr i32* %a, i32 %x.0.0 ; <i32*> [#uses=1]
353 %tmp3 = load i32* %tmp2 ; <i32> [#uses=1]
354 %tmp5 = add i32 %t_addr.0.0, %x.0.0 ; <i32> [#uses=1]
355 %tmp7 = add i32 %tmp5, %tmp3 ; <i32> [#uses=2]
356 %tmp9 = add i32 %x.0.0, 1 ; <i32> [#uses=2]
357 %tmp = icmp sgt i32 %tmp9, 39 ; <i1> [#uses=1]
358 br i1 %tmp, label %bb12, label %cond_true
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000359
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000360bb12: ; preds = %cond_true
361 ret i32 %tmp7
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000362}
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000363is pessimized by -loop-reduce and -indvars
364
365//===---------------------------------------------------------------------===//
366
367u32 to float conversion improvement:
368
369float uint32_2_float( unsigned u ) {
370 float fl = (int) (u & 0xffff);
371 float fh = (int) (u >> 16);
372 fh *= 0x1.0p16f;
373 return fh + fl;
374}
375
37600000000 subl $0x04,%esp
37700000003 movl 0x08(%esp,1),%eax
37800000007 movl %eax,%ecx
37900000009 shrl $0x10,%ecx
3800000000c cvtsi2ss %ecx,%xmm0
38100000010 andl $0x0000ffff,%eax
38200000015 cvtsi2ss %eax,%xmm1
38300000019 mulss 0x00000078,%xmm0
38400000021 addss %xmm1,%xmm0
38500000025 movss %xmm0,(%esp,1)
3860000002a flds (%esp,1)
3870000002d addl $0x04,%esp
38800000030 ret
389
390//===---------------------------------------------------------------------===//
391
When using the fastcc ABI, align the stack slot of an argument of type double
on an 8-byte boundary to improve performance.
394
395//===---------------------------------------------------------------------===//
396
397Codegen:
398
399int f(int a, int b) {
400 if (a == 4 || a == 6)
401 b++;
402 return b;
403}
404
405
406as:
407
408or eax, 2
409cmp eax, 6
410jz label
411
412//===---------------------------------------------------------------------===//
413
414GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
415simplifications for integer "x cmp y ? a : b". For example, instead of:
416
417int G;
418void f(int X, int Y) {
419 G = X < 0 ? 14 : 13;
420}
421
422compiling to:
423
424_f:
425 movl $14, %eax
426 movl $13, %ecx
427 movl 4(%esp), %edx
428 testl %edx, %edx
429 cmovl %eax, %ecx
430 movl %ecx, _G
431 ret
432
433it could be:
434_f:
435 movl 4(%esp), %eax
436 sarl $31, %eax
437 notl %eax
438 addl $14, %eax
439 movl %eax, _G
440 ret
441
442etc.
443
Chris Lattnere7037c22007-11-02 17:04:20 +0000444Another is:
445int usesbb(unsigned int a, unsigned int b) {
446 return (a < b ? -1 : 0);
447}
448to:
449_usesbb:
450 movl 8(%esp), %eax
451 cmpl %eax, 4(%esp)
452 sbbl %eax, %eax
453 ret
454
455instead of:
456_usesbb:
457 xorl %eax, %eax
458 movl 8(%esp), %ecx
459 cmpl %ecx, 4(%esp)
460 movl $4294967295, %ecx
461 cmovb %ecx, %eax
462 ret
463
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000464//===---------------------------------------------------------------------===//
465
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000466Consider the expansion of:
467
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000468define i32 @test3(i32 %X) {
469 %tmp1 = urem i32 %X, 255
470 ret i32 %tmp1
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000471}
472
473Currently it compiles to:
474
475...
476 movl $2155905153, %ecx
477 movl 8(%esp), %esi
478 movl %esi, %eax
479 mull %ecx
480...
481
482This could be "reassociated" into:
483
484 movl $2155905153, %eax
485 movl 8(%esp), %ecx
486 mull %ecx
487
488to avoid the copy. In fact, the existing two-address stuff would do this
489except that mul isn't a commutative 2-addr instruction. I guess this has
490to be done at isel time based on the #uses to mul?
491
492//===---------------------------------------------------------------------===//
493
494Make sure the instruction which starts a loop does not cross a cacheline
boundary. This requires knowing the exact length of each machine instruction.
496That is somewhat complicated, but doable. Example 256.bzip2:
497
498In the new trace, the hot loop has an instruction which crosses a cacheline
499boundary. In addition to potential cache misses, this can't help decoding as I
500imagine there has to be some kind of complicated decoder reset and realignment
501to grab the bytes from the next cacheline.
502
503532 532 0x3cfc movb (1809(%esp, %esi), %bl <<<--- spans 2 64 byte lines
Eli Friedman9ab1db02008-11-30 07:52:27 +0000504942 942 0x3d03 movl %dh, (1809(%esp, %esi)
505937 937 0x3d0a incl %esi
5063 3 0x3d0b cmpb %bl, %dl
Dan Gohmanf17a25c2007-07-18 16:29:46 +000050727 27 0x3d0d jnz 0x000062db <main+11707>
508
509//===---------------------------------------------------------------------===//
510
511In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.
512
513//===---------------------------------------------------------------------===//
514
515This could be a single 16-bit load.
516
517int f(char *p) {
518 if ((p[0] == 1) & (p[1] == 2)) return 1;
519 return 0;
520}
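
i.e. something like this (a sketch; $0x0201 assumes little-endian byte order,
and the possibly unaligned 16-bit load is fine on x86):

_f:
	movl    4(%esp), %eax
	movzwl  (%eax), %eax
	cmpl    $0x0201, %eax     # p[0] == 1 in the low byte, p[1] == 2 in the high
	sete    %al
	movzbl  %al, %eax
	ret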
521
522//===---------------------------------------------------------------------===//
523
524We should inline lrintf and probably other libc functions.
525
526//===---------------------------------------------------------------------===//
527
Dan Gohman52415eb2010-01-04 20:55:05 +0000528Use the FLAGS values from arithmetic instructions more. For example, compile:
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000529
530int add_zf(int *x, int y, int a, int b) {
531 if ((*x += y) == 0)
532 return a;
533 else
534 return b;
535}
536
537to:
538 addl %esi, (%rdi)
539 movl %edx, %eax
540 cmovne %ecx, %eax
541 ret
542instead of:
543
544_add_zf:
545 addl (%rdi), %esi
546 movl %esi, (%rdi)
547 testl %esi, %esi
548 cmove %edx, %ecx
549 movl %ecx, %eax
550 ret
551
Dan Gohman52415eb2010-01-04 20:55:05 +0000552As another example, compile function f2 in test/CodeGen/X86/cmp-test.ll
553without a test instruction.
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000554
555//===---------------------------------------------------------------------===//
556
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000557These two functions have identical effects:
558
559unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
560unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
561
562We currently compile them to:
563
564_f:
565 movl 4(%esp), %eax
566 movl %eax, %ecx
567 incl %ecx
568 movl 8(%esp), %edx
569 cmpl %edx, %ecx
570 jne LBB1_2 #UnifiedReturnBlock
571LBB1_1: #cond_true
572 addl $2, %eax
573 ret
574LBB1_2: #UnifiedReturnBlock
575 movl %ecx, %eax
576 ret
577_f2:
578 movl 4(%esp), %eax
579 movl %eax, %ecx
580 incl %ecx
581 cmpl 8(%esp), %ecx
582 sete %cl
583 movzbl %cl, %ecx
584 leal 1(%ecx,%eax), %eax
585 ret
586
587both of which are inferior to GCC's:
588
589_f:
590 movl 4(%esp), %edx
591 leal 1(%edx), %eax
592 addl $2, %edx
593 cmpl 8(%esp), %eax
594 cmove %edx, %eax
595 ret
596_f2:
597 movl 4(%esp), %eax
598 addl $1, %eax
599 xorl %edx, %edx
600 cmpl 8(%esp), %eax
601 sete %dl
602 addl %edx, %eax
603 ret
604
605//===---------------------------------------------------------------------===//
606
607This code:
608
609void test(int X) {
610 if (X) abort();
611}
612
613is currently compiled to:
614
615_test:
616 subl $12, %esp
617 cmpl $0, 16(%esp)
618 jne LBB1_1
619 addl $12, %esp
620 ret
621LBB1_1:
622 call L_abort$stub
623
624It would be better to produce:
625
626_test:
627 subl $12, %esp
628 cmpl $0, 16(%esp)
629 jne L_abort$stub
630 addl $12, %esp
631 ret
632
633This can be applied to any no-return function call that takes no arguments etc.
634Alternatively, the stack save/restore logic could be shrink-wrapped, producing
635something like this:
636
637_test:
638 cmpl $0, 4(%esp)
639 jne LBB1_1
640 ret
641LBB1_1:
642 subl $12, %esp
643 call L_abort$stub
644
645Both are useful in different situations. Finally, it could be shrink-wrapped
646and tail called, like this:
647
648_test:
649 cmpl $0, 4(%esp)
650 jne LBB1_1
651 ret
652LBB1_1:
653 pop %eax # realign stack.
654 call L_abort$stub
655
656Though this probably isn't worth it.
657
658//===---------------------------------------------------------------------===//
659
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000660Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
661a neg instead of a sub instruction. Consider:
662
663int test(char X) { return 7-X; }
664
665we currently produce:
666_test:
667 movl $7, %eax
668 movsbl 4(%esp), %ecx
669 subl %ecx, %eax
670 ret
671
672We would use one fewer register if codegen'd as:
673
674 movsbl 4(%esp), %eax
675 neg %eax
676 add $7, %eax
677 ret
678
679Note that this isn't beneficial if the load can be folded into the sub. In
680this case, we want a sub:
681
682int test(int X) { return 7-X; }
683_test:
684 movl $7, %eax
685 subl 4(%esp), %eax
686 ret
687
688//===---------------------------------------------------------------------===//
689
Chris Lattner32f65872007-08-20 02:14:33 +0000690Leaf functions that require one 4-byte spill slot have a prolog like this:
691
692_foo:
693 pushl %esi
694 subl $4, %esp
695...
696and an epilog like this:
697 addl $4, %esp
698 popl %esi
699 ret
700
701It would be smaller, and potentially faster, to push eax on entry and to
702pop into a dummy register instead of using addl/subl of esp. Just don't pop
703into any return registers :)
704
705//===---------------------------------------------------------------------===//
Chris Lattner44b03cb2007-08-23 15:22:07 +0000706
707The X86 backend should fold (branch (or (setcc, setcc))) into multiple
708branches. We generate really poor code for:
709
710double testf(double a) {
711 return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
712}
713
714For example, the entry BB is:
715
716_testf:
717 subl $20, %esp
718 pxor %xmm0, %xmm0
719 movsd 24(%esp), %xmm1
720 ucomisd %xmm0, %xmm1
721 setnp %al
722 sete %cl
723 testb %cl, %al
724 jne LBB1_5 # UnifiedReturnBlock
725LBB1_1: # cond_true
726
727
728it would be better to replace the last four instructions with:
729
730 jp LBB1_1
731 je LBB1_5
732LBB1_1:
733
734We also codegen the inner ?: into a diamond:
735
736 cvtss2sd LCPI1_0(%rip), %xmm2
737 cvtss2sd LCPI1_1(%rip), %xmm3
738 ucomisd %xmm1, %xmm0
739 ja LBB1_3 # cond_true
740LBB1_2: # cond_true
741 movapd %xmm3, %xmm2
742LBB1_3: # cond_true
743 movapd %xmm2, %xmm0
744 ret
745
746We should sink the load into xmm3 into the LBB1_2 block. This should
747be pretty easy, and will nuke all the copies.
748
749//===---------------------------------------------------------------------===//
Chris Lattner4084d492007-09-10 21:43:18 +0000750
751This:
752 #include <algorithm>
753 inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
754 { return std::make_pair(a + b, a + b < a); }
755 bool no_overflow(unsigned a, unsigned b)
756 { return !full_add(a, b).second; }
757
758Should compile to:
759
760
761 _Z11no_overflowjj:
762 addl %edi, %esi
763 setae %al
764 ret
765
Eli Friedman577c7492008-02-21 21:16:49 +0000766FIXME: That code looks wrong; bool return is normally defined as zext.
767
Chris Lattner4084d492007-09-10 21:43:18 +0000768on x86-64, not:
769
770__Z11no_overflowjj:
771 addl %edi, %esi
772 cmpl %edi, %esi
773 setae %al
774 movzbl %al, %eax
775 ret
776
777
778//===---------------------------------------------------------------------===//
Evan Cheng35127a62007-09-10 22:16:37 +0000779
Bill Wendling7f436dd2007-10-02 20:42:59 +0000780The following code:
781
Bill Wendlingc2036e32007-10-02 20:54:32 +0000782bb114.preheader: ; preds = %cond_next94
783 %tmp231232 = sext i16 %tmp62 to i32 ; <i32> [#uses=1]
784 %tmp233 = sub i32 32, %tmp231232 ; <i32> [#uses=1]
785 %tmp245246 = sext i16 %tmp65 to i32 ; <i32> [#uses=1]
786 %tmp252253 = sext i16 %tmp68 to i32 ; <i32> [#uses=1]
787 %tmp254 = sub i32 32, %tmp252253 ; <i32> [#uses=1]
788 %tmp553554 = bitcast i16* %tmp37 to i8* ; <i8*> [#uses=2]
789 %tmp583584 = sext i16 %tmp98 to i32 ; <i32> [#uses=1]
790 %tmp585 = sub i32 32, %tmp583584 ; <i32> [#uses=1]
791 %tmp614615 = sext i16 %tmp101 to i32 ; <i32> [#uses=1]
792 %tmp621622 = sext i16 %tmp104 to i32 ; <i32> [#uses=1]
793 %tmp623 = sub i32 32, %tmp621622 ; <i32> [#uses=1]
794 br label %bb114
795
796produces:
797
Bill Wendling7f436dd2007-10-02 20:42:59 +0000798LBB3_5: # bb114.preheader
799 movswl -68(%ebp), %eax
800 movl $32, %ecx
801 movl %ecx, -80(%ebp)
802 subl %eax, -80(%ebp)
803 movswl -52(%ebp), %eax
804 movl %ecx, -84(%ebp)
805 subl %eax, -84(%ebp)
806 movswl -70(%ebp), %eax
807 movl %ecx, -88(%ebp)
808 subl %eax, -88(%ebp)
809 movswl -50(%ebp), %eax
810 subl %eax, %ecx
811 movl %ecx, -76(%ebp)
812 movswl -42(%ebp), %eax
813 movl %eax, -92(%ebp)
814 movswl -66(%ebp), %eax
815 movl %eax, -96(%ebp)
816 movw $0, -98(%ebp)
817
Chris Lattner792bae52007-10-03 03:40:24 +0000818This appears to be bad because the RA is not folding the store to the stack
819slot into the movl. The above instructions could be:
820 movl $32, -80(%ebp)
821...
822 movl $32, -84(%ebp)
823...
824This seems like a cross between remat and spill folding.
825
Bill Wendlingc2036e32007-10-02 20:54:32 +0000826This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't
Bill Wendling7f436dd2007-10-02 20:42:59 +0000827change, so we could simply subtract %eax from %ecx first and then use %ecx (or
828vice-versa).
829
830//===---------------------------------------------------------------------===//
831
Bill Wendling54c4f832007-10-02 21:49:31 +0000832This code:
833
834 %tmp659 = icmp slt i16 %tmp654, 0 ; <i1> [#uses=1]
835 br i1 %tmp659, label %cond_true662, label %cond_next715
836
837produces this:
838
839 testw %cx, %cx
840 movswl %cx, %esi
841 jns LBB4_109 # cond_next715
842
843Shark tells us that using %cx in the testw instruction is sub-optimal. It
844suggests using the 32-bit register (which is what ICC uses).
845
846//===---------------------------------------------------------------------===//
Chris Lattner802c62a2007-10-03 17:10:03 +0000847
Chris Lattnerae259992007-10-04 15:47:27 +0000848We compile this:
849
850void compare (long long foo) {
851 if (foo < 4294967297LL)
852 abort();
853}
854
855to:
856
Eli Friedman577c7492008-02-21 21:16:49 +0000857compare:
858 subl $4, %esp
859 cmpl $0, 8(%esp)
Chris Lattnerae259992007-10-04 15:47:27 +0000860 setne %al
861 movzbw %al, %ax
Eli Friedman577c7492008-02-21 21:16:49 +0000862 cmpl $1, 12(%esp)
Chris Lattnerae259992007-10-04 15:47:27 +0000863 setg %cl
864 movzbw %cl, %cx
865 cmove %ax, %cx
Eli Friedman577c7492008-02-21 21:16:49 +0000866 testb $1, %cl
867 jne .LBB1_2 # UnifiedReturnBlock
868.LBB1_1: # ifthen
869 call abort
870.LBB1_2: # UnifiedReturnBlock
871 addl $4, %esp
872 ret
Chris Lattnerae259992007-10-04 15:47:27 +0000873
874(also really horrible code on ppc). This is due to the expand code for 64-bit
875compares. GCC produces multiple branches, which is much nicer:
876
Eli Friedman577c7492008-02-21 21:16:49 +0000877compare:
878 subl $12, %esp
879 movl 20(%esp), %edx
880 movl 16(%esp), %eax
881 decl %edx
882 jle .L7
883.L5:
884 addl $12, %esp
885 ret
886 .p2align 4,,7
887.L7:
888 jl .L4
Chris Lattnerae259992007-10-04 15:47:27 +0000889 cmpl $0, %eax
Eli Friedman577c7492008-02-21 21:16:49 +0000890 .p2align 4,,8
891 ja .L5
892.L4:
893 .p2align 4,,9
894 call abort
Chris Lattnerae259992007-10-04 15:47:27 +0000895
896//===---------------------------------------------------------------------===//
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000897
Tail call optimization improvements: Tail call optimization currently
pushes all arguments on the top of the stack (their normal place for
non-tail call optimized calls) that source from the caller's arguments
or that source from a virtual register (also possibly sourcing from
the caller's arguments).
This is done to prevent overwriting of parameters (see example
below) that might be used later.
Arnold Schwaighofer373e8652007-10-12 21:30:57 +0000905
906example:
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000907
908int callee(int32, int64);
909int caller(int32 arg1, int32 arg2) {
910 int64 local = arg2 * 2;
911 return callee(arg2, (int64)local);
912}
913
914[arg1] [!arg2 no longer valid since we moved local onto it]
915[arg2] -> [(int64)
916[RETADDR] local ]
917
Moving arg1 onto the stack slot of the callee function would overwrite
arg2 of the caller.
920
921Possible optimizations:
922
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000923
Arnold Schwaighofer373e8652007-10-12 21:30:57 +0000924 - Analyse the actual parameters of the callee to see which would
925 overwrite a caller parameter which is used by the callee and only
926 push them onto the top of the stack.
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000927
928 int callee (int32 arg1, int32 arg2);
929 int caller (int32 arg1, int32 arg2) {
930 return callee(arg1,arg2);
931 }
932
Arnold Schwaighofer373e8652007-10-12 21:30:57 +0000933 Here we don't need to write any variables to the top of the stack
934 since they don't overwrite each other.
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000935
936 int callee (int32 arg1, int32 arg2);
937 int caller (int32 arg1, int32 arg2) {
938 return callee(arg2,arg1);
939 }
940
Arnold Schwaighofer373e8652007-10-12 21:30:57 +0000941 Here we need to push the arguments because they overwrite each
942 other.
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000943
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000944//===---------------------------------------------------------------------===//
Evan Cheng7f1ad6a2007-10-28 04:01:09 +0000945
946main ()
947{
948 int i = 0;
949 unsigned long int z = 0;
950
951 do {
952 z -= 0x00004000;
953 i++;
954 if (i > 0x00040000)
955 abort ();
956 } while (z > 0);
957 exit (0);
958}
959
960gcc compiles this to:
961
962_main:
963 subl $28, %esp
964 xorl %eax, %eax
965 jmp L2
966L3:
967 cmpl $262144, %eax
968 je L10
969L2:
970 addl $1, %eax
971 cmpl $262145, %eax
972 jne L3
973 call L_abort$stub
974L10:
975 movl $0, (%esp)
976 call L_exit$stub
977
978llvm:
979
980_main:
981 subl $12, %esp
982 movl $1, %eax
983 movl $16384, %ecx
984LBB1_1: # bb
985 cmpl $262145, %eax
986 jge LBB1_4 # cond_true
987LBB1_2: # cond_next
988 incl %eax
989 addl $4294950912, %ecx
990 cmpl $16384, %ecx
991 jne LBB1_1 # bb
992LBB1_3: # bb11
993 xorl %eax, %eax
994 addl $12, %esp
995 ret
996LBB1_4: # cond_true
997 call L_abort$stub
998
9991. LSR should rewrite the first cmp with induction variable %ecx.
10002. DAG combiner should fold
1001 leal 1(%eax), %edx
1002 cmpl $262145, %edx
1003 =>
1004 cmpl $262144, %eax
1005
1006//===---------------------------------------------------------------------===//
Chris Lattner358670b2007-11-24 06:13:33 +00001007
1008define i64 @test(double %X) {
1009 %Y = fptosi double %X to i64
1010 ret i64 %Y
1011}
1012
1013compiles to:
1014
1015_test:
1016 subl $20, %esp
1017 movsd 24(%esp), %xmm0
1018 movsd %xmm0, 8(%esp)
1019 fldl 8(%esp)
1020 fisttpll (%esp)
1021 movl 4(%esp), %edx
1022 movl (%esp), %eax
1023 addl $20, %esp
1024 #FP_REG_KILL
1025 ret
1026
1027This should just fldl directly from the input stack slot.
Chris Lattner10d54d12007-12-05 22:58:19 +00001028
1029//===---------------------------------------------------------------------===//
1030
1031This code:
1032int foo (int x) { return (x & 65535) | 255; }
1033
1034Should compile into:
1035
1036_foo:
1037 movzwl 4(%esp), %eax
Eli Friedman577c7492008-02-21 21:16:49 +00001038 orl $255, %eax
Chris Lattner10d54d12007-12-05 22:58:19 +00001039 ret
1040
1041instead of:
1042_foo:
1043 movl $255, %eax
1044 orl 4(%esp), %eax
1045 andl $65535, %eax
1046 ret
1047
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001048//===---------------------------------------------------------------------===//
1049
Chris Lattnereec7ac02008-02-21 06:51:29 +00001050We're codegen'ing multiply of long longs inefficiently:
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001051
Chris Lattnereec7ac02008-02-21 06:51:29 +00001052unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) {
1053 return arg1 * arg2;
1054}
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001055
Chris Lattnereec7ac02008-02-21 06:51:29 +00001056We compile to (fomit-frame-pointer):
1057
1058_LLM:
1059 pushl %esi
1060 movl 8(%esp), %ecx
1061 movl 16(%esp), %esi
1062 movl %esi, %eax
1063 mull %ecx
1064 imull 12(%esp), %esi
1065 addl %edx, %esi
1066 imull 20(%esp), %ecx
1067 movl %esi, %edx
1068 addl %ecx, %edx
1069 popl %esi
1070 ret
1071
1072This looks like a scheduling deficiency and lack of remat of the load from
1073the argument area. ICC apparently produces:
1074
1075 movl 8(%esp), %ecx
1076 imull 12(%esp), %ecx
1077 movl 16(%esp), %eax
1078 imull 4(%esp), %eax
1079 addl %eax, %ecx
1080 movl 4(%esp), %eax
1081 mull 12(%esp)
1082 addl %ecx, %edx
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001083 ret
1084
Chris Lattnereec7ac02008-02-21 06:51:29 +00001085Note that it remat'd loads from 4(esp) and 12(esp). See this GCC PR:
1086http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001087
1088//===---------------------------------------------------------------------===//
1089
Chris Lattner2b55ebd2007-12-24 19:27:46 +00001090We can fold a store into "zeroing a reg". Instead of:
1091
1092xorl %eax, %eax
1093movl %eax, 124(%esp)
1094
1095we should get:
1096
1097movl $0, 124(%esp)
1098
1099if the flags of the xor are dead.
1100
Chris Lattner459ff992008-01-11 18:00:13 +00001101Likewise, we isel "x<<1" into "add reg,reg". If reg is spilled, this should
1102be folded into: shl [mem], 1
1103
Chris Lattner2b55ebd2007-12-24 19:27:46 +00001104//===---------------------------------------------------------------------===//
Chris Lattner64400952007-12-28 21:50:40 +00001105
1106This testcase misses a read/modify/write opportunity (from PR1425):
1107
1108void vertical_decompose97iH1(int *b0, int *b1, int *b2, int width){
1109 int i;
1110 for(i=0; i<width; i++)
1111 b1[i] += (1*(b0[i] + b2[i])+0)>>0;
1112}
1113
1114We compile it down to:
1115
1116LBB1_2: # bb
1117 movl (%esi,%edi,4), %ebx
1118 addl (%ecx,%edi,4), %ebx
1119 addl (%edx,%edi,4), %ebx
1120 movl %ebx, (%ecx,%edi,4)
1121 incl %edi
1122 cmpl %eax, %edi
1123 jne LBB1_2 # bb
1124
1125the inner loop should add to the memory location (%ecx,%edi,4), saving
1126a mov. Something like:
1127
1128 movl (%esi,%edi,4), %ebx
1129 addl (%edx,%edi,4), %ebx
1130 addl %ebx, (%ecx,%edi,4)
1131
Chris Lattnerbde73102007-12-29 05:51:58 +00001132Here is another interesting example:
1133
1134void vertical_compose97iH1(int *b0, int *b1, int *b2, int width){
1135 int i;
1136 for(i=0; i<width; i++)
1137 b1[i] -= (1*(b0[i] + b2[i])+0)>>0;
1138}
1139
1140We miss the r/m/w opportunity here by using 2 subs instead of an add+sub[mem]:
1141
1142LBB9_2: # bb
1143 movl (%ecx,%edi,4), %ebx
1144 subl (%esi,%edi,4), %ebx
1145 subl (%edx,%edi,4), %ebx
1146 movl %ebx, (%ecx,%edi,4)
1147 incl %edi
1148 cmpl %eax, %edi
1149 jne LBB9_2 # bb
1150
Additionally, LSR should rewrite the exit condition of these loops to use
a stride-4 IV, which would allow all the scales in the loop to go away.
This would result in smaller code and more efficient microops.
1154
1155//===---------------------------------------------------------------------===//
Chris Lattner0362a362008-01-07 21:59:58 +00001156
1157In SSE mode, we turn abs and neg into a load from the constant pool plus a xor
1158or and instruction, for example:
1159
Chris Lattnerb4cbb682008-01-09 00:37:18 +00001160 xorpd LCPI1_0, %xmm2
Chris Lattner0362a362008-01-07 21:59:58 +00001161
1162However, if xmm2 gets spilled, we end up with really ugly code like this:
1163
Chris Lattnerb4cbb682008-01-09 00:37:18 +00001164 movsd (%esp), %xmm0
1165 xorpd LCPI1_0, %xmm0
1166 movsd %xmm0, (%esp)
Chris Lattner0362a362008-01-07 21:59:58 +00001167
1168Since we 'know' that this is a 'neg', we can actually "fold" the spill into
1169the neg/abs instruction, turning it into an *integer* operation, like this:
1170
1171 xorl 2147483648, [mem+4] ## 2147483648 = (1 << 31)
1172
1173you could also use xorb, but xorl is less likely to lead to a partial register
Chris Lattnerb4cbb682008-01-09 00:37:18 +00001174stall. Here is a contrived testcase:
1175
1176double a, b, c;
1177void test(double *P) {
1178 double X = *P;
1179 a = X;
1180 bar();
1181 X = -X;
1182 b = X;
1183 bar();
1184 c = X;
1185}
Chris Lattner0362a362008-01-07 21:59:58 +00001186
1187//===---------------------------------------------------------------------===//
Andrew Lenharth785610d2008-02-16 01:24:58 +00001188
1189handling llvm.memory.barrier on pre SSE2 cpus
1190
1191should generate:
1192lock ; mov %esp, %esp
1193
1194//===---------------------------------------------------------------------===//
Chris Lattner7644ff32008-02-17 19:43:57 +00001195
1196The generated code on x86 for checking for signed overflow on a multiply the
1197obvious way is much longer than it needs to be.
1198
1199int x(int a, int b) {
1200 long long prod = (long long)a*b;
1201 return prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1);
1202}
1203
1204See PR2053 for more details.
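
The two-operand imul already tells us this: OF (and CF) are set exactly when
the full signed product does not fit in 32 bits. So a sketch of the code we
would like is:

_x:
	movl    4(%esp), %eax
	imull   8(%esp), %eax     # OF set iff the product overflows 32 bits
	seto    %al
	movzbl  %al, %eax
	ret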
1205
1206//===---------------------------------------------------------------------===//
Chris Lattner83f22362008-02-18 18:30:13 +00001207
We should investigate using cdq/cltd (effect: edx = sar eax, 31)
1209more aggressively; it should cost the same as a move+shift on any modern
1210processor, but it's a lot shorter. Downside is that it puts more
1211pressure on register allocation because it has fixed operands.
1212
1213Example:
1214int abs(int x) {return x < 0 ? -x : x;}
1215
1216gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.:
1217abs:
1218 movl 4(%esp), %eax
1219 cltd
1220 xorl %edx, %eax
1221 subl %edx, %eax
1222 ret
1223
1224//===---------------------------------------------------------------------===//
1225
1226Consider:
Chris Lattner83f22362008-02-18 18:30:13 +00001227int test(unsigned long a, unsigned long b) { return -(a < b); }
1228
1229We currently compile this to:
1230
1231define i32 @test(i32 %a, i32 %b) nounwind {
1232 %tmp3 = icmp ult i32 %a, %b ; <i1> [#uses=1]
1233 %tmp34 = zext i1 %tmp3 to i32 ; <i32> [#uses=1]
1234 %tmp5 = sub i32 0, %tmp34 ; <i32> [#uses=1]
1235 ret i32 %tmp5
1236}
1237
1238and
1239
1240_test:
1241 movl 8(%esp), %eax
1242 cmpl %eax, 4(%esp)
1243 setb %al
1244 movzbl %al, %eax
1245 negl %eax
1246 ret
1247
1248Several deficiencies here. First, we should instcombine zext+neg into sext:
1249
1250define i32 @test2(i32 %a, i32 %b) nounwind {
1251 %tmp3 = icmp ult i32 %a, %b ; <i1> [#uses=1]
1252 %tmp34 = sext i1 %tmp3 to i32 ; <i32> [#uses=1]
1253 ret i32 %tmp34
1254}
1255
1256However, before we can do that, we have to fix the bad codegen that we get for
1257sext from bool:
1258
1259_test2:
1260 movl 8(%esp), %eax
1261 cmpl %eax, 4(%esp)
1262 setb %al
1263 movzbl %al, %eax
1264 shll $31, %eax
1265 sarl $31, %eax
1266 ret
1267
1268This code should be at least as good as the code above. Once this is fixed, we
1269can optimize this specific case even more to:
1270
1271 movl 8(%esp), %eax
1272 xorl %ecx, %ecx
1273 cmpl %eax, 4(%esp)
1274 sbbl %ecx, %ecx
1275
1276//===---------------------------------------------------------------------===//
Eli Friedman1aa1f2c2008-02-28 00:21:43 +00001277
1278Take the following code (from
1279http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541):
1280
1281extern unsigned char first_one[65536];
1282int FirstOnet(unsigned long long arg1)
1283{
1284 if (arg1 >> 48)
1285 return (first_one[arg1 >> 48]);
1286 return 0;
1287}
1288
1289
1290The following code is currently generated:
1291FirstOnet:
1292 movl 8(%esp), %eax
1293 cmpl $65536, %eax
1294 movl 4(%esp), %ecx
1295 jb .LBB1_2 # UnifiedReturnBlock
1296.LBB1_1: # ifthen
1297 shrl $16, %eax
1298 movzbl first_one(%eax), %eax
1299 ret
1300.LBB1_2: # UnifiedReturnBlock
1301 xorl %eax, %eax
1302 ret
1303
1304There are a few possible improvements here:
13051. We should be able to eliminate the dead load into %ecx
13062. We could change the "movl 8(%esp), %eax" into
1307 "movzwl 10(%esp), %eax"; this lets us change the cmpl
1308 into a testl, which is shorter, and eliminate the shift.
1309
1310We could also in theory eliminate the branch by using a conditional
1311for the address of the load, but that seems unlikely to be worthwhile
1312in general.
1313
1314//===---------------------------------------------------------------------===//
1315
Chris Lattner44a98ac2008-02-28 04:52:59 +00001316We compile this function:
1317
1318define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext %d) nounwind {
1319entry:
1320 %tmp2 = icmp eq i8 %d, 0 ; <i1> [#uses=1]
1321 br i1 %tmp2, label %bb7, label %bb
1322
1323bb: ; preds = %entry
1324 %tmp6 = add i32 %b, %a ; <i32> [#uses=1]
1325 ret i32 %tmp6
1326
1327bb7: ; preds = %entry
1328 %tmp10 = sub i32 %a, %c ; <i32> [#uses=1]
1329 ret i32 %tmp10
1330}
1331
1332to:
1333
1334_foo:
1335 cmpb $0, 16(%esp)
1336 movl 12(%esp), %ecx
1337 movl 8(%esp), %eax
1338 movl 4(%esp), %edx
1339 je LBB1_2 # bb7
1340LBB1_1: # bb
1341 addl %edx, %eax
1342 ret
1343LBB1_2: # bb7
1344 movl %edx, %eax
1345 subl %ecx, %eax
1346 ret
1347
Gabor Greif02661592008-03-06 10:51:21 +00001348The coalescer could coalesce "edx" with "eax" to avoid the movl in LBB1_2
Chris Lattner44a98ac2008-02-28 04:52:59 +00001349if it commuted the addl in LBB1_1.
1350
1351//===---------------------------------------------------------------------===//
Evan Cheng921dcba2008-03-28 07:07:06 +00001352
1353See rdar://4653682.
1354
1355From flops:
1356
1357LBB1_15: # bb310
1358 cvtss2sd LCPI1_0, %xmm1
1359 addsd %xmm1, %xmm0
1360 movsd 176(%esp), %xmm2
1361 mulsd %xmm0, %xmm2
1362 movapd %xmm2, %xmm3
1363 mulsd %xmm3, %xmm3
1364 movapd %xmm3, %xmm4
1365 mulsd LCPI1_23, %xmm4
1366 addsd LCPI1_24, %xmm4
1367 mulsd %xmm3, %xmm4
1368 addsd LCPI1_25, %xmm4
1369 mulsd %xmm3, %xmm4
1370 addsd LCPI1_26, %xmm4
1371 mulsd %xmm3, %xmm4
1372 addsd LCPI1_27, %xmm4
1373 mulsd %xmm3, %xmm4
1374 addsd LCPI1_28, %xmm4
1375 mulsd %xmm3, %xmm4
1376 addsd %xmm1, %xmm4
1377 mulsd %xmm2, %xmm4
1378 movsd 152(%esp), %xmm1
1379 addsd %xmm4, %xmm1
1380 movsd %xmm1, 152(%esp)
1381 incl %eax
1382 cmpl %eax, %esi
1383 jge LBB1_15 # bb310
1384LBB1_16: # bb358.loopexit
1385 movsd 152(%esp), %xmm0
1386 addsd %xmm0, %xmm0
1387 addsd LCPI1_22, %xmm0
1388 movsd %xmm0, 152(%esp)
1389
1390Rather than spilling the result of the last addsd in the loop, we should have
inserted a copy to split the interval (one for the duration of the loop, one
1392extending to the fall through). The register pressure in the loop isn't high
1393enough to warrant the spill.
1394
1395Also check why xmm7 is not used at all in the function.
Chris Lattner16e5c782008-04-21 04:46:30 +00001396
1397//===---------------------------------------------------------------------===//
1398
1399Legalize loses track of the fact that bools are always zero extended when in
1400memory. This causes us to compile abort_gzip (from 164.gzip) from:
1401
1402target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
1403target triple = "i386-apple-darwin8"
1404@in_exit.4870.b = internal global i1 false ; <i1*> [#uses=2]
1405define fastcc void @abort_gzip() noreturn nounwind {
1406entry:
1407 %tmp.b.i = load i1* @in_exit.4870.b ; <i1> [#uses=1]
1408 br i1 %tmp.b.i, label %bb.i, label %bb4.i
1409bb.i: ; preds = %entry
1410 tail call void @exit( i32 1 ) noreturn nounwind
1411 unreachable
1412bb4.i: ; preds = %entry
1413 store i1 true, i1* @in_exit.4870.b
1414 tail call void @exit( i32 1 ) noreturn nounwind
1415 unreachable
1416}
1417declare void @exit(i32) noreturn nounwind
1418
1419into:
1420
1421_abort_gzip:
1422 subl $12, %esp
1423 movb _in_exit.4870.b, %al
1424 notb %al
1425 testb $1, %al
1426 jne LBB1_2 ## bb4.i
1427LBB1_1: ## bb.i
1428 ...
1429
1430//===---------------------------------------------------------------------===//
Chris Lattner7cb1d332008-05-05 23:19:45 +00001431
1432We compile:
1433
1434int test(int x, int y) {
1435 return x-y-1;
1436}
1437
1438into (-m64):
1439
1440_test:
1441 decl %edi
1442 movl %edi, %eax
1443 subl %esi, %eax
1444 ret
1445
1446it would be better to codegen as: x+~y (notl+addl)
Edwin Törökfa9d5e22008-10-24 19:23:07 +00001447
1448//===---------------------------------------------------------------------===//
1449
1450This code:
1451
1452int foo(const char *str,...)
1453{
1454 __builtin_va_list a; int x;
1455 __builtin_va_start(a,str); x = __builtin_va_arg(a,int); __builtin_va_end(a);
1456 return x;
1457}
1458
1459gets compiled into this on x86-64:
1460 subq $200, %rsp
1461 movaps %xmm7, 160(%rsp)
1462 movaps %xmm6, 144(%rsp)
1463 movaps %xmm5, 128(%rsp)
1464 movaps %xmm4, 112(%rsp)
1465 movaps %xmm3, 96(%rsp)
1466 movaps %xmm2, 80(%rsp)
1467 movaps %xmm1, 64(%rsp)
1468 movaps %xmm0, 48(%rsp)
1469 movq %r9, 40(%rsp)
1470 movq %r8, 32(%rsp)
1471 movq %rcx, 24(%rsp)
1472 movq %rdx, 16(%rsp)
1473 movq %rsi, 8(%rsp)
1474 leaq (%rsp), %rax
1475 movq %rax, 192(%rsp)
1476 leaq 208(%rsp), %rax
1477 movq %rax, 184(%rsp)
1478 movl $48, 180(%rsp)
1479 movl $8, 176(%rsp)
1480 movl 176(%rsp), %eax
1481 cmpl $47, %eax
1482 jbe .LBB1_3 # bb
1483.LBB1_1: # bb3
1484 movq 184(%rsp), %rcx
1485 leaq 8(%rcx), %rax
1486 movq %rax, 184(%rsp)
1487.LBB1_2: # bb4
1488 movl (%rcx), %eax
1489 addq $200, %rsp
1490 ret
1491.LBB1_3: # bb
1492 movl %eax, %ecx
1493 addl $8, %eax
1494 addq 192(%rsp), %rcx
1495 movl %eax, 176(%rsp)
1496 jmp .LBB1_2 # bb4
1497
1498gcc 4.3 generates:
1499 subq $96, %rsp
1500.LCFI0:
1501 leaq 104(%rsp), %rax
1502 movq %rsi, -80(%rsp)
1503 movl $8, -120(%rsp)
1504 movq %rax, -112(%rsp)
1505 leaq -88(%rsp), %rax
1506 movq %rax, -104(%rsp)
1507 movl $8, %eax
1508 cmpl $48, %eax
1509 jb .L6
1510 movq -112(%rsp), %rdx
1511 movl (%rdx), %eax
1512 addq $96, %rsp
1513 ret
1514 .p2align 4,,10
1515 .p2align 3
1516.L6:
1517 mov %eax, %edx
1518 addq -104(%rsp), %rdx
1519 addl $8, %eax
1520 movl %eax, -120(%rsp)
1521 movl (%rdx), %eax
1522 addq $96, %rsp
1523 ret
1524
1525and it gets compiled into this on x86:
1526 pushl %ebp
1527 movl %esp, %ebp
1528 subl $4, %esp
1529 leal 12(%ebp), %eax
1530 movl %eax, -4(%ebp)
1531 leal 16(%ebp), %eax
1532 movl %eax, -4(%ebp)
1533 movl 12(%ebp), %eax
1534 addl $4, %esp
1535 popl %ebp
1536 ret
1537
1538gcc 4.3 generates:
1539 pushl %ebp
1540 movl %esp, %ebp
1541 movl 12(%ebp), %eax
1542 popl %ebp
1543 ret
Evan Chengbf97bec2008-11-11 17:35:52 +00001544
1545//===---------------------------------------------------------------------===//
1546
1547Teach tblgen not to check bitconvert source type in some cases. This allows us
1548to consolidate the following patterns in X86InstrMMX.td:
1549
1550def : Pat<(v2i32 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1551 (iPTR 0))))),
1552 (v2i32 (MMX_MOVDQ2Qrr VR128:$src))>;
1553def : Pat<(v4i16 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1554 (iPTR 0))))),
1555 (v4i16 (MMX_MOVDQ2Qrr VR128:$src))>;
1556def : Pat<(v8i8 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1557 (iPTR 0))))),
1558 (v8i8 (MMX_MOVDQ2Qrr VR128:$src))>;
1559
1560There are other cases in various td files.
Eli Friedman9ab1db02008-11-30 07:52:27 +00001561
1562//===---------------------------------------------------------------------===//
1563
1564Take something like the following on x86-32:
1565unsigned a(unsigned long long x, unsigned y) {return x % y;}
1566
1567We currently generate a libcall, but we really shouldn't: the expansion is
1568shorter and likely faster than the libcall. The expected code is something
1569like the following:
1570
1571 movl 12(%ebp), %eax
1572 movl 16(%ebp), %ecx
1573 xorl %edx, %edx
1574 divl %ecx
1575 movl 8(%ebp), %eax
1576 divl %ecx
1577 movl %edx, %eax
1578 ret
1579
1580A similar code sequence works for division.
1581
1582//===---------------------------------------------------------------------===//
Chris Lattnerbfccda62008-12-06 22:49:05 +00001583
1584These should compile to the same code, but the later codegen's to useless
1585instructions on X86. This may be a trivial dag combine (GCC PR7061):
1586
1587struct s1 { unsigned char a, b; };
1588unsigned long f1(struct s1 x) {
1589 return x.a + x.b;
1590}
1591struct s2 { unsigned a: 8, b: 8; };
1592unsigned long f2(struct s2 x) {
1593 return x.a + x.b;
1594}
1595
1596//===---------------------------------------------------------------------===//
1597
Chris Lattner9f34dc62009-02-08 20:44:19 +00001598We currently compile this:
1599
1600define i32 @func1(i32 %v1, i32 %v2) nounwind {
1601entry:
1602 %t = call {i32, i1} @llvm.sadd.with.overflow.i32(i32 %v1, i32 %v2)
1603 %sum = extractvalue {i32, i1} %t, 0
1604 %obit = extractvalue {i32, i1} %t, 1
1605 br i1 %obit, label %overflow, label %normal
1606normal:
1607 ret i32 %sum
1608overflow:
1609 call void @llvm.trap()
1610 unreachable
1611}
1612declare {i32, i1} @llvm.sadd.with.overflow.i32(i32, i32)
1613declare void @llvm.trap()
1614
1615to:
1616
1617_func1:
1618 movl 4(%esp), %eax
1619 addl 8(%esp), %eax
1620 jo LBB1_2 ## overflow
1621LBB1_1: ## normal
1622 ret
1623LBB1_2: ## overflow
1624 ud2
1625
1626it would be nice to produce "into" someday.
1627
1628//===---------------------------------------------------------------------===//
Chris Lattner09c650b2009-02-17 01:16:14 +00001629
1630This code:
1631
1632void vec_mpys1(int y[], const int x[], int scaler) {
1633int i;
1634for (i = 0; i < 150; i++)
1635 y[i] += (((long long)scaler * (long long)x[i]) >> 31);
1636}
1637
1638Compiles to this loop with GCC 3.x:
1639
1640.L5:
1641 movl %ebx, %eax
1642 imull (%edi,%ecx,4)
1643 shrdl $31, %edx, %eax
1644 addl %eax, (%esi,%ecx,4)
1645 incl %ecx
1646 cmpl $149, %ecx
1647 jle .L5
1648
1649llvm-gcc compiles it to the much uglier:
1650
1651LBB1_1: ## bb1
1652 movl 24(%esp), %eax
1653 movl (%eax,%edi,4), %ebx
1654 movl %ebx, %ebp
1655 imull %esi, %ebp
1656 movl %ebx, %eax
1657 mull %ecx
1658 addl %ebp, %edx
1659 sarl $31, %ebx
1660 imull %ecx, %ebx
1661 addl %edx, %ebx
1662 shldl $1, %eax, %ebx
1663 movl 20(%esp), %eax
1664 addl %ebx, (%eax,%edi,4)
1665 incl %edi
1666 cmpl $150, %edi
1667 jne LBB1_1 ## bb1
1668
Eli Friedman0427ac12009-12-21 08:03:16 +00001669The issue is that we hoist the cast of "scaler" to long long outside of the
1670loop, the value comes into the loop as two values, and
1671RegsForValue::getCopyFromRegs doesn't know how to put an AssertSext on the
1672constructed BUILD_PAIR which represents the cast value.
1673
Chris Lattner09c650b2009-02-17 01:16:14 +00001674//===---------------------------------------------------------------------===//
Chris Lattner8eca7c72009-03-08 01:54:43 +00001675
Dan Gohman5edb7382009-03-09 23:47:02 +00001676Test instructions can be eliminated by using EFLAGS values from arithmetic
Dan Gohman8bb7db62009-03-10 00:26:23 +00001677instructions. This is currently not done for mul, and, or, xor, neg, shl,
1678sra, srl, shld, shrd, atomic ops, and others. It is also currently not done
1679for read-modify-write instructions. It is also current not done if the
1680OF or CF flags are needed.
Dan Gohman5edb7382009-03-09 23:47:02 +00001681
1682The shift operators have the complication that when the shift count is
1683zero, EFLAGS is not set, so they can only subsume a test instruction if
Dan Gohman8bb7db62009-03-10 00:26:23 +00001684the shift count is known to be non-zero. Also, using the EFLAGS value
1685from a shift is apparently very slow on some x86 implementations.
Dan Gohman5edb7382009-03-09 23:47:02 +00001686
1687In read-modify-write instructions, the root node in the isel match is
1688the store, and isel has no way for the use of the EFLAGS result of the
1689arithmetic to be remapped to the new node.
1690
Add and subtract instructions set OF on signed overflow and CF on unsigned
1692overflow, while test instructions always clear OF and CF. In order to
1693replace a test with an add or subtract in a situation where OF or CF is
1694needed, codegen must be able to prove that the operation cannot see
1695signed or unsigned overflow, respectively.
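
A small motivating example for the 'and' case (a sketch):

int f(int x, int y) {
  int z = x & y;          /* andl already sets ZF according to z, so the   */
  return z == 0 ? 5 : z;  /* testl of z against zero before the cmov/branch
                             is redundant */
}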
1696
Dan Gohman5edb7382009-03-09 23:47:02 +00001697//===---------------------------------------------------------------------===//
1698
Chris Lattnerfe8b5592009-03-08 03:04:26 +00001699memcpy/memmove do not lower to SSE copies when possible. A silly example is:
1700define <16 x float> @foo(<16 x float> %A) nounwind {
1701 %tmp = alloca <16 x float>, align 16
1702 %tmp2 = alloca <16 x float>, align 16
1703 store <16 x float> %A, <16 x float>* %tmp
1704 %s = bitcast <16 x float>* %tmp to i8*
1705 %s2 = bitcast <16 x float>* %tmp2 to i8*
1706 call void @llvm.memcpy.i64(i8* %s, i8* %s2, i64 64, i32 16)
1707 %R = load <16 x float>* %tmp2
1708 ret <16 x float> %R
1709}
1710
1711declare void @llvm.memcpy.i64(i8* nocapture, i8* nocapture, i64, i32) nounwind
1712
1713which compiles to:
1714
1715_foo:
1716 subl $140, %esp
1717 movaps %xmm3, 112(%esp)
1718 movaps %xmm2, 96(%esp)
1719 movaps %xmm1, 80(%esp)
1720 movaps %xmm0, 64(%esp)
1721 movl 60(%esp), %eax
1722 movl %eax, 124(%esp)
1723 movl 56(%esp), %eax
1724 movl %eax, 120(%esp)
1725 movl 52(%esp), %eax
1726 <many many more 32-bit copies>
1727 movaps (%esp), %xmm0
1728 movaps 16(%esp), %xmm1
1729 movaps 32(%esp), %xmm2
1730 movaps 48(%esp), %xmm3
1731 addl $140, %esp
1732 ret
1733
1734On Nehalem, it may even be cheaper to just use movups when unaligned than to
1735fall back to lower-granularity chunks.
1736
1737//===---------------------------------------------------------------------===//
Chris Lattnere00a4e02009-05-25 20:28:19 +00001738
1739Implement processor-specific optimizations for parity with GCC on these
1740processors. GCC does two optimizations:
1741
17421. ix86_pad_returns inserts a noop before ret instructions if immediately
   preceded by a conditional branch or is the target of a jump.
17442. ix86_avoid_jump_misspredicts inserts noops in cases where a 16-byte block of
1745 code contains more than 3 branches.
1746
1747The first one is done for all AMDs, Core2, and "Generic"
1748The second one is done for: Atom, Pentium Pro, all AMDs, Pentium 4, Nocona,
1749 Core 2, and "Generic"
1750
1751//===---------------------------------------------------------------------===//
Eli Friedmandd959242009-06-11 23:07:04 +00001752
1753Testcase:
1754int a(int x) { return (x & 127) > 31; }
1755
1756Current output:
1757 movl 4(%esp), %eax
1758 andl $127, %eax
1759 cmpl $31, %eax
1760 seta %al
1761 movzbl %al, %eax
1762 ret
1763
1764Ideal output:
1765 xorl %eax, %eax
1766 testl $96, 4(%esp)
1767 setne %al
1768 ret
1769
Chris Lattner20610972009-06-16 06:11:35 +00001770This should definitely be done in instcombine, canonicalizing the range
1771condition into a != condition. We get this IR:
1772
1773define i32 @a(i32 %x) nounwind readnone {
1774entry:
1775 %0 = and i32 %x, 127 ; <i32> [#uses=1]
1776 %1 = icmp ugt i32 %0, 31 ; <i1> [#uses=1]
1777 %2 = zext i1 %1 to i32 ; <i32> [#uses=1]
1778 ret i32 %2
1779}
1780
1781Instcombine prefers to strength reduce relational comparisons to equality
1782comparisons when possible, this should be another case of that. This could
1783be handled pretty easily in InstCombiner::visitICmpInstWithInstAndIntCst, but it
1784looks like InstCombiner::visitICmpInstWithInstAndIntCst should really already
1785be redesigned to use ComputeMaskedBits and friends.
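
Either way, the canonicalized IR would look something like this (a sketch;
96 is the part of the 127 mask above 31):

define i32 @a(i32 %x) nounwind readnone {
entry:
  %0 = and i32 %x, 96
  %1 = icmp ne i32 %0, 0
  %2 = zext i1 %1 to i32
  ret i32 %2
}

which should codegen to the ideal xorl/testl/setne sequence above.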
1786
Eli Friedmandd959242009-06-11 23:07:04 +00001787
1788//===---------------------------------------------------------------------===//
1789Testcase:
1790int x(int a) { return (a&0xf0)>>4; }
1791
1792Current output:
1793 movl 4(%esp), %eax
1794 shrl $4, %eax
1795 andl $15, %eax
1796 ret
1797
1798Ideal output:
1799 movzbl 4(%esp), %eax
1800 shrl $4, %eax
1801 ret
1802
1803//===---------------------------------------------------------------------===//
1804
1805Testcase:
1806int x(int a) { return (a & 0x80) ? 0x100 : 0; }
Chris Lattner498e6ad2009-06-16 06:15:56 +00001807int y(int a) { return (a & 0x80) *2; }
Eli Friedmandd959242009-06-11 23:07:04 +00001808
Chris Lattner498e6ad2009-06-16 06:15:56 +00001809Current:
Eli Friedmandd959242009-06-11 23:07:04 +00001810 testl $128, 4(%esp)
1811 setne %al
1812 movzbl %al, %eax
1813 shll $8, %eax
1814 ret
1815
Chris Lattner498e6ad2009-06-16 06:15:56 +00001816Better:
Eli Friedmandd959242009-06-11 23:07:04 +00001817 movl 4(%esp), %eax
1818 addl %eax, %eax
1819 andl $256, %eax
1820 ret
1821
Chris Lattner498e6ad2009-06-16 06:15:56 +00001822This is another general instcombine transformation that is profitable on all
1823targets. In LLVM IR, these functions look like this:
1824
1825define i32 @x(i32 %a) nounwind readnone {
1826entry:
1827 %0 = and i32 %a, 128
1828 %1 = icmp eq i32 %0, 0
1829 %iftmp.0.0 = select i1 %1, i32 0, i32 256
1830 ret i32 %iftmp.0.0
1831}
1832
1833define i32 @y(i32 %a) nounwind readnone {
1834entry:
1835 %0 = shl i32 %a, 1
1836 %1 = and i32 %0, 256
1837 ret i32 %1
1838}
1839
1840Replacing an icmp+select with a shift should always be considered profitable in
1841instcombine.
Eli Friedmandd959242009-06-11 23:07:04 +00001842
1843//===---------------------------------------------------------------------===//
Evan Cheng454211e2009-07-30 08:56:19 +00001844
1845Re-implement atomic builtins __sync_add_and_fetch() and __sync_sub_and_fetch
1846properly.
1847
1848When the return value is not used (i.e. only care about the value in the
1849memory), x86 does not have to use add to implement these. Instead, it can use
1850add, sub, inc, dec instructions with the "lock" prefix.
1851
1852This is currently implemented using a bit of instruction selection trick. The
1853issue is the target independent pattern produces one output and a chain and we
1854want to map it into one that just output a chain. The current trick is to select
1855it into a MERGE_VALUES with the first definition being an implicit_def. The
1856proper solution is to add new ISD opcodes for the no-output variant. DAG
1857combiner can then transform the node before it gets to target node selection.
1858
1859Problem #2 is we are adding a whole bunch of x86 atomic instructions when in
1860fact these instructions are identical to the non-lock versions. We need a way to
1861add target specific information to target nodes and have this information
1862carried over to machine instructions. Asm printer (or JIT) can use this
1863information to add the "lock" prefix.
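
For example, when the result of __sync_add_and_fetch is ignored:

void inc(volatile int *p) {
  __sync_add_and_fetch(p, 1);
}

we would like to emit just (x86-64 sketch):

	lock
	incl    (%rdi)
	ret

instead of, say, a lock xaddl plus an extra add to materialize the unused
return value.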
Bill Wendling85196b52009-10-27 22:34:43 +00001864
1865//===---------------------------------------------------------------------===//
Eli Friedman2a81a582010-02-10 21:26:04 +00001866
1867_Bool bar(int *x) { return *x & 1; }
1868
1869define zeroext i1 @bar(i32* nocapture %x) nounwind readonly {
1870entry:
1871 %tmp1 = load i32* %x ; <i32> [#uses=1]
1872 %and = and i32 %tmp1, 1 ; <i32> [#uses=1]
1873 %tobool = icmp ne i32 %and, 0 ; <i1> [#uses=1]
1874 ret i1 %tobool
1875}
1876
1877bar: # @bar
1878# BB#0: # %entry
1879 movl 4(%esp), %eax
1880 movb (%eax), %al
1881 andb $1, %al
1882 movzbl %al, %eax
1883 ret
1884
1885Missed optimization: should be movl+andl.
1886
1887//===---------------------------------------------------------------------===//
1888
1889Consider the following two functions compiled with clang:
1890_Bool foo(int *x) { return !(*x & 4); }
1891unsigned bar(int *x) { return !(*x & 4); }
1892
1893foo:
1894 movl 4(%esp), %eax
1895 testb $4, (%eax)
1896 sete %al
1897 movzbl %al, %eax
1898 ret
1899
1900bar:
1901 movl 4(%esp), %eax
1902 movl (%eax), %eax
1903 shrl $2, %eax
1904 andl $1, %eax
1905 xorl $1, %eax
1906 ret
1907
The second function generates more code even though the two functions are
functionally identical.
1910
1911//===---------------------------------------------------------------------===//
1912
1913Take the following C code:
1914int x(int y) { return (y & 63) << 14; }
1915
1916Code produced by gcc:
1917 andl $63, %edi
1918 sall $14, %edi
1919 movl %edi, %eax
1920 ret
1921
1922Code produced by clang:
1923 shll $14, %edi
1924 movl %edi, %eax
1925 andl $1032192, %eax
1926 ret
1927
1928The code produced by gcc is 3 bytes shorter. This sort of construct often
1929shows up with bitfields.
1930
1931//===---------------------------------------------------------------------===//