//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

We should add support for the "movbe" instruction, which does a byte-swapping
copy (3-addr bswap + memory support?). This is available on Atom processors.
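
As a rough sketch of what this buys (hypothetical function, not current
compiler output); note that movbe only has load and store forms, no reg-reg
form:

unsigned load_be32(unsigned *p) { return __builtin_bswap32(*p); }

currently something like:
	movl	(%eax), %eax
	bswapl	%eax

with movbe:
	movbe	(%eax), %eax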

//===---------------------------------------------------------------------===//

CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move. The X86
backend knows how to three-addressify this shift, but it appears the register
allocator isn't even asking it to do so in this case. We should investigate
why this isn't happening; it could have a significant impact on other important
cases for X86 as well.
15
16//===---------------------------------------------------------------------===//
17
This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
  return X/Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
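
For reference, divl already implements exactly this operation: it divides the
64-bit value in EDX:EAX by a 32-bit operand. A hand-written sketch of the
desired output (cdecl argument layout assumed); the #DE fault when the
quotient does not fit in 32 bits is the overflow question mentioned above:

_test:
	movl	4(%esp), %eax        # low half of X
	movl	8(%esp), %edx        # high half of X
	divl	12(%esp)             # EDX:EAX / Y, quotient -> EAX, remainder -> EDX
	ret                          # faults (#DE) if the quotient doesn't fit in 32 bits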
26
27//===---------------------------------------------------------------------===//
28
29Improvements to the multiply -> shift/add algorithm:
30http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
31
32//===---------------------------------------------------------------------===//
33
34Improve code like this (occurs fairly frequently, e.g. in LLVM):
35long long foo(int x) { return 1LL << x; }
36
37http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
38http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
39http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html
40
41Another useful one would be ~0ULL >> X and ~0ULL << X.
42
43One better solution for 1LL << x is:
44 xorl %eax, %eax
45 xorl %edx, %edx
46 testb $32, %cl
47 sete %al
48 setne %dl
49 sall %cl, %eax
50 sall %cl, %edx
51
52But that requires good 8-bit subreg support.
53
Eli Friedman577c7492008-02-21 21:16:49 +000054Also, this might be better. It's an extra shift, but it's one instruction
55shorter, and doesn't stress 8-bit subreg support.
56(From http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html,
57but without the unnecessary and.)
58 movl %ecx, %eax
59 shrl $5, %eax
60 movl %eax, %edx
61 xorl $1, %edx
62 sall %cl, %eax
	sall %cl, %edx
64
Dan Gohmanf17a25c2007-07-18 16:29:46 +00006564-bit shifts (in general) expand to really bad code. Instead of using
66cmovs, we should expand to a conditional branch like GCC produces.
67
68//===---------------------------------------------------------------------===//
69
70Compile this:
71_Bool f(_Bool a) { return a!=1; }
72
73into:
74 movzbl %dil, %eax
75 xorl $1, %eax
76 ret
77
Eli Friedman577c7492008-02-21 21:16:49 +000078(Although note that this isn't a legal way to express the code that llvm-gcc
79currently generates for that function.)
80
Dan Gohmanf17a25c2007-07-18 16:29:46 +000081//===---------------------------------------------------------------------===//
82
83Some isel ideas:
84
1. Dynamic programming based approach when compile time is not an
   issue.
872. Code duplication (addressing mode) during isel.
883. Other ideas from "Register-Sensitive Selection, Duplication, and
89 Sequencing of Instructions".
904. Scheduling for reduced register pressure. E.g. "Minimum Register
91 Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
92 and other related papers.
93 http://citeseer.ist.psu.edu/govindarajan01minimum.html
94
95//===---------------------------------------------------------------------===//
96
97Should we promote i16 to i32 to avoid partial register update stalls?
98
99//===---------------------------------------------------------------------===//
100
101Leave any_extend as pseudo instruction and hint to register
102allocator. Delay codegen until post register allocation.
Evan Chengfdbb6672007-10-12 18:22:55 +0000103Note. any_extend is now turned into an INSERT_SUBREG. We still need to teach
104the coalescer how to deal with it though.
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000105
106//===---------------------------------------------------------------------===//
107
It appears icc uses push for parameter passing. Need to investigate.
109
110//===---------------------------------------------------------------------===//
111
112Only use inc/neg/not instructions on processors where they are faster than
113add/sub/xor. They are slower on the P4 due to only updating some processor
114flags.
115
116//===---------------------------------------------------------------------===//
117
The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
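
A small illustration of the case meant here (hypothetical function names); the
two functions compute the same thing, but only in the second one does the load
appear as the second operand of the compare:

int lhs_load(int x, int *p) { return *p < x; }   // load is the first operand of the setcc
int rhs_load(int x, int *p) { return x > *p; }   // same result, load is the second operand

rhs_load matches (cmp reg, (load p)) as written, so the load can be folded into
a cmpl with a memory operand; lhs_load only gets the same code once the
combiner swaps the operands and the condition (setlt <-> setgt), which is the
canonicalization being asked for.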
123
124//===---------------------------------------------------------------------===//
125
126How about intrinsics? An example is:
127 *res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));
128
129compiles to
130 pmuludq (%eax), %xmm0
131 movl 8(%esp), %eax
132 movdqa (%eax), %xmm1
133 pmulhuw %xmm0, %xmm1
134
The transformation probably requires an X86-specific pass or a DAG combiner
target-specific hook.
137
138//===---------------------------------------------------------------------===//
139
140In many cases, LLVM generates code like this:
141
142_test:
143 movl 8(%esp), %eax
144 cmpl %eax, 4(%esp)
145 setl %al
146 movzbl %al, %eax
147 ret
148
149on some processors (which ones?), it is more efficient to do this:
150
151_test:
152 movl 8(%esp), %ebx
153 xor %eax, %eax
154 cmpl %ebx, 4(%esp)
155 setl %al
156 ret
157
158Doing this correctly is tricky though, as the xor clobbers the flags.
159
160//===---------------------------------------------------------------------===//
161
We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:

void setbit(int *target, int bit) {
  *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
  *target &= ~(1 << bit);
}
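
A sketch of the kind of output meant here (assuming the bit index is in 0..31,
which the C shift requires anyway):

_setbit:
	movl	4(%esp), %eax
	movl	8(%esp), %ecx
	btsl	%ecx, (%eax)         # set bit 'bit' of *target
	ret
_clearbit:
	movl	4(%esp), %eax
	movl	8(%esp), %ecx
	btrl	%ecx, (%eax)         # clear bit 'bit' of *target
	ret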
171
172//===---------------------------------------------------------------------===//
173
174Instead of the following for memset char*, 1, 10:
175
176 movl $16843009, 4(%edx)
177 movl $16843009, (%edx)
178 movw $257, 8(%edx)
179
180It might be better to generate
181
	movl $16843009, %eax
	movl %eax, 4(%edx)
	movl %eax, (%edx)
	movw %ax, 8(%edx)
186
187when we can spare a register. It reduces code size.
188
189//===---------------------------------------------------------------------===//
190
191Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
192get this:
193
Eli Friedman1aa1f2c2008-02-28 00:21:43 +0000194define i32 @test1(i32 %X) {
195 %Y = sdiv i32 %X, 8
196 ret i32 %Y
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000197}
198
199_test1:
200 movl 4(%esp), %eax
201 movl %eax, %ecx
202 sarl $31, %ecx
203 shrl $29, %ecx
204 addl %ecx, %eax
205 sarl $3, %eax
206 ret
207
208GCC knows several different ways to codegen it, one of which is this:
209
210_test1:
211 movl 4(%esp), %eax
212 cmpl $-1, %eax
213 leal 7(%eax), %ecx
214 cmovle %ecx, %eax
215 sarl $3, %eax
216 ret
217
218which is probably slower, but it's interesting at least :)
219
220//===---------------------------------------------------------------------===//
221
We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and rep/movsl.
We should leave these as libcalls for everything over a much lower threshold,
since libc is hand tuned for medium and large mem ops (avoiding RFO for large
stores, TLB preheating, etc.)
226
227//===---------------------------------------------------------------------===//
228
Optimize this into something reasonable:
 x * copysign(1.0, y) * copysign(1.0, z)
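
One plausible "reasonable" form, written out in C (hypothetical name, and
ignoring sNaN quieting / FP exception flags): the two copysign factors only
decide whether the sign of x flips, so the product reduces to xoring the sign
bits of y and z into x:

#include <string.h>

double fold_signs(double x, double y, double z) {
  unsigned long long xb, yb, zb;
  memcpy(&xb, &x, 8); memcpy(&yb, &y, 8); memcpy(&zb, &z, 8);
  xb ^= (yb ^ zb) & 0x8000000000000000ULL;  // flip x's sign iff signbit(y) != signbit(z)
  memcpy(&x, &xb, 8);
  return x;
}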
231
232//===---------------------------------------------------------------------===//
233
234Optimize copysign(x, *y) to use an integer load from y.
235
236//===---------------------------------------------------------------------===//
237
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000238The following tests perform worse with LSR:
239
240lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.
241
242//===---------------------------------------------------------------------===//
243
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000244Teach the coalescer to coalesce vregs of different register classes. e.g. FR32 /
245FR64 to VR128.
246
247//===---------------------------------------------------------------------===//
248
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000249Adding to the list of cmp / test poor codegen issues:
250
251int test(__m128 *A, __m128 *B) {
252 if (_mm_comige_ss(*A, *B))
253 return 3;
254 else
255 return 4;
256}
257
258_test:
259 movl 8(%esp), %eax
260 movaps (%eax), %xmm0
261 movl 4(%esp), %eax
262 movaps (%eax), %xmm1
263 comiss %xmm0, %xmm1
264 setae %al
265 movzbl %al, %ecx
266 movl $3, %eax
267 movl $4, %edx
268 cmpl $0, %ecx
269 cmove %edx, %eax
270 ret
271
Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There
are a number of issues. 1) We are introducing a setcc between the result of the
intrinsic call and select. 2) The intrinsic is expected to produce an i32 value
so an any_extend (which becomes a zero extend) is added.
276
277We probably need some kind of target DAG combine hook to fix this.
278
279//===---------------------------------------------------------------------===//
280
281We generate significantly worse code for this than GCC:
282http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
283http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701
284
285There is also one case we do worse on PPC.
286
287//===---------------------------------------------------------------------===//
288
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000289For this:
290
291int test(int a)
292{
293 return a * 3;
294}
295
We currently emit:
	imull $3, 4(%esp), %eax

Is this really what we should generate? Is imull three or four
cycles? Note: ICC generates this:
301 movl 4(%esp), %eax
302 leal (%eax,%eax,2), %eax
303
304The current instruction priority is based on pattern complexity. The former is
305more "complex" because it folds a load so the latter will not be emitted.
306
307Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
308should always try to match LEA first since the LEA matching code does some
309estimate to determine whether the match is profitable.
310
311However, if we care more about code size, then imull is better. It's two bytes
312shorter than movl + leal.
313
Eli Friedman9ab1db02008-11-30 07:52:27 +0000314On a Pentium M, both variants have the same characteristics with regard
315to throughput; however, the multiplication has a latency of four cycles, as
316opposed to two cycles for the movl+lea variant.
317
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000318//===---------------------------------------------------------------------===//
319
Eli Friedman577c7492008-02-21 21:16:49 +0000320__builtin_ffs codegen is messy.
Chris Lattnera86af9a2007-08-11 18:19:07 +0000321
Chris Lattnera86af9a2007-08-11 18:19:07 +0000322int ffs_(unsigned X) { return __builtin_ffs(X); }
323
Eli Friedman577c7492008-02-21 21:16:49 +0000324llvm produces:
325ffs_:
326 movl 4(%esp), %ecx
327 bsfl %ecx, %eax
328 movl $32, %edx
329 cmove %edx, %eax
330 incl %eax
331 xorl %edx, %edx
332 testl %ecx, %ecx
333 cmove %edx, %eax
Chris Lattnera86af9a2007-08-11 18:19:07 +0000334 ret
Eli Friedman577c7492008-02-21 21:16:49 +0000335
336vs gcc:
337
Chris Lattnera86af9a2007-08-11 18:19:07 +0000338_ffs_:
339 movl $-1, %edx
340 bsfl 4(%esp), %eax
341 cmove %edx, %eax
342 addl $1, %eax
343 ret
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000344
Eli Friedman577c7492008-02-21 21:16:49 +0000345Another example of __builtin_ffs (use predsimplify to eliminate a select):
346
347int foo (unsigned long j) {
348 if (j)
349 return __builtin_ffs (j) - 1;
350 else
351 return 0;
352}
353
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000354//===---------------------------------------------------------------------===//
355
It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.
361
362//===---------------------------------------------------------------------===//
363
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000364define i32 @foo(i32* %a, i32 %t) {
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000365entry:
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000366 br label %cond_true
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000367
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000368cond_true: ; preds = %cond_true, %entry
369 %x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ] ; <i32> [#uses=3]
370 %t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ] ; <i32> [#uses=1]
371 %tmp2 = getelementptr i32* %a, i32 %x.0.0 ; <i32*> [#uses=1]
372 %tmp3 = load i32* %tmp2 ; <i32> [#uses=1]
373 %tmp5 = add i32 %t_addr.0.0, %x.0.0 ; <i32> [#uses=1]
374 %tmp7 = add i32 %tmp5, %tmp3 ; <i32> [#uses=2]
375 %tmp9 = add i32 %x.0.0, 1 ; <i32> [#uses=2]
376 %tmp = icmp sgt i32 %tmp9, 39 ; <i1> [#uses=1]
377 br i1 %tmp, label %bb12, label %cond_true
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000378
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000379bb12: ; preds = %cond_true
380 ret i32 %tmp7
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000381}
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000382is pessimized by -loop-reduce and -indvars
383
384//===---------------------------------------------------------------------===//
385
386u32 to float conversion improvement:
387
388float uint32_2_float( unsigned u ) {
389 float fl = (int) (u & 0xffff);
390 float fh = (int) (u >> 16);
391 fh *= 0x1.0p16f;
392 return fh + fl;
393}
394
39500000000 subl $0x04,%esp
39600000003 movl 0x08(%esp,1),%eax
39700000007 movl %eax,%ecx
39800000009 shrl $0x10,%ecx
3990000000c cvtsi2ss %ecx,%xmm0
40000000010 andl $0x0000ffff,%eax
40100000015 cvtsi2ss %eax,%xmm1
40200000019 mulss 0x00000078,%xmm0
40300000021 addss %xmm1,%xmm0
40400000025 movss %xmm0,(%esp,1)
4050000002a flds (%esp,1)
4060000002d addl $0x04,%esp
40700000030 ret
408
409//===---------------------------------------------------------------------===//
410
When using the fastcc ABI, align the stack slot of a double argument on an
8-byte boundary to improve performance.
413
414//===---------------------------------------------------------------------===//
415
416Codegen:
417
418int f(int a, int b) {
419 if (a == 4 || a == 6)
420 b++;
421 return b;
422}
423
424
as:

or eax, 2
cmp eax, 6
jz label
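
This works because 4 and 6 differ only in bit 1: forcing that bit on maps
exactly {4, 6} to 6 and nothing else, i.e. the equivalent source-level test is:

int f(int a, int b) {
  if ((a | 2) == 6)      // true exactly when a == 4 || a == 6
    b++;
  return b;
}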
430
431//===---------------------------------------------------------------------===//
432
433GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
434simplifications for integer "x cmp y ? a : b". For example, instead of:
435
436int G;
437void f(int X, int Y) {
438 G = X < 0 ? 14 : 13;
439}
440
441compiling to:
442
443_f:
444 movl $14, %eax
445 movl $13, %ecx
446 movl 4(%esp), %edx
447 testl %edx, %edx
448 cmovl %eax, %ecx
449 movl %ecx, _G
450 ret
451
452it could be:
453_f:
454 movl 4(%esp), %eax
455 sarl $31, %eax
456 notl %eax
457 addl $14, %eax
458 movl %eax, _G
459 ret
460
461etc.
462
Chris Lattnere7037c22007-11-02 17:04:20 +0000463Another is:
464int usesbb(unsigned int a, unsigned int b) {
465 return (a < b ? -1 : 0);
466}
467to:
468_usesbb:
469 movl 8(%esp), %eax
470 cmpl %eax, 4(%esp)
471 sbbl %eax, %eax
472 ret
473
474instead of:
475_usesbb:
476 xorl %eax, %eax
477 movl 8(%esp), %ecx
478 cmpl %ecx, 4(%esp)
479 movl $4294967295, %ecx
480 cmovb %ecx, %eax
481 ret
482
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000483//===---------------------------------------------------------------------===//
484
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000485Consider the expansion of:
486
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000487define i32 @test3(i32 %X) {
488 %tmp1 = urem i32 %X, 255
489 ret i32 %tmp1
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000490}
491
492Currently it compiles to:
493
494...
495 movl $2155905153, %ecx
496 movl 8(%esp), %esi
497 movl %esi, %eax
498 mull %ecx
499...
500
501This could be "reassociated" into:
502
503 movl $2155905153, %eax
504 movl 8(%esp), %ecx
505 mull %ecx
506
507to avoid the copy. In fact, the existing two-address stuff would do this
508except that mul isn't a commutative 2-addr instruction. I guess this has
509to be done at isel time based on the #uses to mul?
510
511//===---------------------------------------------------------------------===//
512
Make sure the instruction which starts a loop does not cross a cacheline
boundary. This requires knowing the exact length of each machine instruction.
That is somewhat complicated, but doable. Example 256.bzip2:
516
517In the new trace, the hot loop has an instruction which crosses a cacheline
518boundary. In addition to potential cache misses, this can't help decoding as I
519imagine there has to be some kind of complicated decoder reset and realignment
520to grab the bytes from the next cacheline.
521
522532 532 0x3cfc movb (1809(%esp, %esi), %bl <<<--- spans 2 64 byte lines
Eli Friedman9ab1db02008-11-30 07:52:27 +0000523942 942 0x3d03 movl %dh, (1809(%esp, %esi)
524937 937 0x3d0a incl %esi
5253 3 0x3d0b cmpb %bl, %dl
Dan Gohmanf17a25c2007-07-18 16:29:46 +000052627 27 0x3d0d jnz 0x000062db <main+11707>
527
528//===---------------------------------------------------------------------===//
529
530In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.
531
532//===---------------------------------------------------------------------===//
533
This could be a single 16-bit load.

int f(char *p) {
  if ((p[0] == 1) & (p[1] == 2)) return 1;
  return 0;
}
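
Both bytes are loaded unconditionally (the & does not short-circuit), so the
loads can be merged; a sketch of the desired output on a little-endian target:

_f:
	movl	4(%esp), %eax
	cmpw	$0x0201, (%eax)      # p[1]:p[0] == 0x02:0x01
	sete	%al
	movzbl	%al, %eax
	ret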
540
541//===---------------------------------------------------------------------===//
542
543We should inline lrintf and probably other libc functions.
544
545//===---------------------------------------------------------------------===//
546
547Start using the flags more. For example, compile:
548
549int add_zf(int *x, int y, int a, int b) {
550 if ((*x += y) == 0)
551 return a;
552 else
553 return b;
554}
555
556to:
557 addl %esi, (%rdi)
558 movl %edx, %eax
559 cmovne %ecx, %eax
560 ret
561instead of:
562
563_add_zf:
564 addl (%rdi), %esi
565 movl %esi, (%rdi)
566 testl %esi, %esi
567 cmove %edx, %ecx
568 movl %ecx, %eax
569 ret
570
571and:
572
573int add_zf(int *x, int y, int a, int b) {
574 if ((*x + y) < 0)
575 return a;
576 else
577 return b;
578}
579
580to:
581
582add_zf:
583 addl (%rdi), %esi
584 movl %edx, %eax
585 cmovns %ecx, %eax
586 ret
587
588instead of:
589
590_add_zf:
591 addl (%rdi), %esi
592 testl %esi, %esi
593 cmovs %edx, %ecx
594 movl %ecx, %eax
595 ret
596
597//===---------------------------------------------------------------------===//
598
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000599These two functions have identical effects:
600
601unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
602unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
603
604We currently compile them to:
605
606_f:
607 movl 4(%esp), %eax
608 movl %eax, %ecx
609 incl %ecx
610 movl 8(%esp), %edx
611 cmpl %edx, %ecx
612 jne LBB1_2 #UnifiedReturnBlock
613LBB1_1: #cond_true
614 addl $2, %eax
615 ret
616LBB1_2: #UnifiedReturnBlock
617 movl %ecx, %eax
618 ret
619_f2:
620 movl 4(%esp), %eax
621 movl %eax, %ecx
622 incl %ecx
623 cmpl 8(%esp), %ecx
624 sete %cl
625 movzbl %cl, %ecx
626 leal 1(%ecx,%eax), %eax
627 ret
628
629both of which are inferior to GCC's:
630
631_f:
632 movl 4(%esp), %edx
633 leal 1(%edx), %eax
634 addl $2, %edx
635 cmpl 8(%esp), %eax
636 cmove %edx, %eax
637 ret
638_f2:
639 movl 4(%esp), %eax
640 addl $1, %eax
641 xorl %edx, %edx
642 cmpl 8(%esp), %eax
643 sete %dl
644 addl %edx, %eax
645 ret
646
647//===---------------------------------------------------------------------===//
648
649This code:
650
651void test(int X) {
652 if (X) abort();
653}
654
655is currently compiled to:
656
657_test:
658 subl $12, %esp
659 cmpl $0, 16(%esp)
660 jne LBB1_1
661 addl $12, %esp
662 ret
663LBB1_1:
664 call L_abort$stub
665
666It would be better to produce:
667
668_test:
669 subl $12, %esp
670 cmpl $0, 16(%esp)
671 jne L_abort$stub
672 addl $12, %esp
673 ret
674
675This can be applied to any no-return function call that takes no arguments etc.
676Alternatively, the stack save/restore logic could be shrink-wrapped, producing
677something like this:
678
679_test:
680 cmpl $0, 4(%esp)
681 jne LBB1_1
682 ret
683LBB1_1:
684 subl $12, %esp
685 call L_abort$stub
686
687Both are useful in different situations. Finally, it could be shrink-wrapped
688and tail called, like this:
689
690_test:
691 cmpl $0, 4(%esp)
692 jne LBB1_1
693 ret
694LBB1_1:
695 pop %eax # realign stack.
696 call L_abort$stub
697
698Though this probably isn't worth it.
699
700//===---------------------------------------------------------------------===//
701
702We need to teach the codegen to convert two-address INC instructions to LEA
Chris Lattner0d64ec32007-08-11 18:16:46 +0000703when the flags are dead (likewise dec). For example, on X86-64, compile:
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000704
705int foo(int A, int B) {
706 return A+1;
707}
708
709to:
710
711_foo:
712 leal 1(%edi), %eax
713 ret
714
715instead of:
716
717_foo:
718 incl %edi
719 movl %edi, %eax
720 ret
721
722Another example is:
723
724;; X's live range extends beyond the shift, so the register allocator
725;; cannot coalesce it with Y. Because of this, a copy needs to be
726;; emitted before the shift to save the register value before it is
727;; clobbered. However, this copy is not needed if the register
728;; allocator turns the shift into an LEA. This also occurs for ADD.
729
730; Check that the shift gets turned into an LEA.
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000731; RUN: llvm-as < %s | llc -march=x86 -x86-asm-syntax=intel | \
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000732; RUN: not grep {mov E.X, E.X}
733
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000734@G = external global i32 ; <i32*> [#uses=3]
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000735
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000736define i32 @test1(i32 %X, i32 %Y) {
737 %Z = add i32 %X, %Y ; <i32> [#uses=1]
738 volatile store i32 %Y, i32* @G
739 volatile store i32 %Z, i32* @G
740 ret i32 %X
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000741}
742
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000743define i32 @test2(i32 %X) {
744 %Z = add i32 %X, 1 ; <i32> [#uses=1]
745 volatile store i32 %Z, i32* @G
746 ret i32 %X
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000747}
748
749//===---------------------------------------------------------------------===//
750
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000751Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
752a neg instead of a sub instruction. Consider:
753
754int test(char X) { return 7-X; }
755
756we currently produce:
757_test:
758 movl $7, %eax
759 movsbl 4(%esp), %ecx
760 subl %ecx, %eax
761 ret
762
763We would use one fewer register if codegen'd as:
764
765 movsbl 4(%esp), %eax
766 neg %eax
767 add $7, %eax
768 ret
769
770Note that this isn't beneficial if the load can be folded into the sub. In
771this case, we want a sub:
772
773int test(int X) { return 7-X; }
774_test:
775 movl $7, %eax
776 subl 4(%esp), %eax
777 ret
778
779//===---------------------------------------------------------------------===//
780
Chris Lattner32f65872007-08-20 02:14:33 +0000781Leaf functions that require one 4-byte spill slot have a prolog like this:
782
783_foo:
784 pushl %esi
785 subl $4, %esp
786...
787and an epilog like this:
788 addl $4, %esp
789 popl %esi
790 ret
791
792It would be smaller, and potentially faster, to push eax on entry and to
793pop into a dummy register instead of using addl/subl of esp. Just don't pop
794into any return registers :)
795
796//===---------------------------------------------------------------------===//
Chris Lattner44b03cb2007-08-23 15:22:07 +0000797
798The X86 backend should fold (branch (or (setcc, setcc))) into multiple
799branches. We generate really poor code for:
800
801double testf(double a) {
802 return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
803}
804
805For example, the entry BB is:
806
807_testf:
808 subl $20, %esp
809 pxor %xmm0, %xmm0
810 movsd 24(%esp), %xmm1
811 ucomisd %xmm0, %xmm1
812 setnp %al
813 sete %cl
814 testb %cl, %al
815 jne LBB1_5 # UnifiedReturnBlock
816LBB1_1: # cond_true
817
818
819it would be better to replace the last four instructions with:
820
821 jp LBB1_1
822 je LBB1_5
823LBB1_1:
824
825We also codegen the inner ?: into a diamond:
826
827 cvtss2sd LCPI1_0(%rip), %xmm2
828 cvtss2sd LCPI1_1(%rip), %xmm3
829 ucomisd %xmm1, %xmm0
830 ja LBB1_3 # cond_true
831LBB1_2: # cond_true
832 movapd %xmm3, %xmm2
833LBB1_3: # cond_true
834 movapd %xmm2, %xmm0
835 ret
836
837We should sink the load into xmm3 into the LBB1_2 block. This should
838be pretty easy, and will nuke all the copies.
839
840//===---------------------------------------------------------------------===//
Chris Lattner4084d492007-09-10 21:43:18 +0000841
842This:
843 #include <algorithm>
844 inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
845 { return std::make_pair(a + b, a + b < a); }
846 bool no_overflow(unsigned a, unsigned b)
847 { return !full_add(a, b).second; }
848
849Should compile to:
850
851
852 _Z11no_overflowjj:
853 addl %edi, %esi
854 setae %al
855 ret
856
Eli Friedman577c7492008-02-21 21:16:49 +0000857FIXME: That code looks wrong; bool return is normally defined as zext.
858
Chris Lattner4084d492007-09-10 21:43:18 +0000859on x86-64, not:
860
861__Z11no_overflowjj:
862 addl %edi, %esi
863 cmpl %edi, %esi
864 setae %al
865 movzbl %al, %eax
866 ret
867
868
869//===---------------------------------------------------------------------===//
Evan Cheng35127a62007-09-10 22:16:37 +0000870
871Re-materialize MOV32r0 etc. with xor instead of changing them to moves if the
872condition register is dead. xor reg reg is shorter than mov reg, #0.
Chris Lattnera487bf72007-09-26 06:29:31 +0000873
874//===---------------------------------------------------------------------===//
875
Bill Wendling7f436dd2007-10-02 20:42:59 +0000876The following code:
877
Bill Wendlingc2036e32007-10-02 20:54:32 +0000878bb114.preheader: ; preds = %cond_next94
879 %tmp231232 = sext i16 %tmp62 to i32 ; <i32> [#uses=1]
880 %tmp233 = sub i32 32, %tmp231232 ; <i32> [#uses=1]
881 %tmp245246 = sext i16 %tmp65 to i32 ; <i32> [#uses=1]
882 %tmp252253 = sext i16 %tmp68 to i32 ; <i32> [#uses=1]
883 %tmp254 = sub i32 32, %tmp252253 ; <i32> [#uses=1]
884 %tmp553554 = bitcast i16* %tmp37 to i8* ; <i8*> [#uses=2]
885 %tmp583584 = sext i16 %tmp98 to i32 ; <i32> [#uses=1]
886 %tmp585 = sub i32 32, %tmp583584 ; <i32> [#uses=1]
887 %tmp614615 = sext i16 %tmp101 to i32 ; <i32> [#uses=1]
888 %tmp621622 = sext i16 %tmp104 to i32 ; <i32> [#uses=1]
889 %tmp623 = sub i32 32, %tmp621622 ; <i32> [#uses=1]
890 br label %bb114
891
892produces:
893
Bill Wendling7f436dd2007-10-02 20:42:59 +0000894LBB3_5: # bb114.preheader
895 movswl -68(%ebp), %eax
896 movl $32, %ecx
897 movl %ecx, -80(%ebp)
898 subl %eax, -80(%ebp)
899 movswl -52(%ebp), %eax
900 movl %ecx, -84(%ebp)
901 subl %eax, -84(%ebp)
902 movswl -70(%ebp), %eax
903 movl %ecx, -88(%ebp)
904 subl %eax, -88(%ebp)
905 movswl -50(%ebp), %eax
906 subl %eax, %ecx
907 movl %ecx, -76(%ebp)
908 movswl -42(%ebp), %eax
909 movl %eax, -92(%ebp)
910 movswl -66(%ebp), %eax
911 movl %eax, -96(%ebp)
912 movw $0, -98(%ebp)
913
Chris Lattner792bae52007-10-03 03:40:24 +0000914This appears to be bad because the RA is not folding the store to the stack
915slot into the movl. The above instructions could be:
916 movl $32, -80(%ebp)
917...
918 movl $32, -84(%ebp)
919...
920This seems like a cross between remat and spill folding.
921
Bill Wendlingc2036e32007-10-02 20:54:32 +0000922This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't
Bill Wendling7f436dd2007-10-02 20:42:59 +0000923change, so we could simply subtract %eax from %ecx first and then use %ecx (or
924vice-versa).
925
926//===---------------------------------------------------------------------===//
927
Bill Wendling54c4f832007-10-02 21:49:31 +0000928This code:
929
930 %tmp659 = icmp slt i16 %tmp654, 0 ; <i1> [#uses=1]
931 br i1 %tmp659, label %cond_true662, label %cond_next715
932
933produces this:
934
935 testw %cx, %cx
936 movswl %cx, %esi
937 jns LBB4_109 # cond_next715
938
939Shark tells us that using %cx in the testw instruction is sub-optimal. It
940suggests using the 32-bit register (which is what ICC uses).
941
942//===---------------------------------------------------------------------===//
Chris Lattner802c62a2007-10-03 17:10:03 +0000943
Chris Lattnerae259992007-10-04 15:47:27 +0000944We compile this:
945
946void compare (long long foo) {
947 if (foo < 4294967297LL)
948 abort();
949}
950
951to:
952
Eli Friedman577c7492008-02-21 21:16:49 +0000953compare:
954 subl $4, %esp
955 cmpl $0, 8(%esp)
Chris Lattnerae259992007-10-04 15:47:27 +0000956 setne %al
957 movzbw %al, %ax
Eli Friedman577c7492008-02-21 21:16:49 +0000958 cmpl $1, 12(%esp)
Chris Lattnerae259992007-10-04 15:47:27 +0000959 setg %cl
960 movzbw %cl, %cx
961 cmove %ax, %cx
Eli Friedman577c7492008-02-21 21:16:49 +0000962 testb $1, %cl
963 jne .LBB1_2 # UnifiedReturnBlock
964.LBB1_1: # ifthen
965 call abort
966.LBB1_2: # UnifiedReturnBlock
967 addl $4, %esp
968 ret
Chris Lattnerae259992007-10-04 15:47:27 +0000969
970(also really horrible code on ppc). This is due to the expand code for 64-bit
971compares. GCC produces multiple branches, which is much nicer:
972
Eli Friedman577c7492008-02-21 21:16:49 +0000973compare:
974 subl $12, %esp
975 movl 20(%esp), %edx
976 movl 16(%esp), %eax
977 decl %edx
978 jle .L7
979.L5:
980 addl $12, %esp
981 ret
982 .p2align 4,,7
983.L7:
984 jl .L4
Chris Lattnerae259992007-10-04 15:47:27 +0000985 cmpl $0, %eax
Eli Friedman577c7492008-02-21 21:16:49 +0000986 .p2align 4,,8
987 ja .L5
988.L4:
989 .p2align 4,,9
990 call abort
Chris Lattnerae259992007-10-04 15:47:27 +0000991
992//===---------------------------------------------------------------------===//
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000993
Tail call optimization improvements: Tail call optimization currently
pushes all arguments on the top of the stack (their normal place for
non-tail call optimized calls) that source from the caller's arguments
or that source from a virtual register (also possibly sourcing from
the caller's arguments).
This is done to prevent overwriting of parameters (see example
below) that might be used later.
Arnold Schwaighofer373e8652007-10-12 21:30:57 +00001001
1002example:
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +00001003
1004int callee(int32, int64);
1005int caller(int32 arg1, int32 arg2) {
1006 int64 local = arg2 * 2;
1007 return callee(arg2, (int64)local);
1008}
1009
1010[arg1] [!arg2 no longer valid since we moved local onto it]
1011[arg2] -> [(int64)
1012[RETADDR] local ]
1013
Arnold Schwaighofer373e8652007-10-12 21:30:57 +00001014Moving arg1 onto the stack slot of callee function would overwrite
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +00001015arg2 of the caller.
1016
1017Possible optimizations:
1018
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +00001019
Arnold Schwaighofer373e8652007-10-12 21:30:57 +00001020 - Analyse the actual parameters of the callee to see which would
1021 overwrite a caller parameter which is used by the callee and only
1022 push them onto the top of the stack.
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +00001023
1024 int callee (int32 arg1, int32 arg2);
1025 int caller (int32 arg1, int32 arg2) {
1026 return callee(arg1,arg2);
1027 }
1028
Arnold Schwaighofer373e8652007-10-12 21:30:57 +00001029 Here we don't need to write any variables to the top of the stack
1030 since they don't overwrite each other.
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +00001031
1032 int callee (int32 arg1, int32 arg2);
1033 int caller (int32 arg1, int32 arg2) {
1034 return callee(arg2,arg1);
1035 }
1036
Arnold Schwaighofer373e8652007-10-12 21:30:57 +00001037 Here we need to push the arguments because they overwrite each
1038 other.
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +00001039
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +00001040//===---------------------------------------------------------------------===//
Evan Cheng7f1ad6a2007-10-28 04:01:09 +00001041
1042main ()
1043{
1044 int i = 0;
1045 unsigned long int z = 0;
1046
1047 do {
1048 z -= 0x00004000;
1049 i++;
1050 if (i > 0x00040000)
1051 abort ();
1052 } while (z > 0);
1053 exit (0);
1054}
1055
1056gcc compiles this to:
1057
1058_main:
1059 subl $28, %esp
1060 xorl %eax, %eax
1061 jmp L2
1062L3:
1063 cmpl $262144, %eax
1064 je L10
1065L2:
1066 addl $1, %eax
1067 cmpl $262145, %eax
1068 jne L3
1069 call L_abort$stub
1070L10:
1071 movl $0, (%esp)
1072 call L_exit$stub
1073
1074llvm:
1075
1076_main:
1077 subl $12, %esp
1078 movl $1, %eax
1079 movl $16384, %ecx
1080LBB1_1: # bb
1081 cmpl $262145, %eax
1082 jge LBB1_4 # cond_true
1083LBB1_2: # cond_next
1084 incl %eax
1085 addl $4294950912, %ecx
1086 cmpl $16384, %ecx
1087 jne LBB1_1 # bb
1088LBB1_3: # bb11
1089 xorl %eax, %eax
1090 addl $12, %esp
1091 ret
1092LBB1_4: # cond_true
1093 call L_abort$stub
1094
10951. LSR should rewrite the first cmp with induction variable %ecx.
10962. DAG combiner should fold
1097 leal 1(%eax), %edx
1098 cmpl $262145, %edx
1099 =>
1100 cmpl $262144, %eax
1101
1102//===---------------------------------------------------------------------===//
Chris Lattner358670b2007-11-24 06:13:33 +00001103
1104define i64 @test(double %X) {
1105 %Y = fptosi double %X to i64
1106 ret i64 %Y
1107}
1108
1109compiles to:
1110
1111_test:
1112 subl $20, %esp
1113 movsd 24(%esp), %xmm0
1114 movsd %xmm0, 8(%esp)
1115 fldl 8(%esp)
1116 fisttpll (%esp)
1117 movl 4(%esp), %edx
1118 movl (%esp), %eax
1119 addl $20, %esp
1120 #FP_REG_KILL
1121 ret
1122
1123This should just fldl directly from the input stack slot.
Chris Lattner10d54d12007-12-05 22:58:19 +00001124
1125//===---------------------------------------------------------------------===//
1126
1127This code:
1128int foo (int x) { return (x & 65535) | 255; }
1129
1130Should compile into:
1131
1132_foo:
1133 movzwl 4(%esp), %eax
Eli Friedman577c7492008-02-21 21:16:49 +00001134 orl $255, %eax
Chris Lattner10d54d12007-12-05 22:58:19 +00001135 ret
1136
1137instead of:
1138_foo:
1139 movl $255, %eax
1140 orl 4(%esp), %eax
1141 andl $65535, %eax
1142 ret
1143
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001144//===---------------------------------------------------------------------===//
1145
Chris Lattnereec7ac02008-02-21 06:51:29 +00001146We're codegen'ing multiply of long longs inefficiently:
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001147
Chris Lattnereec7ac02008-02-21 06:51:29 +00001148unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) {
1149 return arg1 * arg2;
1150}
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001151
Chris Lattnereec7ac02008-02-21 06:51:29 +00001152We compile to (fomit-frame-pointer):
1153
1154_LLM:
1155 pushl %esi
1156 movl 8(%esp), %ecx
1157 movl 16(%esp), %esi
1158 movl %esi, %eax
1159 mull %ecx
1160 imull 12(%esp), %esi
1161 addl %edx, %esi
1162 imull 20(%esp), %ecx
1163 movl %esi, %edx
1164 addl %ecx, %edx
1165 popl %esi
1166 ret
1167
1168This looks like a scheduling deficiency and lack of remat of the load from
1169the argument area. ICC apparently produces:
1170
1171 movl 8(%esp), %ecx
1172 imull 12(%esp), %ecx
1173 movl 16(%esp), %eax
1174 imull 4(%esp), %eax
1175 addl %eax, %ecx
1176 movl 4(%esp), %eax
1177 mull 12(%esp)
1178 addl %ecx, %edx
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001179 ret
1180
Chris Lattnereec7ac02008-02-21 06:51:29 +00001181Note that it remat'd loads from 4(esp) and 12(esp). See this GCC PR:
1182http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001183
1184//===---------------------------------------------------------------------===//
1185
Chris Lattner2b55ebd2007-12-24 19:27:46 +00001186We can fold a store into "zeroing a reg". Instead of:
1187
1188xorl %eax, %eax
1189movl %eax, 124(%esp)
1190
1191we should get:
1192
1193movl $0, 124(%esp)
1194
1195if the flags of the xor are dead.
1196
Chris Lattner459ff992008-01-11 18:00:13 +00001197Likewise, we isel "x<<1" into "add reg,reg". If reg is spilled, this should
1198be folded into: shl [mem], 1
1199
Chris Lattner2b55ebd2007-12-24 19:27:46 +00001200//===---------------------------------------------------------------------===//
Chris Lattner64400952007-12-28 21:50:40 +00001201
1202This testcase misses a read/modify/write opportunity (from PR1425):
1203
1204void vertical_decompose97iH1(int *b0, int *b1, int *b2, int width){
1205 int i;
1206 for(i=0; i<width; i++)
1207 b1[i] += (1*(b0[i] + b2[i])+0)>>0;
1208}
1209
1210We compile it down to:
1211
1212LBB1_2: # bb
1213 movl (%esi,%edi,4), %ebx
1214 addl (%ecx,%edi,4), %ebx
1215 addl (%edx,%edi,4), %ebx
1216 movl %ebx, (%ecx,%edi,4)
1217 incl %edi
1218 cmpl %eax, %edi
1219 jne LBB1_2 # bb
1220
1221the inner loop should add to the memory location (%ecx,%edi,4), saving
1222a mov. Something like:
1223
1224 movl (%esi,%edi,4), %ebx
1225 addl (%edx,%edi,4), %ebx
1226 addl %ebx, (%ecx,%edi,4)
1227
Chris Lattnerbde73102007-12-29 05:51:58 +00001228Here is another interesting example:
1229
1230void vertical_compose97iH1(int *b0, int *b1, int *b2, int width){
1231 int i;
1232 for(i=0; i<width; i++)
1233 b1[i] -= (1*(b0[i] + b2[i])+0)>>0;
1234}
1235
1236We miss the r/m/w opportunity here by using 2 subs instead of an add+sub[mem]:
1237
1238LBB9_2: # bb
1239 movl (%ecx,%edi,4), %ebx
1240 subl (%esi,%edi,4), %ebx
1241 subl (%edx,%edi,4), %ebx
1242 movl %ebx, (%ecx,%edi,4)
1243 incl %edi
1244 cmpl %eax, %edi
1245 jne LBB9_2 # bb
1246
Additionally, LSR should rewrite the exit condition of these loops to use
a stride-4 IV, which would allow all the scales in the loop to go away.
This would result in smaller code and more efficient microops.
1250
1251//===---------------------------------------------------------------------===//
Chris Lattner0362a362008-01-07 21:59:58 +00001252
1253In SSE mode, we turn abs and neg into a load from the constant pool plus a xor
1254or and instruction, for example:
1255
Chris Lattnerb4cbb682008-01-09 00:37:18 +00001256 xorpd LCPI1_0, %xmm2
Chris Lattner0362a362008-01-07 21:59:58 +00001257
1258However, if xmm2 gets spilled, we end up with really ugly code like this:
1259
Chris Lattnerb4cbb682008-01-09 00:37:18 +00001260 movsd (%esp), %xmm0
1261 xorpd LCPI1_0, %xmm0
1262 movsd %xmm0, (%esp)
Chris Lattner0362a362008-01-07 21:59:58 +00001263
1264Since we 'know' that this is a 'neg', we can actually "fold" the spill into
1265the neg/abs instruction, turning it into an *integer* operation, like this:
1266
1267 xorl 2147483648, [mem+4] ## 2147483648 = (1 << 31)
1268
1269you could also use xorb, but xorl is less likely to lead to a partial register
Chris Lattnerb4cbb682008-01-09 00:37:18 +00001270stall. Here is a contrived testcase:
1271
1272double a, b, c;
1273void test(double *P) {
1274 double X = *P;
1275 a = X;
1276 bar();
1277 X = -X;
1278 b = X;
1279 bar();
1280 c = X;
1281}
Chris Lattner0362a362008-01-07 21:59:58 +00001282
1283//===---------------------------------------------------------------------===//
Andrew Lenharth785610d2008-02-16 01:24:58 +00001284
1285handling llvm.memory.barrier on pre SSE2 cpus
1286
1287should generate:
1288lock ; mov %esp, %esp
1289
1290//===---------------------------------------------------------------------===//
Chris Lattner7644ff32008-02-17 19:43:57 +00001291
The code generated on x86 for checking for signed overflow on a multiply, done
the obvious way, is much longer than it needs to be.

int x(int a, int b) {
  long long prod = (long long)a*b;
  return prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1);
}
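
The range check is exactly "the 32x32 multiply does not fit in 32 bits", which
is what the OF flag of a one-operand imull reports, so a sketch of the desired
output (not current output) is something like:

_x:
	movl	4(%esp), %eax
	imull	8(%esp)              # EDX:EAX = a*b; OF set iff the result doesn't fit in 32 bits
	seto	%al
	movzbl	%al, %eax
	ret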
1299
1300See PR2053 for more details.
1301
1302//===---------------------------------------------------------------------===//
Chris Lattner83f22362008-02-18 18:30:13 +00001303
We should investigate using cdq/cltd (effect: edx = sar eax, 31)
more aggressively; it should cost the same as a move+shift on any modern
processor, but it's a lot shorter. Downside is that it puts more
pressure on register allocation because it has fixed operands.
1308
1309Example:
1310int abs(int x) {return x < 0 ? -x : x;}
1311
1312gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.:
1313abs:
1314 movl 4(%esp), %eax
1315 cltd
1316 xorl %edx, %eax
1317 subl %edx, %eax
1318 ret
1319
1320//===---------------------------------------------------------------------===//
1321
1322Consider:
Chris Lattner83f22362008-02-18 18:30:13 +00001323int test(unsigned long a, unsigned long b) { return -(a < b); }
1324
1325We currently compile this to:
1326
1327define i32 @test(i32 %a, i32 %b) nounwind {
1328 %tmp3 = icmp ult i32 %a, %b ; <i1> [#uses=1]
1329 %tmp34 = zext i1 %tmp3 to i32 ; <i32> [#uses=1]
1330 %tmp5 = sub i32 0, %tmp34 ; <i32> [#uses=1]
1331 ret i32 %tmp5
1332}
1333
1334and
1335
1336_test:
1337 movl 8(%esp), %eax
1338 cmpl %eax, 4(%esp)
1339 setb %al
1340 movzbl %al, %eax
1341 negl %eax
1342 ret
1343
1344Several deficiencies here. First, we should instcombine zext+neg into sext:
1345
1346define i32 @test2(i32 %a, i32 %b) nounwind {
1347 %tmp3 = icmp ult i32 %a, %b ; <i1> [#uses=1]
1348 %tmp34 = sext i1 %tmp3 to i32 ; <i32> [#uses=1]
1349 ret i32 %tmp34
1350}
1351
1352However, before we can do that, we have to fix the bad codegen that we get for
1353sext from bool:
1354
1355_test2:
1356 movl 8(%esp), %eax
1357 cmpl %eax, 4(%esp)
1358 setb %al
1359 movzbl %al, %eax
1360 shll $31, %eax
1361 sarl $31, %eax
1362 ret
1363
1364This code should be at least as good as the code above. Once this is fixed, we
1365can optimize this specific case even more to:
1366
1367 movl 8(%esp), %eax
1368 xorl %ecx, %ecx
1369 cmpl %eax, 4(%esp)
1370 sbbl %ecx, %ecx
1371
1372//===---------------------------------------------------------------------===//
Eli Friedman1aa1f2c2008-02-28 00:21:43 +00001373
1374Take the following code (from
1375http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541):
1376
1377extern unsigned char first_one[65536];
1378int FirstOnet(unsigned long long arg1)
1379{
1380 if (arg1 >> 48)
1381 return (first_one[arg1 >> 48]);
1382 return 0;
1383}
1384
1385
1386The following code is currently generated:
1387FirstOnet:
1388 movl 8(%esp), %eax
1389 cmpl $65536, %eax
1390 movl 4(%esp), %ecx
1391 jb .LBB1_2 # UnifiedReturnBlock
1392.LBB1_1: # ifthen
1393 shrl $16, %eax
1394 movzbl first_one(%eax), %eax
1395 ret
1396.LBB1_2: # UnifiedReturnBlock
1397 xorl %eax, %eax
1398 ret
1399
1400There are a few possible improvements here:
14011. We should be able to eliminate the dead load into %ecx
14022. We could change the "movl 8(%esp), %eax" into
1403 "movzwl 10(%esp), %eax"; this lets us change the cmpl
1404 into a testl, which is shorter, and eliminate the shift.
1405
1406We could also in theory eliminate the branch by using a conditional
1407for the address of the load, but that seems unlikely to be worthwhile
1408in general.
1409
1410//===---------------------------------------------------------------------===//
1411
Chris Lattner44a98ac2008-02-28 04:52:59 +00001412We compile this function:
1413
1414define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext %d) nounwind {
1415entry:
1416 %tmp2 = icmp eq i8 %d, 0 ; <i1> [#uses=1]
1417 br i1 %tmp2, label %bb7, label %bb
1418
1419bb: ; preds = %entry
1420 %tmp6 = add i32 %b, %a ; <i32> [#uses=1]
1421 ret i32 %tmp6
1422
1423bb7: ; preds = %entry
1424 %tmp10 = sub i32 %a, %c ; <i32> [#uses=1]
1425 ret i32 %tmp10
1426}
1427
1428to:
1429
1430_foo:
1431 cmpb $0, 16(%esp)
1432 movl 12(%esp), %ecx
1433 movl 8(%esp), %eax
1434 movl 4(%esp), %edx
1435 je LBB1_2 # bb7
1436LBB1_1: # bb
1437 addl %edx, %eax
1438 ret
1439LBB1_2: # bb7
1440 movl %edx, %eax
1441 subl %ecx, %eax
1442 ret
1443
Gabor Greif02661592008-03-06 10:51:21 +00001444The coalescer could coalesce "edx" with "eax" to avoid the movl in LBB1_2
Chris Lattner44a98ac2008-02-28 04:52:59 +00001445if it commuted the addl in LBB1_1.
1446
1447//===---------------------------------------------------------------------===//
Evan Cheng921dcba2008-03-28 07:07:06 +00001448
1449See rdar://4653682.
1450
1451From flops:
1452
1453LBB1_15: # bb310
1454 cvtss2sd LCPI1_0, %xmm1
1455 addsd %xmm1, %xmm0
1456 movsd 176(%esp), %xmm2
1457 mulsd %xmm0, %xmm2
1458 movapd %xmm2, %xmm3
1459 mulsd %xmm3, %xmm3
1460 movapd %xmm3, %xmm4
1461 mulsd LCPI1_23, %xmm4
1462 addsd LCPI1_24, %xmm4
1463 mulsd %xmm3, %xmm4
1464 addsd LCPI1_25, %xmm4
1465 mulsd %xmm3, %xmm4
1466 addsd LCPI1_26, %xmm4
1467 mulsd %xmm3, %xmm4
1468 addsd LCPI1_27, %xmm4
1469 mulsd %xmm3, %xmm4
1470 addsd LCPI1_28, %xmm4
1471 mulsd %xmm3, %xmm4
1472 addsd %xmm1, %xmm4
1473 mulsd %xmm2, %xmm4
1474 movsd 152(%esp), %xmm1
1475 addsd %xmm4, %xmm1
1476 movsd %xmm1, 152(%esp)
1477 incl %eax
1478 cmpl %eax, %esi
1479 jge LBB1_15 # bb310
1480LBB1_16: # bb358.loopexit
1481 movsd 152(%esp), %xmm0
1482 addsd %xmm0, %xmm0
1483 addsd LCPI1_22, %xmm0
1484 movsd %xmm0, 152(%esp)
1485
Rather than spilling the result of the last addsd in the loop, we should
insert a copy to split the interval (one for the duration of the loop, one
extending to the fall through). The register pressure in the loop isn't high
enough to warrant the spill.
1490
1491Also check why xmm7 is not used at all in the function.
Chris Lattner16e5c782008-04-21 04:46:30 +00001492
1493//===---------------------------------------------------------------------===//
1494
1495Legalize loses track of the fact that bools are always zero extended when in
1496memory. This causes us to compile abort_gzip (from 164.gzip) from:
1497
1498target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
1499target triple = "i386-apple-darwin8"
1500@in_exit.4870.b = internal global i1 false ; <i1*> [#uses=2]
1501define fastcc void @abort_gzip() noreturn nounwind {
1502entry:
1503 %tmp.b.i = load i1* @in_exit.4870.b ; <i1> [#uses=1]
1504 br i1 %tmp.b.i, label %bb.i, label %bb4.i
1505bb.i: ; preds = %entry
1506 tail call void @exit( i32 1 ) noreturn nounwind
1507 unreachable
1508bb4.i: ; preds = %entry
1509 store i1 true, i1* @in_exit.4870.b
1510 tail call void @exit( i32 1 ) noreturn nounwind
1511 unreachable
1512}
1513declare void @exit(i32) noreturn nounwind
1514
1515into:
1516
1517_abort_gzip:
1518 subl $12, %esp
1519 movb _in_exit.4870.b, %al
1520 notb %al
1521 testb $1, %al
1522 jne LBB1_2 ## bb4.i
1523LBB1_1: ## bb.i
1524 ...
1525
1526//===---------------------------------------------------------------------===//
Chris Lattner7cb1d332008-05-05 23:19:45 +00001527
1528We compile:
1529
1530int test(int x, int y) {
1531 return x-y-1;
1532}
1533
1534into (-m64):
1535
1536_test:
1537 decl %edi
1538 movl %edi, %eax
1539 subl %esi, %eax
1540 ret
1541
it would be better to codegen as: x+~y (notl+addl)
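
(In two's complement ~y == -y - 1, so x - y - 1 == x + ~y.) A sketch of the
sequence meant, using lea to fold the move and the add:

_test:
	notl	%esi
	leal	(%rsi,%rdi), %eax    # x + ~y == x - y - 1
	ret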
Edwin Törökfa9d5e22008-10-24 19:23:07 +00001543
1544//===---------------------------------------------------------------------===//
1545
1546This code:
1547
1548int foo(const char *str,...)
1549{
1550 __builtin_va_list a; int x;
1551 __builtin_va_start(a,str); x = __builtin_va_arg(a,int); __builtin_va_end(a);
1552 return x;
1553}
1554
1555gets compiled into this on x86-64:
1556 subq $200, %rsp
1557 movaps %xmm7, 160(%rsp)
1558 movaps %xmm6, 144(%rsp)
1559 movaps %xmm5, 128(%rsp)
1560 movaps %xmm4, 112(%rsp)
1561 movaps %xmm3, 96(%rsp)
1562 movaps %xmm2, 80(%rsp)
1563 movaps %xmm1, 64(%rsp)
1564 movaps %xmm0, 48(%rsp)
1565 movq %r9, 40(%rsp)
1566 movq %r8, 32(%rsp)
1567 movq %rcx, 24(%rsp)
1568 movq %rdx, 16(%rsp)
1569 movq %rsi, 8(%rsp)
1570 leaq (%rsp), %rax
1571 movq %rax, 192(%rsp)
1572 leaq 208(%rsp), %rax
1573 movq %rax, 184(%rsp)
1574 movl $48, 180(%rsp)
1575 movl $8, 176(%rsp)
1576 movl 176(%rsp), %eax
1577 cmpl $47, %eax
1578 jbe .LBB1_3 # bb
1579.LBB1_1: # bb3
1580 movq 184(%rsp), %rcx
1581 leaq 8(%rcx), %rax
1582 movq %rax, 184(%rsp)
1583.LBB1_2: # bb4
1584 movl (%rcx), %eax
1585 addq $200, %rsp
1586 ret
1587.LBB1_3: # bb
1588 movl %eax, %ecx
1589 addl $8, %eax
1590 addq 192(%rsp), %rcx
1591 movl %eax, 176(%rsp)
1592 jmp .LBB1_2 # bb4
1593
1594gcc 4.3 generates:
1595 subq $96, %rsp
1596.LCFI0:
1597 leaq 104(%rsp), %rax
1598 movq %rsi, -80(%rsp)
1599 movl $8, -120(%rsp)
1600 movq %rax, -112(%rsp)
1601 leaq -88(%rsp), %rax
1602 movq %rax, -104(%rsp)
1603 movl $8, %eax
1604 cmpl $48, %eax
1605 jb .L6
1606 movq -112(%rsp), %rdx
1607 movl (%rdx), %eax
1608 addq $96, %rsp
1609 ret
1610 .p2align 4,,10
1611 .p2align 3
1612.L6:
1613 mov %eax, %edx
1614 addq -104(%rsp), %rdx
1615 addl $8, %eax
1616 movl %eax, -120(%rsp)
1617 movl (%rdx), %eax
1618 addq $96, %rsp
1619 ret
1620
1621and it gets compiled into this on x86:
1622 pushl %ebp
1623 movl %esp, %ebp
1624 subl $4, %esp
1625 leal 12(%ebp), %eax
1626 movl %eax, -4(%ebp)
1627 leal 16(%ebp), %eax
1628 movl %eax, -4(%ebp)
1629 movl 12(%ebp), %eax
1630 addl $4, %esp
1631 popl %ebp
1632 ret
1633
1634gcc 4.3 generates:
1635 pushl %ebp
1636 movl %esp, %ebp
1637 movl 12(%ebp), %eax
1638 popl %ebp
1639 ret
Evan Chengbf97bec2008-11-11 17:35:52 +00001640
1641//===---------------------------------------------------------------------===//
1642
1643Teach tblgen not to check bitconvert source type in some cases. This allows us
1644to consolidate the following patterns in X86InstrMMX.td:
1645
1646def : Pat<(v2i32 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1647 (iPTR 0))))),
1648 (v2i32 (MMX_MOVDQ2Qrr VR128:$src))>;
1649def : Pat<(v4i16 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1650 (iPTR 0))))),
1651 (v4i16 (MMX_MOVDQ2Qrr VR128:$src))>;
1652def : Pat<(v8i8 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1653 (iPTR 0))))),
1654 (v8i8 (MMX_MOVDQ2Qrr VR128:$src))>;
1655
1656There are other cases in various td files.
Eli Friedman9ab1db02008-11-30 07:52:27 +00001657
1658//===---------------------------------------------------------------------===//
1659
1660Take something like the following on x86-32:
1661unsigned a(unsigned long long x, unsigned y) {return x % y;}
1662
1663We currently generate a libcall, but we really shouldn't: the expansion is
1664shorter and likely faster than the libcall. The expected code is something
1665like the following:
1666
1667 movl 12(%ebp), %eax
1668 movl 16(%ebp), %ecx
1669 xorl %edx, %edx
1670 divl %ecx
1671 movl 8(%ebp), %eax
1672 divl %ecx
1673 movl %edx, %eax
1674 ret
1675
1676A similar code sequence works for division.
1677
1678//===---------------------------------------------------------------------===//
Chris Lattnerbfccda62008-12-06 22:49:05 +00001679
1680These should compile to the same code, but the later codegen's to useless
1681instructions on X86. This may be a trivial dag combine (GCC PR7061):
1682
1683struct s1 { unsigned char a, b; };
1684unsigned long f1(struct s1 x) {
1685 return x.a + x.b;
1686}
1687struct s2 { unsigned a: 8, b: 8; };
1688unsigned long f2(struct s2 x) {
1689 return x.a + x.b;
1690}
1691
1692//===---------------------------------------------------------------------===//
1693
Chris Lattner9f34dc62009-02-08 20:44:19 +00001694We currently compile this:
1695
1696define i32 @func1(i32 %v1, i32 %v2) nounwind {
1697entry:
1698 %t = call {i32, i1} @llvm.sadd.with.overflow.i32(i32 %v1, i32 %v2)
1699 %sum = extractvalue {i32, i1} %t, 0
1700 %obit = extractvalue {i32, i1} %t, 1
1701 br i1 %obit, label %overflow, label %normal
1702normal:
1703 ret i32 %sum
1704overflow:
1705 call void @llvm.trap()
1706 unreachable
1707}
1708declare {i32, i1} @llvm.sadd.with.overflow.i32(i32, i32)
1709declare void @llvm.trap()
1710
1711to:
1712
1713_func1:
1714 movl 4(%esp), %eax
1715 addl 8(%esp), %eax
1716 jo LBB1_2 ## overflow
1717LBB1_1: ## normal
1718 ret
1719LBB1_2: ## overflow
1720 ud2
1721
1722it would be nice to produce "into" someday.
1723
1724//===---------------------------------------------------------------------===//
Chris Lattner09c650b2009-02-17 01:16:14 +00001725
1726This code:
1727
1728void vec_mpys1(int y[], const int x[], int scaler) {
1729int i;
1730for (i = 0; i < 150; i++)
1731 y[i] += (((long long)scaler * (long long)x[i]) >> 31);
1732}
1733
1734Compiles to this loop with GCC 3.x:
1735
1736.L5:
1737 movl %ebx, %eax
1738 imull (%edi,%ecx,4)
1739 shrdl $31, %edx, %eax
1740 addl %eax, (%esi,%ecx,4)
1741 incl %ecx
1742 cmpl $149, %ecx
1743 jle .L5
1744
1745llvm-gcc compiles it to the much uglier:
1746
1747LBB1_1: ## bb1
1748 movl 24(%esp), %eax
1749 movl (%eax,%edi,4), %ebx
1750 movl %ebx, %ebp
1751 imull %esi, %ebp
1752 movl %ebx, %eax
1753 mull %ecx
1754 addl %ebp, %edx
1755 sarl $31, %ebx
1756 imull %ecx, %ebx
1757 addl %edx, %ebx
1758 shldl $1, %eax, %ebx
1759 movl 20(%esp), %eax
1760 addl %ebx, (%eax,%edi,4)
1761 incl %edi
1762 cmpl $150, %edi
1763 jne LBB1_1 ## bb1
1764
1765//===---------------------------------------------------------------------===//
Chris Lattner8eca7c72009-03-08 01:54:43 +00001766
Test instructions can be eliminated by using EFLAGS values from arithmetic
instructions. This is currently not done for mul, and, or, xor, neg, shl,
sra, srl, shld, shrd, atomic ops, and others. It is also currently not done
for read-modify-write instructions. It is also currently not done if the
OF or CF flags are needed.

The shift operators have the complication that when the shift count is
zero, EFLAGS is not set, so they can only subsume a test instruction if
the shift count is known to be non-zero. Also, using the EFLAGS value
from a shift is apparently very slow on some x86 implementations.

In read-modify-write instructions, the root node in the isel match is
the store, and isel has no way for the use of the EFLAGS result of the
arithmetic to be remapped to the new node.

Add and subtract instructions set OF on signed overflow and CF on unsigned
overflow, while test instructions always clear OF and CF. In order to
replace a test with an add or subtract in a situation where OF or CF is
needed, codegen must be able to prove that the operation cannot see
signed or unsigned overflow, respectively.
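
A minimal example of the 'and' case (hypothetical source; x86-64 argument
registers assumed). The second instruction is the kind of test that could be
eliminated:

int f(int x, int y) { return (x & y) ? 1 : 0; }

	andl	%esi, %edi           # already sets ZF according to x & y
	testl	%edi, %edi           # redundant: the ZF from the andl could be reused
	setne	%al
	movzbl	%al, %eax
	ret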
1787
Dan Gohman5edb7382009-03-09 23:47:02 +00001788//===---------------------------------------------------------------------===//
1789
Chris Lattnerfe8b5592009-03-08 03:04:26 +00001790memcpy/memmove do not lower to SSE copies when possible. A silly example is:
1791define <16 x float> @foo(<16 x float> %A) nounwind {
1792 %tmp = alloca <16 x float>, align 16
1793 %tmp2 = alloca <16 x float>, align 16
1794 store <16 x float> %A, <16 x float>* %tmp
1795 %s = bitcast <16 x float>* %tmp to i8*
1796 %s2 = bitcast <16 x float>* %tmp2 to i8*
1797 call void @llvm.memcpy.i64(i8* %s, i8* %s2, i64 64, i32 16)
1798 %R = load <16 x float>* %tmp2
1799 ret <16 x float> %R
1800}
1801
1802declare void @llvm.memcpy.i64(i8* nocapture, i8* nocapture, i64, i32) nounwind
1803
1804which compiles to:
1805
1806_foo:
1807 subl $140, %esp
1808 movaps %xmm3, 112(%esp)
1809 movaps %xmm2, 96(%esp)
1810 movaps %xmm1, 80(%esp)
1811 movaps %xmm0, 64(%esp)
1812 movl 60(%esp), %eax
1813 movl %eax, 124(%esp)
1814 movl 56(%esp), %eax
1815 movl %eax, 120(%esp)
1816 movl 52(%esp), %eax
1817 <many many more 32-bit copies>
1818 movaps (%esp), %xmm0
1819 movaps 16(%esp), %xmm1
1820 movaps 32(%esp), %xmm2
1821 movaps 48(%esp), %xmm3
1822 addl $140, %esp
1823 ret
1824
1825On Nehalem, it may even be cheaper to just use movups when unaligned than to
1826fall back to lower-granularity chunks.
1827
1828//===---------------------------------------------------------------------===//
Chris Lattnere00a4e02009-05-25 20:28:19 +00001829
1830Implement processor-specific optimizations for parity with GCC on these
1831processors. GCC does two optimizations:
1832
1. ix86_pad_returns inserts a noop before ret instructions if immediately
   preceded by a conditional branch or is the target of a jump.
18352. ix86_avoid_jump_misspredicts inserts noops in cases where a 16-byte block of
1836 code contains more than 3 branches.
1837
1838The first one is done for all AMDs, Core2, and "Generic"
1839The second one is done for: Atom, Pentium Pro, all AMDs, Pentium 4, Nocona,
1840 Core 2, and "Generic"
1841
1842//===---------------------------------------------------------------------===//
Eli Friedmandd959242009-06-11 23:07:04 +00001843
1844Testcase:
1845int a(int x) { return (x & 127) > 31; }
1846
1847Current output:
1848 movl 4(%esp), %eax
1849 andl $127, %eax
1850 cmpl $31, %eax
1851 seta %al
1852 movzbl %al, %eax
1853 ret
1854
1855Ideal output:
1856 xorl %eax, %eax
1857 testl $96, 4(%esp)
1858 setne %al
1859 ret
1860
We could do this transformation in instcombine, but it's only clearly
beneficial on platforms with a test instruction.
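
Spelled out in C (a2 is a hypothetical name): (x & 127) > 31 holds exactly when
bit 5 or bit 6 survives the mask, so the suggested transform is:

int a(int x)  { return (x & 127) > 31; }
int a2(int x) { return (x & 96) != 0; }   // 96 == 32 | 64; equivalent for all x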
1863
1864//===---------------------------------------------------------------------===//

Testcase:
1866int x(int a) { return (a&0xf0)>>4; }
1867
1868Current output:
1869 movl 4(%esp), %eax
1870 shrl $4, %eax
1871 andl $15, %eax
1872 ret
1873
1874Ideal output:
1875 movzbl 4(%esp), %eax
1876 shrl $4, %eax
1877 ret
1878
1879//===---------------------------------------------------------------------===//
1880
1881Testcase:
1882int x(int a) { return (a & 0x80) ? 0x100 : 0; }
1883
1884Current output:
1885 testl $128, 4(%esp)
1886 setne %al
1887 movzbl %al, %eax
1888 shll $8, %eax
1889 ret
1890
1891Ideal output:
1892 movl 4(%esp), %eax
1893 addl %eax, %eax
1894 andl $256, %eax
1895 ret
1896
1897We generally want to fold shifted tests of a single bit into a shift+and on x86.
1898
1899//===---------------------------------------------------------------------===//