//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

We should add support for the "movbe" instruction, which does a byte-swapping
copy (3-addr bswap + memory support?)  This is available on Atom processors.
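
A minimal C testcase (a sketch, not part of the original note) that such
support could target, assuming the GCC/Clang builtin __builtin_bswap32:

/* Byte-swapping store; with movbe support this could be one instruction. */
void store_be32(unsigned *p, unsigned x) {
  *p = __builtin_bswap32(x);
}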

//===---------------------------------------------------------------------===//

CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move.  The X86
backend knows how to three-addressify this shift, but it appears the register
allocator isn't even asking it to do so in this case.  We should investigate
why this isn't happening; it could have a significant impact on other important
cases for X86 as well.

//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
  return X/Y;
}

This can be done trivially with a custom legalizer.  What about overflow
though?  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224

//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.
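
For reference, C sources for those (a sketch, not from the original note):

unsigned long long all_ones_lshr(int x) { return ~0ULL >> x; }
unsigned long long all_ones_shl(int x)  { return ~0ULL << x; }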

One better solution for 1LL << x is:
        xorl %eax, %eax
        xorl %edx, %edx
        testb $32, %cl
        sete %al
        setne %dl
        sall %cl, %eax
        sall %cl, %edx

But that requires good 8-bit subreg support.

Also, this might be better.  It's an extra shift, but it's one instruction
shorter, and doesn't stress 8-bit subreg support.
(From http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html,
but without the unnecessary and.)
        movl %ecx, %eax
        shrl $5, %eax
        movl %eax, %edx
        xorl $1, %edx
        sall %cl, %eax
        sall %cl, %edx

64-bit shifts (in general) expand to really bad code.  Instead of using
cmovs, we should expand to a conditional branch like GCC produces.

//===---------------------------------------------------------------------===//

Compile this:
_Bool f(_Bool a) { return a!=1; }

into:
        movzbl %dil, %eax
        xorl $1, %eax
        ret

(Although note that this isn't a legal way to express the code that llvm-gcc
currently generates for that function.)

//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure.  E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?
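
For illustration (a hypothetical example, not from the original note), i16
arithmetic like the following is where the question arises:

unsigned short add16(unsigned short a, unsigned short b) {
  return a + b;   /* 16-bit ops write only %ax; the upper half of %eax
                     is left unmodified, which can stall */
}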

//===---------------------------------------------------------------------===//

Leave any_extend as pseudo instruction and hint to register
allocator.  Delay codegen until post register allocation.
Note: any_extend is now turned into an INSERT_SUBREG.  We still need to teach
the coalescer how to deal with it though.

//===---------------------------------------------------------------------===//

It appears icc uses push for parameter passing.  Need to investigate.

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor.  They are slower on the P4 due to only updating some processor
flags.

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare.  The
pattern is written as (cmp reg, (load p)).  Because the compare isn't
commutative, it is not matched with the load on both sides.  The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
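
A C testcase of the sort this describes (a sketch, not from the original note):

/* The load of *p naturally lands on the LHS of the compare; folding it on
   the RHS would require swapping the operands and inverting the condition. */
int cmp_load_lhs(int *p, int x) {
  return *p < x;
}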

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setl %al
        movzbl %al, %eax
        ret

on some processors (which ones?), it is more efficient to do this:

_test:
        movl 8(%esp), %ebx
        xor %eax, %eax
        cmpl %ebx, 4(%esp)
        setl %al
        ret

Doing this correctly is tricky though, as the xor clobbers the flags.
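
For reference, the C source that produces this pattern is presumably something
like (a guess, not from the original note):

int test(int a, int b) {
  return a < b;   /* materializes the i1 compare result as an int */
}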

//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important.  e.g., for:

void setbit(int *target, int bit) {
  *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
  *target &= ~(1 << bit);
}

//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

        movl $16843009, 4(%edx)
        movl $16843009, (%edx)
        movw $257, 8(%edx)

It might be better to generate

        movl $16843009, %eax
        movl %eax, 4(%edx)
        movl %eax, (%edx)
        movw %ax, 8(%edx)

when we can spare a register.  It reduces code size.
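
The C source in question is presumably along these lines (a sketch, not from
the original note):

#include <string.h>

void fill(char *p) {
  memset(p, 1, 10);   /* 10 bytes of 0x01: two 4-byte stores + one 2-byte store */
}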

//===---------------------------------------------------------------------===//

Evaluate what the best way to codegen sdiv X, (2^C) is.  For X/8, we currently
get this:

define i32 @test1(i32 %X) {
  %Y = sdiv i32 %X, 8
  ret i32 %Y
}

_test1:
        movl 4(%esp), %eax
        movl %eax, %ecx
        sarl $31, %ecx
        shrl $29, %ecx
        addl %ecx, %eax
        sarl $3, %eax
        ret

GCC knows several different ways to codegen it, one of which is this:

_test1:
        movl 4(%esp), %eax
        cmpl $-1, %eax
        leal 7(%eax), %ecx
        cmovle %ecx, %eax
        sarl $3, %eax
        ret

which is probably slower, but it's interesting at least :)

//===---------------------------------------------------------------------===//

We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl.  We should leave these as libcalls for everything over a much lower
threshold, since libc is hand-tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc.).
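
For instance (a sketch, not from the original note), a copy like this currently
becomes inline rep/movsl rather than a call to libc's memcpy:

#include <string.h>

void big_copy(void *dst, const void *src) {
  memcpy(dst, src, 1 << 20);   /* 1MB copy; buffers assumed large enough */
}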

//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
 x * copysign(1.0, y) * copysign(1.0, z)
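
As a C testcase (a sketch, not from the original note):

#include <math.h>

double flip_sign(double x, double y, double z) {
  return x * copysign(1.0, y) * copysign(1.0, z);
}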

//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.
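
That is (a sketch, not from the original note), for code such as:

#include <math.h>

double copy_sign_mem(double x, double *y) {
  return copysign(x, *y);   /* only the sign bit of *y is needed */
}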

//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.

//===---------------------------------------------------------------------===//

Teach the coalescer to coalesce vregs of different register classes. e.g. FR32 /
FR64 to VR128.

//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))
    return 3;
  else
    return 4;
}

_test:
        movl 8(%esp), %eax
        movaps (%eax), %xmm0
        movl 4(%esp), %eax
        movaps (%eax), %xmm1
        comiss %xmm0, %xmm1
        setae %al
        movzbl %al, %ecx
        movl $3, %eax
        movl $4, %edx
        cmpl $0, %ecx
        cmove %edx, %eax
        ret

Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae.
There are a number of issues.  1) We are introducing a setcc between the
result of the intrinsic call and select.  2) The intrinsic is expected to
produce an i32 value so an any_extend (which becomes a zero_extend) is added.

We probably need some kind of target DAG combine hook to fix this.

//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case we do worse on PPC.

//===---------------------------------------------------------------------===//

For this:

int test(int a)
{
  return a * 3;
}

We currently emit:
        imull $3, 4(%esp), %eax

Perhaps this is what we really should generate.  Is imull three or four
cycles?  Note: ICC generates this:
        movl 4(%esp), %eax
        leal (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity.  The former is
more "complex" because it folds a load so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority?  We
should always try to match LEA first since the LEA matching code does some
estimate to determine whether the match is profitable.

However, if we care more about code size, then imull is better.  It's two bytes
shorter than movl + leal.

On a Pentium M, both variants have the same characteristics with regard
to throughput; however, the multiplication has a latency of four cycles, as
opposed to two cycles for the movl+lea variant.

//===---------------------------------------------------------------------===//

__builtin_ffs codegen is messy.

int ffs_(unsigned X) { return __builtin_ffs(X); }

llvm produces:
ffs_:
        movl 4(%esp), %ecx
        bsfl %ecx, %eax
        movl $32, %edx
        cmove %edx, %eax
        incl %eax
        xorl %edx, %edx
        testl %ecx, %ecx
        cmove %edx, %eax
        ret

vs gcc:

_ffs_:
        movl $-1, %edx
        bsfl 4(%esp), %eax
        cmove %edx, %eax
        addl $1, %eax
        ret

Another example of __builtin_ffs (use predsimplify to eliminate a select):

int foo (unsigned long j) {
  if (j)
    return __builtin_ffs (j) - 1;
  else
    return 0;
}

//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.

//===---------------------------------------------------------------------===//

define i32 @foo(i32* %a, i32 %t) {
entry:
  br label %cond_true

cond_true:              ; preds = %cond_true, %entry
  %x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ]        ; <i32> [#uses=3]
  %t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ]  ; <i32> [#uses=1]
  %tmp2 = getelementptr i32* %a, i32 %x.0.0     ; <i32*> [#uses=1]
  %tmp3 = load i32* %tmp2                       ; <i32> [#uses=1]
  %tmp5 = add i32 %t_addr.0.0, %x.0.0           ; <i32> [#uses=1]
  %tmp7 = add i32 %tmp5, %tmp3                  ; <i32> [#uses=2]
  %tmp9 = add i32 %x.0.0, 1                     ; <i32> [#uses=2]
  %tmp = icmp sgt i32 %tmp9, 39                 ; <i1> [#uses=1]
  br i1 %tmp, label %bb12, label %cond_true

bb12:           ; preds = %cond_true
  ret i32 %tmp7
}
is pessimized by -loop-reduce and -indvars

//===---------------------------------------------------------------------===//

u32 to float conversion improvement:

float uint32_2_float( unsigned u ) {
  float fl = (int) (u & 0xffff);
  float fh = (int) (u >> 16);
  fh *= 0x1.0p16f;
  return fh + fl;
}

00000000        subl    $0x04,%esp
00000003        movl    0x08(%esp,1),%eax
00000007        movl    %eax,%ecx
00000009        shrl    $0x10,%ecx
0000000c        cvtsi2ss        %ecx,%xmm0
00000010        andl    $0x0000ffff,%eax
00000015        cvtsi2ss        %eax,%xmm1
00000019        mulss   0x00000078,%xmm0
00000021        addss   %xmm1,%xmm0
00000025        movss   %xmm0,(%esp,1)
0000002a        flds    (%esp,1)
0000002d        addl    $0x04,%esp
00000030        ret

//===---------------------------------------------------------------------===//

When using the fastcc ABI, align the stack slot of a double argument on an
8-byte boundary to improve performance.

//===---------------------------------------------------------------------===//

Codegen:

int f(int a, int b) {
  if (a == 4 || a == 6)
    b++;
  return b;
}

as:

or eax, 2
cmp eax, 6
jz label

//===---------------------------------------------------------------------===//

GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
simplifications for integer "x cmp y ? a : b".  For example, instead of:

int G;
void f(int X, int Y) {
  G = X < 0 ? 14 : 13;
}

compiling to:

_f:
        movl $14, %eax
        movl $13, %ecx
        movl 4(%esp), %edx
        testl %edx, %edx
        cmovl %eax, %ecx
        movl %ecx, _G
        ret

it could be:
_f:
        movl 4(%esp), %eax
        sarl $31, %eax
        notl %eax
        addl $14, %eax
        movl %eax, _G
        ret

etc.

Another is:
int usesbb(unsigned int a, unsigned int b) {
  return (a < b ? -1 : 0);
}
to:
_usesbb:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        sbbl %eax, %eax
        ret

instead of:
_usesbb:
        xorl %eax, %eax
        movl 8(%esp), %ecx
        cmpl %ecx, 4(%esp)
        movl $4294967295, %ecx
        cmovb %ecx, %eax
        ret

//===---------------------------------------------------------------------===//

Consider the expansion of:

define i32 @test3(i32 %X) {
  %tmp1 = urem i32 %X, 255
  ret i32 %tmp1
}

Currently it compiles to:

...
        movl $2155905153, %ecx
        movl 8(%esp), %esi
        movl %esi, %eax
        mull %ecx
...

This could be "reassociated" into:

        movl $2155905153, %eax
        movl 8(%esp), %ecx
        mull %ecx

to avoid the copy.  In fact, the existing two-address stuff would do this
except that mul isn't a commutative 2-addr instruction.  I guess this has
to be done at isel time based on the #uses to mul?

//===---------------------------------------------------------------------===//

Make sure the instruction which starts a loop does not cross a cacheline
boundary.  This requires knowing the exact length of each machine instruction.
That is somewhat complicated, but doable.  Example 256.bzip2:

In the new trace, the hot loop has an instruction which crosses a cacheline
boundary.  In addition to potential cache misses, this can't help decoding as I
imagine there has to be some kind of complicated decoder reset and realignment
to grab the bytes from the next cacheline.

532  532 0x3cfc movb     (1809(%esp, %esi), %bl   <<<--- spans 2 64 byte lines
942  942 0x3d03 movl     %dh, (1809(%esp, %esi)
937  937 0x3d0a incl     %esi
3    3   0x3d0b cmpb     %bl, %dl
27   27  0x3d0d jnz      0x000062db <main+11707>

//===---------------------------------------------------------------------===//

In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.

//===---------------------------------------------------------------------===//

This could be a single 16-bit load.

int f(char *p) {
  if ((p[0] == 1) & (p[1] == 2)) return 1;
  return 0;
}

//===---------------------------------------------------------------------===//

We should inline lrintf and probably other libc functions.
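
For example (a sketch, not from the original note), a direct call such as:

#include <math.h>

long round_to_long(float x) {
  return lrintf(x);   /* with SSE this could be inlined to a cvtss2si */
}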

//===---------------------------------------------------------------------===//

Use the FLAGS values from arithmetic instructions more.  For example, compile:

int add_zf(int *x, int y, int a, int b) {
  if ((*x += y) == 0)
    return a;
  else
    return b;
}

to:
        addl %esi, (%rdi)
        movl %edx, %eax
        cmovne %ecx, %eax
        ret
instead of:

_add_zf:
        addl (%rdi), %esi
        movl %esi, (%rdi)
        testl %esi, %esi
        cmove %edx, %ecx
        movl %ecx, %eax
        ret

As another example, compile function f2 in test/CodeGen/X86/cmp-test.ll
without a test instruction.

//===---------------------------------------------------------------------===//

These two functions have identical effects:

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}

We currently compile them to:

_f:
        movl 4(%esp), %eax
        movl %eax, %ecx
        incl %ecx
        movl 8(%esp), %edx
        cmpl %edx, %ecx
        jne LBB1_2      # UnifiedReturnBlock
LBB1_1: # cond_true
        addl $2, %eax
        ret
LBB1_2: # UnifiedReturnBlock
        movl %ecx, %eax
        ret
_f2:
        movl 4(%esp), %eax
        movl %eax, %ecx
        incl %ecx
        cmpl 8(%esp), %ecx
        sete %cl
        movzbl %cl, %ecx
        leal 1(%ecx,%eax), %eax
        ret

both of which are inferior to GCC's:

_f:
        movl 4(%esp), %edx
        leal 1(%edx), %eax
        addl $2, %edx
        cmpl 8(%esp), %eax
        cmove %edx, %eax
        ret
_f2:
        movl 4(%esp), %eax
        addl $1, %eax
        xorl %edx, %edx
        cmpl 8(%esp), %eax
        sete %dl
        addl %edx, %eax
        ret

//===---------------------------------------------------------------------===//

This code:

void test(int X) {
  if (X) abort();
}

is currently compiled to:

_test:
        subl $12, %esp
        cmpl $0, 16(%esp)
        jne LBB1_1
        addl $12, %esp
        ret
LBB1_1:
        call L_abort$stub

It would be better to produce:

_test:
        subl $12, %esp
        cmpl $0, 16(%esp)
        jne L_abort$stub
        addl $12, %esp
        ret

This can be applied to any no-return function call that takes no arguments etc.
Alternatively, the stack save/restore logic could be shrink-wrapped, producing
something like this:

_test:
        cmpl $0, 4(%esp)
        jne LBB1_1
        ret
LBB1_1:
        subl $12, %esp
        call L_abort$stub

Both are useful in different situations.  Finally, it could be shrink-wrapped
and tail called, like this:

_test:
        cmpl $0, 4(%esp)
        jne LBB1_1
        ret
LBB1_1:
        pop %eax   # realign stack.
        call L_abort$stub

Though this probably isn't worth it.

//===---------------------------------------------------------------------===//

Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
a neg instead of a sub instruction.  Consider:

int test(char X) { return 7-X; }

we currently produce:
_test:
        movl $7, %eax
        movsbl 4(%esp), %ecx
        subl %ecx, %eax
        ret

We would use one fewer register if codegen'd as:

        movsbl 4(%esp), %eax
        neg %eax
        add $7, %eax
        ret

Note that this isn't beneficial if the load can be folded into the sub.  In
this case, we want a sub:

int test(int X) { return 7-X; }
_test:
        movl $7, %eax
        subl 4(%esp), %eax
        ret

//===---------------------------------------------------------------------===//

Leaf functions that require one 4-byte spill slot have a prolog like this:

_foo:
        pushl %esi
        subl $4, %esp
...
and an epilog like this:
        addl $4, %esp
        popl %esi
        ret

It would be smaller, and potentially faster, to push eax on entry and to
pop into a dummy register instead of using addl/subl of esp.  Just don't pop
into any return registers :)

//===---------------------------------------------------------------------===//


The X86 backend should fold (branch (or (setcc, setcc))) into multiple
branches.  We generate really poor code for:

double testf(double a) {
  return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

For example, the entry BB is:

_testf:
        subl $20, %esp
        pxor %xmm0, %xmm0
        movsd 24(%esp), %xmm1
        ucomisd %xmm0, %xmm1
        setnp %al
        sete %cl
        testb %cl, %al
        jne LBB1_5  # UnifiedReturnBlock
LBB1_1: # cond_true


it would be better to replace the last four instructions with:

        jp LBB1_1
        je LBB1_5
LBB1_1:

We also codegen the inner ?: into a diamond:

        cvtss2sd LCPI1_0(%rip), %xmm2
        cvtss2sd LCPI1_1(%rip), %xmm3
        ucomisd %xmm1, %xmm0
        ja LBB1_3 # cond_true
LBB1_2: # cond_true
        movapd %xmm3, %xmm2
LBB1_3: # cond_true
        movapd %xmm2, %xmm0
        ret

We should sink the load into xmm3 into the LBB1_2 block.  This should
be pretty easy, and will nuke all the copies.

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

        _Z11no_overflowjj:
                addl %edi, %esi
                setae %al
                ret

FIXME: That code looks wrong; bool return is normally defined as zext.

on x86-64, not:

__Z11no_overflowjj:
        addl %edi, %esi
        cmpl %edi, %esi
        setae %al
        movzbl %al, %eax
        ret

//===---------------------------------------------------------------------===//

The following code:

bb114.preheader:                ; preds = %cond_next94
  %tmp231232 = sext i16 %tmp62 to i32           ; <i32> [#uses=1]
  %tmp233 = sub i32 32, %tmp231232              ; <i32> [#uses=1]
  %tmp245246 = sext i16 %tmp65 to i32           ; <i32> [#uses=1]
  %tmp252253 = sext i16 %tmp68 to i32           ; <i32> [#uses=1]
  %tmp254 = sub i32 32, %tmp252253              ; <i32> [#uses=1]
  %tmp553554 = bitcast i16* %tmp37 to i8*       ; <i8*> [#uses=2]
  %tmp583584 = sext i16 %tmp98 to i32           ; <i32> [#uses=1]
  %tmp585 = sub i32 32, %tmp583584              ; <i32> [#uses=1]
  %tmp614615 = sext i16 %tmp101 to i32          ; <i32> [#uses=1]
  %tmp621622 = sext i16 %tmp104 to i32          ; <i32> [#uses=1]
  %tmp623 = sub i32 32, %tmp621622              ; <i32> [#uses=1]
  br label %bb114

produces:

LBB3_5: # bb114.preheader
        movswl -68(%ebp), %eax
        movl $32, %ecx
        movl %ecx, -80(%ebp)
        subl %eax, -80(%ebp)
        movswl -52(%ebp), %eax
        movl %ecx, -84(%ebp)
        subl %eax, -84(%ebp)
        movswl -70(%ebp), %eax
        movl %ecx, -88(%ebp)
        subl %eax, -88(%ebp)
        movswl -50(%ebp), %eax
        subl %eax, %ecx
        movl %ecx, -76(%ebp)
        movswl -42(%ebp), %eax
        movl %eax, -92(%ebp)
        movswl -66(%ebp), %eax
        movl %eax, -96(%ebp)
        movw $0, -98(%ebp)

This appears to be bad because the RA is not folding the store to the stack
slot into the movl.  The above instructions could be:
        movl $32, -80(%ebp)
...
        movl $32, -84(%ebp)
...
This seems like a cross between remat and spill folding.

This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't
change, so we could simply subtract %eax from %ecx first and then use %ecx (or
vice-versa).

//===---------------------------------------------------------------------===//

This code:

        %tmp659 = icmp slt i16 %tmp654, 0       ; <i1> [#uses=1]
        br i1 %tmp659, label %cond_true662, label %cond_next715

produces this:

        testw %cx, %cx
        movswl %cx, %esi
        jns LBB4_109    # cond_next715

Shark tells us that using %cx in the testw instruction is sub-optimal. It
suggests using the 32-bit register (which is what ICC uses).

//===---------------------------------------------------------------------===//

We compile this:

void compare (long long foo) {
  if (foo < 4294967297LL)
    abort();
}

to:

compare:
        subl $4, %esp
        cmpl $0, 8(%esp)
        setne %al
        movzbw %al, %ax
        cmpl $1, 12(%esp)
        setg %cl
        movzbw %cl, %cx
        cmove %ax, %cx
        testb $1, %cl
        jne .LBB1_2 # UnifiedReturnBlock
.LBB1_1: # ifthen
        call abort
.LBB1_2: # UnifiedReturnBlock
        addl $4, %esp
        ret

(also really horrible code on ppc).  This is due to the expand code for 64-bit
compares.  GCC produces multiple branches, which is much nicer:

compare:
        subl $12, %esp
        movl 20(%esp), %edx
        movl 16(%esp), %eax
        decl %edx
        jle .L7
.L5:
        addl $12, %esp
        ret
        .p2align 4,,7
.L7:
        jl .L4
        cmpl $0, %eax
        .p2align 4,,8
        ja .L5
.L4:
        .p2align 4,,9
        call abort

//===---------------------------------------------------------------------===//

Tail call optimization improvements: Tail call optimization currently
pushes all arguments onto the top of the stack (their normal place for
non-tail-call-optimized calls) if they source from the caller's arguments
or from a virtual register (which may itself source from the caller's
arguments).  This is done to prevent overwriting parameters (see the
example below) that might be used later.

example:

int callee(int32, int64);
int caller(int32 arg1, int32 arg2) {
  int64 local = arg2 * 2;
  return callee(arg2, (int64)local);
}

[arg1]      [!arg2 no longer valid since we moved local onto it]
[arg2]  ->  [(int64)
[RETADDR]    local  ]

Moving arg1 onto the stack slot of the callee function would overwrite
arg2 of the caller.

Possible optimizations:

 - Analyse the actual parameters of the callee to see which would
   overwrite a caller parameter which is used by the callee, and push
   only those onto the top of the stack.

   int callee (int32 arg1, int32 arg2);
   int caller (int32 arg1, int32 arg2) {
       return callee(arg1,arg2);
   }

   Here we don't need to write any variables to the top of the stack
   since they don't overwrite each other.

   int callee (int32 arg1, int32 arg2);
   int caller (int32 arg1, int32 arg2) {
       return callee(arg2,arg1);
   }

   Here we need to push the arguments because they overwrite each
   other.

//===---------------------------------------------------------------------===//

main ()
{
  int i = 0;
  unsigned long int z = 0;

  do {
    z -= 0x00004000;
    i++;
    if (i > 0x00040000)
      abort ();
  } while (z > 0);
  exit (0);
}

gcc compiles this to:

_main:
        subl $28, %esp
        xorl %eax, %eax
        jmp L2
L3:
        cmpl $262144, %eax
        je L10
L2:
        addl $1, %eax
        cmpl $262145, %eax
        jne L3
        call L_abort$stub
L10:
        movl $0, (%esp)
        call L_exit$stub

llvm:

_main:
        subl $12, %esp
        movl $1, %eax
        movl $16384, %ecx
LBB1_1: # bb
        cmpl $262145, %eax
        jge LBB1_4      # cond_true
LBB1_2: # cond_next
        incl %eax
        addl $4294950912, %ecx
        cmpl $16384, %ecx
        jne LBB1_1      # bb
LBB1_3: # bb11
        xorl %eax, %eax
        addl $12, %esp
        ret
LBB1_4: # cond_true
        call L_abort$stub

1. LSR should rewrite the first cmp with induction variable %ecx.
2. DAG combiner should fold
        leal 1(%eax), %edx
        cmpl $262145, %edx
   =>
        cmpl $262144, %eax

//===---------------------------------------------------------------------===//

define i64 @test(double %X) {
  %Y = fptosi double %X to i64
  ret i64 %Y
}

compiles to:

_test:
        subl $20, %esp
        movsd 24(%esp), %xmm0
        movsd %xmm0, 8(%esp)
        fldl 8(%esp)
        fisttpll (%esp)
        movl 4(%esp), %edx
        movl (%esp), %eax
        addl $20, %esp
        #FP_REG_KILL
        ret

This should just fldl directly from the input stack slot.

//===---------------------------------------------------------------------===//

This code:
int foo (int x) { return (x & 65535) | 255; }

Should compile into:

_foo:
        movzwl 4(%esp), %eax
        orl $255, %eax
        ret

instead of:
_foo:
        movl $255, %eax
        orl 4(%esp), %eax
        andl $65535, %eax
        ret

//===---------------------------------------------------------------------===//

We're codegen'ing multiply of long longs inefficiently:

unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) {
  return arg1 * arg2;
}

We compile to (with -fomit-frame-pointer):

_LLM:
        pushl %esi
        movl 8(%esp), %ecx
        movl 16(%esp), %esi
        movl %esi, %eax
        mull %ecx
        imull 12(%esp), %esi
        addl %edx, %esi
        imull 20(%esp), %ecx
        movl %esi, %edx
        addl %ecx, %edx
        popl %esi
        ret

This looks like a scheduling deficiency and lack of remat of the load from
the argument area.  ICC apparently produces:

        movl 8(%esp), %ecx
        imull 12(%esp), %ecx
        movl 16(%esp), %eax
        imull 4(%esp), %eax
        addl %eax, %ecx
        movl 4(%esp), %eax
        mull 12(%esp)
        addl %ecx, %edx
        ret

Note that it remat'd loads from 4(esp) and 12(esp).  See this GCC PR:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236

//===---------------------------------------------------------------------===//

We can fold a store into "zeroing a reg".  Instead of:

xorl %eax, %eax
movl %eax, 124(%esp)

we should get:

movl $0, 124(%esp)

if the flags of the xor are dead.

Likewise, we isel "x<<1" into "add reg,reg".  If reg is spilled, this should
be folded into: shl [mem], 1

//===---------------------------------------------------------------------===//

This testcase misses a read/modify/write opportunity (from PR1425):

void vertical_decompose97iH1(int *b0, int *b1, int *b2, int width){
    int i;
    for(i=0; i<width; i++)
        b1[i] += (1*(b0[i] + b2[i])+0)>>0;
}

We compile it down to:

LBB1_2: # bb
        movl (%esi,%edi,4), %ebx
        addl (%ecx,%edi,4), %ebx
        addl (%edx,%edi,4), %ebx
        movl %ebx, (%ecx,%edi,4)
        incl %edi
        cmpl %eax, %edi
        jne LBB1_2      # bb

the inner loop should add to the memory location (%ecx,%edi,4), saving
a mov.  Something like:

        movl (%esi,%edi,4), %ebx
        addl (%edx,%edi,4), %ebx
        addl %ebx, (%ecx,%edi,4)

Here is another interesting example:

void vertical_compose97iH1(int *b0, int *b1, int *b2, int width){
    int i;
    for(i=0; i<width; i++)
        b1[i] -= (1*(b0[i] + b2[i])+0)>>0;
}

We miss the r/m/w opportunity here by using 2 subs instead of an add+sub[mem]:

LBB9_2: # bb
        movl (%ecx,%edi,4), %ebx
        subl (%esi,%edi,4), %ebx
        subl (%edx,%edi,4), %ebx
        movl %ebx, (%ecx,%edi,4)
        incl %edi
        cmpl %eax, %edi
        jne LBB9_2      # bb

Additionally, LSR should rewrite the exit condition of these loops to use
a stride-4 IV, which would allow all the scales in the loop to go away.
This would result in smaller code and more efficient microops.

//===---------------------------------------------------------------------===//

In SSE mode, we turn abs and neg into a load from the constant pool plus a xor
or and instruction, for example:

        xorpd LCPI1_0, %xmm2

However, if xmm2 gets spilled, we end up with really ugly code like this:

        movsd (%esp), %xmm0
        xorpd LCPI1_0, %xmm0
        movsd %xmm0, (%esp)

Since we 'know' that this is a 'neg', we can actually "fold" the spill into
the neg/abs instruction, turning it into an *integer* operation, like this:

        xorl 2147483648, [mem+4]     ## 2147483648 = (1 << 31)

you could also use xorb, but xorl is less likely to lead to a partial register
stall.  Here is a contrived testcase:

double a, b, c;
void test(double *P) {
  double X = *P;
  a = X;
  bar();
  X = -X;
  b = X;
  bar();
  c = X;
}

//===---------------------------------------------------------------------===//

handling llvm.memory.barrier on pre SSE2 cpus

should generate:
lock ; mov %esp, %esp

//===---------------------------------------------------------------------===//

The code generated on x86 for checking for signed overflow on a multiply done
the obvious way is much longer than it needs to be.

int x(int a, int b) {
  long long prod = (long long)a*b;
  return prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1);
}

See PR2053 for more details.

//===---------------------------------------------------------------------===//

We should investigate using cdq/cltd (effect: edx = sar eax, 31)
more aggressively; it should cost the same as a move+shift on any modern
processor, but it's a lot shorter. Downside is that it puts more
pressure on register allocation because it has fixed operands.

Example:
int abs(int x) {return x < 0 ? -x : x;}

gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.:
abs:
        movl 4(%esp), %eax
        cltd
        xorl %edx, %eax
        subl %edx, %eax
        ret

//===---------------------------------------------------------------------===//

Consider:
int test(unsigned long a, unsigned long b) { return -(a < b); }

We currently compile this to:

define i32 @test(i32 %a, i32 %b) nounwind {
  %tmp3 = icmp ult i32 %a, %b           ; <i1> [#uses=1]
  %tmp34 = zext i1 %tmp3 to i32         ; <i32> [#uses=1]
  %tmp5 = sub i32 0, %tmp34             ; <i32> [#uses=1]
  ret i32 %tmp5
}

and

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setb %al
        movzbl %al, %eax
        negl %eax
        ret

Several deficiencies here.  First, we should instcombine zext+neg into sext:

define i32 @test2(i32 %a, i32 %b) nounwind {
  %tmp3 = icmp ult i32 %a, %b           ; <i1> [#uses=1]
  %tmp34 = sext i1 %tmp3 to i32         ; <i32> [#uses=1]
  ret i32 %tmp34
}

However, before we can do that, we have to fix the bad codegen that we get for
sext from bool:

_test2:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setb %al
        movzbl %al, %eax
        shll $31, %eax
        sarl $31, %eax
        ret

This code should be at least as good as the code above.  Once this is fixed, we
can optimize this specific case even more to:

        movl 8(%esp), %eax
        xorl %ecx, %ecx
        cmpl %eax, 4(%esp)
        sbbl %ecx, %ecx

//===---------------------------------------------------------------------===//

Take the following code (from
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541):

extern unsigned char first_one[65536];
int FirstOnet(unsigned long long arg1)
{
  if (arg1 >> 48)
    return (first_one[arg1 >> 48]);
  return 0;
}


The following code is currently generated:
FirstOnet:
        movl 8(%esp), %eax
        cmpl $65536, %eax
        movl 4(%esp), %ecx
        jb .LBB1_2      # UnifiedReturnBlock
.LBB1_1: # ifthen
        shrl $16, %eax
        movzbl first_one(%eax), %eax
        ret
.LBB1_2: # UnifiedReturnBlock
        xorl %eax, %eax
        ret

There are a few possible improvements here:
1. We should be able to eliminate the dead load into %ecx
2. We could change the "movl 8(%esp), %eax" into
   "movzwl 10(%esp), %eax"; this lets us change the cmpl
   into a testl, which is shorter, and eliminate the shift.

We could also in theory eliminate the branch by using a conditional
for the address of the load, but that seems unlikely to be worthwhile
in general.

//===---------------------------------------------------------------------===//

We compile this function:

define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext %d) nounwind {
entry:
  %tmp2 = icmp eq i8 %d, 0              ; <i1> [#uses=1]
  br i1 %tmp2, label %bb7, label %bb

bb:             ; preds = %entry
  %tmp6 = add i32 %b, %a                ; <i32> [#uses=1]
  ret i32 %tmp6

bb7:            ; preds = %entry
  %tmp10 = sub i32 %a, %c               ; <i32> [#uses=1]
  ret i32 %tmp10
}

to:

_foo:
        cmpb $0, 16(%esp)
        movl 12(%esp), %ecx
        movl 8(%esp), %eax
        movl 4(%esp), %edx
        je LBB1_2       # bb7
LBB1_1: # bb
        addl %edx, %eax
        ret
LBB1_2: # bb7
        movl %edx, %eax
        subl %ecx, %eax
        ret

The coalescer could coalesce "edx" with "eax" to avoid the movl in LBB1_2
if it commuted the addl in LBB1_1.

//===---------------------------------------------------------------------===//

See rdar://4653682.

From flops:

LBB1_15:        # bb310
        cvtss2sd LCPI1_0, %xmm1
        addsd %xmm1, %xmm0
        movsd 176(%esp), %xmm2
        mulsd %xmm0, %xmm2
        movapd %xmm2, %xmm3
        mulsd %xmm3, %xmm3
        movapd %xmm3, %xmm4
        mulsd LCPI1_23, %xmm4
        addsd LCPI1_24, %xmm4
        mulsd %xmm3, %xmm4
        addsd LCPI1_25, %xmm4
        mulsd %xmm3, %xmm4
        addsd LCPI1_26, %xmm4
        mulsd %xmm3, %xmm4
        addsd LCPI1_27, %xmm4
        mulsd %xmm3, %xmm4
        addsd LCPI1_28, %xmm4
        mulsd %xmm3, %xmm4
        addsd %xmm1, %xmm4
        mulsd %xmm2, %xmm4
        movsd 152(%esp), %xmm1
        addsd %xmm4, %xmm1
        movsd %xmm1, 152(%esp)
        incl %eax
        cmpl %eax, %esi
        jge LBB1_15     # bb310
LBB1_16:        # bb358.loopexit
        movsd 152(%esp), %xmm0
        addsd %xmm0, %xmm0
        addsd LCPI1_22, %xmm0
        movsd %xmm0, 152(%esp)

Rather than spilling the result of the last addsd in the loop, we should have
inserted a copy to split the interval (one for the duration of the loop, one
extending to the fall through). The register pressure in the loop isn't high
enough to warrant the spill.

Also check why xmm7 is not used at all in the function.

//===---------------------------------------------------------------------===//

Legalize loses track of the fact that bools are always zero extended when in
memory.  This causes us to compile abort_gzip (from 164.gzip) from:

target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
target triple = "i386-apple-darwin8"
@in_exit.4870.b = internal global i1 false              ; <i1*> [#uses=2]
define fastcc void @abort_gzip() noreturn nounwind {
entry:
  %tmp.b.i = load i1* @in_exit.4870.b           ; <i1> [#uses=1]
  br i1 %tmp.b.i, label %bb.i, label %bb4.i
bb.i:           ; preds = %entry
  tail call void @exit( i32 1 ) noreturn nounwind
  unreachable
bb4.i:          ; preds = %entry
  store i1 true, i1* @in_exit.4870.b
  tail call void @exit( i32 1 ) noreturn nounwind
  unreachable
}
declare void @exit(i32) noreturn nounwind

into:

_abort_gzip:
        subl $12, %esp
        movb _in_exit.4870.b, %al
        notb %al
        testb $1, %al
        jne LBB1_2      ## bb4.i
LBB1_1: ## bb.i
        ...

//===---------------------------------------------------------------------===//

We compile:

int test(int x, int y) {
  return x-y-1;
}

into (-m64):

_test:
        decl %edi
        movl %edi, %eax
        subl %esi, %eax
        ret

it would be better to codegen as: x+~y  (notl+addl)

//===---------------------------------------------------------------------===//

This code:

int foo(const char *str,...)
{
  __builtin_va_list a; int x;
  __builtin_va_start(a,str); x = __builtin_va_arg(a,int); __builtin_va_end(a);
  return x;
}

gets compiled into this on x86-64:
        subq $200, %rsp
        movaps %xmm7, 160(%rsp)
        movaps %xmm6, 144(%rsp)
        movaps %xmm5, 128(%rsp)
        movaps %xmm4, 112(%rsp)
        movaps %xmm3, 96(%rsp)
        movaps %xmm2, 80(%rsp)
        movaps %xmm1, 64(%rsp)
        movaps %xmm0, 48(%rsp)
        movq %r9, 40(%rsp)
        movq %r8, 32(%rsp)
        movq %rcx, 24(%rsp)
        movq %rdx, 16(%rsp)
        movq %rsi, 8(%rsp)
        leaq (%rsp), %rax
        movq %rax, 192(%rsp)
        leaq 208(%rsp), %rax
        movq %rax, 184(%rsp)
        movl $48, 180(%rsp)
        movl $8, 176(%rsp)
        movl 176(%rsp), %eax
        cmpl $47, %eax
        jbe .LBB1_3     # bb
.LBB1_1:        # bb3
        movq 184(%rsp), %rcx
        leaq 8(%rcx), %rax
        movq %rax, 184(%rsp)
.LBB1_2:        # bb4
        movl (%rcx), %eax
        addq $200, %rsp
        ret
.LBB1_3:        # bb
        movl %eax, %ecx
        addl $8, %eax
        addq 192(%rsp), %rcx
        movl %eax, 176(%rsp)
        jmp .LBB1_2     # bb4

gcc 4.3 generates:
        subq $96, %rsp
.LCFI0:
        leaq 104(%rsp), %rax
        movq %rsi, -80(%rsp)
        movl $8, -120(%rsp)
        movq %rax, -112(%rsp)
        leaq -88(%rsp), %rax
        movq %rax, -104(%rsp)
        movl $8, %eax
        cmpl $48, %eax
        jb .L6
        movq -112(%rsp), %rdx
        movl (%rdx), %eax
        addq $96, %rsp
        ret
        .p2align 4,,10
        .p2align 3
.L6:
        mov %eax, %edx
        addq -104(%rsp), %rdx
        addl $8, %eax
        movl %eax, -120(%rsp)
        movl (%rdx), %eax
        addq $96, %rsp
        ret

and it gets compiled into this on x86:
        pushl %ebp
        movl %esp, %ebp
        subl $4, %esp
        leal 12(%ebp), %eax
        movl %eax, -4(%ebp)
        leal 16(%ebp), %eax
        movl %eax, -4(%ebp)
        movl 12(%ebp), %eax
        addl $4, %esp
        popl %ebp
        ret

gcc 4.3 generates:
        pushl %ebp
        movl %esp, %ebp
        movl 12(%ebp), %eax
        popl %ebp
        ret

//===---------------------------------------------------------------------===//

Teach tblgen not to check bitconvert source type in some cases. This allows us
to consolidate the following patterns in X86InstrMMX.td:

def : Pat<(v2i32 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
                                                  (iPTR 0))))),
          (v2i32 (MMX_MOVDQ2Qrr VR128:$src))>;
def : Pat<(v4i16 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
                                                  (iPTR 0))))),
          (v4i16 (MMX_MOVDQ2Qrr VR128:$src))>;
def : Pat<(v8i8 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
                                                 (iPTR 0))))),
          (v8i8 (MMX_MOVDQ2Qrr VR128:$src))>;

There are other cases in various td files.

//===---------------------------------------------------------------------===//

Take something like the following on x86-32:
unsigned a(unsigned long long x, unsigned y) {return x % y;}

We currently generate a libcall, but we really shouldn't: the expansion is
shorter and likely faster than the libcall.  The expected code is something
like the following:

        movl 12(%ebp), %eax
        movl 16(%ebp), %ecx
        xorl %edx, %edx
        divl %ecx
        movl 8(%ebp), %eax
        divl %ecx
        movl %edx, %eax
        ret

A similar code sequence works for division.

//===---------------------------------------------------------------------===//

These should compile to the same code, but the latter codegens to useless
instructions on X86. This may be a trivial dag combine (GCC PR7061):

struct s1 { unsigned char a, b; };
unsigned long f1(struct s1 x) {
    return x.a + x.b;
}
struct s2 { unsigned a: 8, b: 8; };
unsigned long f2(struct s2 x) {
    return x.a + x.b;
}

//===---------------------------------------------------------------------===//

We currently compile this:

define i32 @func1(i32 %v1, i32 %v2) nounwind {
entry:
  %t = call {i32, i1} @llvm.sadd.with.overflow.i32(i32 %v1, i32 %v2)
  %sum = extractvalue {i32, i1} %t, 0
  %obit = extractvalue {i32, i1} %t, 1
  br i1 %obit, label %overflow, label %normal
normal:
  ret i32 %sum
overflow:
  call void @llvm.trap()
  unreachable
}
declare {i32, i1} @llvm.sadd.with.overflow.i32(i32, i32)
declare void @llvm.trap()

to:

_func1:
        movl 4(%esp), %eax
        addl 8(%esp), %eax
        jo LBB1_2       ## overflow
LBB1_1: ## normal
        ret
LBB1_2: ## overflow
        ud2

it would be nice to produce "into" someday.

//===---------------------------------------------------------------------===//

This code:

void vec_mpys1(int y[], const int x[], int scaler) {
  int i;
  for (i = 0; i < 150; i++)
    y[i] += (((long long)scaler * (long long)x[i]) >> 31);
}

Compiles to this loop with GCC 3.x:

.L5:
        movl %ebx, %eax
        imull (%edi,%ecx,4)
        shrdl $31, %edx, %eax
        addl %eax, (%esi,%ecx,4)
        incl %ecx
        cmpl $149, %ecx
        jle .L5

llvm-gcc compiles it to the much uglier:

LBB1_1: ## bb1
        movl 24(%esp), %eax
        movl (%eax,%edi,4), %ebx
        movl %ebx, %ebp
        imull %esi, %ebp
        movl %ebx, %eax
        mull %ecx
        addl %ebp, %edx
        sarl $31, %ebx
        imull %ecx, %ebx
        addl %edx, %ebx
        shldl $1, %eax, %ebx
        movl 20(%esp), %eax
        addl %ebx, (%eax,%edi,4)
        incl %edi
        cmpl $150, %edi
        jne LBB1_1      ## bb1

The issue is that we hoist the cast of "scaler" to long long outside of the
loop, the value comes into the loop as two values, and
RegsForValue::getCopyFromRegs doesn't know how to put an AssertSext on the
constructed BUILD_PAIR which represents the cast value.

//===---------------------------------------------------------------------===//

Test instructions can be eliminated by using EFLAGS values from arithmetic
instructions. This is currently not done for mul, and, or, xor, neg, shl,
sra, srl, shld, shrd, atomic ops, and others. It is also currently not done
for read-modify-write instructions. It is also currently not done if the
OF or CF flags are needed.

The shift operators have the complication that when the shift count is
zero, EFLAGS is not set, so they can only subsume a test instruction if
the shift count is known to be non-zero. Also, using the EFLAGS value
from a shift is apparently very slow on some x86 implementations.

In read-modify-write instructions, the root node in the isel match is
the store, and isel has no way for the use of the EFLAGS result of the
arithmetic to be remapped to the new node.

Add and subtract instructions set OF on signed overflow and CF on unsigned
overflow, while test instructions always clear OF and CF. In order to
replace a test with an add or subtract in a situation where OF or CF is
needed, codegen must be able to prove that the operation cannot see
signed or unsigned overflow, respectively.

//===---------------------------------------------------------------------===//

memcpy/memmove do not lower to SSE copies when possible. A silly example is:
define <16 x float> @foo(<16 x float> %A) nounwind {
  %tmp = alloca <16 x float>, align 16
  %tmp2 = alloca <16 x float>, align 16
  store <16 x float> %A, <16 x float>* %tmp
  %s = bitcast <16 x float>* %tmp to i8*
  %s2 = bitcast <16 x float>* %tmp2 to i8*
  call void @llvm.memcpy.i64(i8* %s, i8* %s2, i64 64, i32 16)
  %R = load <16 x float>* %tmp2
  ret <16 x float> %R
}

declare void @llvm.memcpy.i64(i8* nocapture, i8* nocapture, i64, i32) nounwind

which compiles to:

_foo:
        subl $140, %esp
        movaps %xmm3, 112(%esp)
        movaps %xmm2, 96(%esp)
        movaps %xmm1, 80(%esp)
        movaps %xmm0, 64(%esp)
        movl 60(%esp), %eax
        movl %eax, 124(%esp)
        movl 56(%esp), %eax
        movl %eax, 120(%esp)
        movl 52(%esp), %eax
        <many many more 32-bit copies>
        movaps (%esp), %xmm0
        movaps 16(%esp), %xmm1
        movaps 32(%esp), %xmm2
        movaps 48(%esp), %xmm3
        addl $140, %esp
        ret

On Nehalem, it may even be cheaper to just use movups when unaligned than to
fall back to lower-granularity chunks.

//===---------------------------------------------------------------------===//

Implement processor-specific optimizations for parity with GCC on these
processors.  GCC does two optimizations:

1. ix86_pad_returns inserts a noop before ret instructions if immediately
   preceded by a conditional branch or is the target of a jump.
2. ix86_avoid_jump_misspredicts inserts noops in cases where a 16-byte block of
   code contains more than 3 branches.

The first one is done for all AMDs, Core2, and "Generic".
The second one is done for: Atom, Pentium Pro, all AMDs, Pentium 4, Nocona,
Core 2, and "Generic".

//===---------------------------------------------------------------------===//

Testcase:
int a(int x) { return (x & 127) > 31; }

Current output:
        movl 4(%esp), %eax
        andl $127, %eax
        cmpl $31, %eax
        seta %al
        movzbl %al, %eax
        ret

Ideal output:
        xorl %eax, %eax
        testl $96, 4(%esp)
        setne %al
        ret

This should definitely be done in instcombine, canonicalizing the range
condition into a != condition.  We get this IR:

define i32 @a(i32 %x) nounwind readnone {
entry:
  %0 = and i32 %x, 127          ; <i32> [#uses=1]
  %1 = icmp ugt i32 %0, 31      ; <i1> [#uses=1]
  %2 = zext i1 %1 to i32        ; <i32> [#uses=1]
  ret i32 %2
}

Instcombine prefers to strength reduce relational comparisons to equality
comparisons when possible; this should be another case of that.  This could
be handled pretty easily in InstCombiner::visitICmpInstWithInstAndIntCst, but it
looks like InstCombiner::visitICmpInstWithInstAndIntCst should really already
be redesigned to use ComputeMaskedBits and friends.

//===---------------------------------------------------------------------===//

Testcase:
int x(int a) { return (a&0xf0)>>4; }

Current output:
        movl 4(%esp), %eax
        shrl $4, %eax
        andl $15, %eax
        ret

Ideal output:
        movzbl 4(%esp), %eax
        shrl $4, %eax
        ret

//===---------------------------------------------------------------------===//

Testcase:
int x(int a) { return (a & 0x80) ? 0x100 : 0; }
int y(int a) { return (a & 0x80) * 2; }

Current:
        testl $128, 4(%esp)
        setne %al
        movzbl %al, %eax
        shll $8, %eax
        ret

Better:
        movl 4(%esp), %eax
        addl %eax, %eax
        andl $256, %eax
        ret

This is another general instcombine transformation that is profitable on all
targets.  In LLVM IR, these functions look like this:

define i32 @x(i32 %a) nounwind readnone {
entry:
  %0 = and i32 %a, 128
  %1 = icmp eq i32 %0, 0
  %iftmp.0.0 = select i1 %1, i32 0, i32 256
  ret i32 %iftmp.0.0
}

define i32 @y(i32 %a) nounwind readnone {
entry:
  %0 = shl i32 %a, 1
  %1 = and i32 %0, 256
  ret i32 %1
}

Replacing an icmp+select with a shift should always be considered profitable in
instcombine.

//===---------------------------------------------------------------------===//

Re-implement atomic builtins __sync_add_and_fetch() and __sync_sub_and_fetch
properly.

When the return value is not used (i.e. we only care about the value in
memory), x86 does not have to use add to implement these. Instead, it can use
add, sub, inc, dec instructions with the "lock" prefix.

This is currently implemented using a bit of an instruction selection trick.
The issue is that the target independent pattern produces one output and a
chain, and we want to map it into one that just outputs a chain. The current
trick is to select it into a MERGE_VALUES with the first definition being an
implicit_def. The proper solution is to add new ISD opcodes for the no-output
variant. The DAG combiner can then transform the node before it gets to target
node selection.

Problem #2 is that we are adding a whole bunch of x86 atomic instructions when
in fact these instructions are identical to the non-lock versions. We need a
way to add target-specific information to target nodes and have this
information carried over to machine instructions. The asm printer (or JIT) can
use this information to add the "lock" prefix.
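
For example (a sketch, not from the original note), a call whose result is
ignored, which could become a single "lock incl" or "lock addl":

void atomic_bump(volatile int *p) {
  __sync_add_and_fetch(p, 1);   /* return value unused */
}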

//===---------------------------------------------------------------------===//

_Bool bar(int *x) { return *x & 1; }

define zeroext i1 @bar(i32* nocapture %x) nounwind readonly {
entry:
  %tmp1 = load i32* %x                  ; <i32> [#uses=1]
  %and = and i32 %tmp1, 1               ; <i32> [#uses=1]
  %tobool = icmp ne i32 %and, 0         ; <i1> [#uses=1]
  ret i1 %tobool
}

bar:                                    # @bar
# BB#0:                                 # %entry
        movl 4(%esp), %eax
        movb (%eax), %al
        andb $1, %al
        movzbl %al, %eax
        ret

Missed optimization: should be movl+andl.

//===---------------------------------------------------------------------===//

Consider the following two functions compiled with clang:
_Bool foo(int *x) { return !(*x & 4); }
unsigned bar(int *x) { return !(*x & 4); }

foo:
        movl 4(%esp), %eax
        testb $4, (%eax)
        sete %al
        movzbl %al, %eax
        ret

bar:
        movl 4(%esp), %eax
        movl (%eax), %eax
        shrl $2, %eax
        andl $1, %eax
        xorl $1, %eax
        ret

The second function generates more code even though the two functions are
functionally identical.

//===---------------------------------------------------------------------===//

Take the following C code:
int x(int y) { return (y & 63) << 14; }

Code produced by gcc:
        andl $63, %edi
        sall $14, %edi
        movl %edi, %eax
        ret

Code produced by clang:
        shll $14, %edi
        movl %edi, %eax
        andl $1032192, %eax
        ret

The code produced by gcc is 3 bytes shorter.  This sort of construct often
shows up with bitfields.

//===---------------------------------------------------------------------===//