//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

//===---------------------------------------------------------------------===//

CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move. The X86
backend knows how to three-addressify this shift, but it appears the register
allocator isn't even asking it to do so in this case. We should investigate
why this isn't happening; it could have a significant impact on other important
cases for X86 as well.

//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
        return X/Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224

//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.

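For reference, C sources exhibiting those two patterns might look like this
(hypothetical helper names, same spirit as foo above):

unsigned long long mask_low(unsigned X)  { return ~0ULL >> X; }
unsigned long long mask_high(unsigned X) { return ~0ULL << X; }
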
One better solution for 1LL << x is:
        xorl %eax, %eax
        xorl %edx, %edx
        testb $32, %cl
        sete %al
        setne %dl
        sall %cl, %eax
        sall %cl, %edx

But that requires good 8-bit subreg support.

Also, this might be better. It's an extra shift, but it's one instruction
shorter, and doesn't stress 8-bit subreg support.
(From http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html,
but without the unnecessary and.)
        movl %ecx, %eax
        shrl $5, %eax
        movl %eax, %edx
        xorl $1, %edx
        sall %cl, %eax
        sall %cl, %edx

64-bit shifts (in general) expand to really bad code. Instead of using
cmovs, we should expand to a conditional branch like GCC produces.

//===---------------------------------------------------------------------===//

Compile this:
_Bool f(_Bool a) { return a!=1; }

into:
        movzbl %dil, %eax
        xorl $1, %eax
        ret

(Although note that this isn't a legal way to express the code that llvm-gcc
currently generates for that function.)

//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure. E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?

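A sketch of the kind of code in question (hypothetical): i16 arithmetic whose
result lands in a 16-bit subregister, e.g.

unsigned short f(unsigned short a, unsigned short b) {
  return a + b;  /* a 16-bit add writes only %ax; a later read of %eax
                    can stall waiting for the partial update to merge */
}
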
//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and hint to the register
allocator. Delay codegen until post register allocation.
Note: any_extend is now turned into an INSERT_SUBREG. We still need to teach
the coalescer how to deal with it, though.

//===---------------------------------------------------------------------===//

It appears icc uses push for parameter passing. We need to investigate.

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 due to only updating some processor
flags.

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.

//===---------------------------------------------------------------------===//

How about intrinsics? An example is:
  *res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));

compiles to
        pmuludq (%eax), %xmm0
        movl 8(%esp), %eax
        movdqa (%eax), %xmm1
        pmulhuw %xmm0, %xmm1

The transformation probably requires an X86-specific pass or a DAG combiner
target-specific hook.

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setl %al
        movzbl %al, %eax
        ret

On some processors (which ones?), it is more efficient to do this:

_test:
        movl 8(%esp), %ebx
        xor %eax, %eax
        cmpl %ebx, 4(%esp)
        setl %al
        ret

Doing this correctly is tricky though, as the xor clobbers the flags.

//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. E.g., for:

void setbit(int *target, int bit) {
  *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
  *target &= ~(1 << bit);
}

//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

        movl $16843009, 4(%edx)
        movl $16843009, (%edx)
        movw $257, 8(%edx)

It might be better to generate

        movl $16843009, %eax
        movl %eax, 4(%edx)
        movl %eax, (%edx)
        movw %ax, 8(%edx)

when we can spare a register. It reduces code size.

//===---------------------------------------------------------------------===//

Evaluate the best way to codegen sdiv X, (2^C). For X/8, we currently get this:

define i32 @test1(i32 %X) {
        %Y = sdiv i32 %X, 8
        ret i32 %Y
}

_test1:
        movl 4(%esp), %eax
        movl %eax, %ecx
        sarl $31, %ecx
        shrl $29, %ecx
        addl %ecx, %eax
        sarl $3, %eax
        ret

GCC knows several different ways to codegen it, one of which is this:

_test1:
        movl 4(%esp), %eax
        cmpl $-1, %eax
        leal 7(%eax), %ecx
        cmovle %ecx, %eax
        sarl $3, %eax
        ret

which is probably slower, but it's interesting at least :)

//===---------------------------------------------------------------------===//

We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl. We should leave these as libcalls for everything over a much lower
threshold, since libc is hand-tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc.).

//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
 x * copysign(1.0, y) * copysign(1.0, z)

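One plausible "reasonable" form, sketched in C (assuming IEEE-754 doubles and
setting aside NaN sign subtleties): the whole product is just x with its sign
bit flipped by sign(y) XOR sign(z).

#include <string.h>

double f(double x, double y, double z) {
  unsigned long long xi, yi, zi;
  memcpy(&xi, &x, 8);
  memcpy(&yi, &y, 8);
  memcpy(&zi, &z, 8);
  xi ^= (yi ^ zi) & (1ULL << 63);  /* flip x's sign by sign(y) ^ sign(z) */
  memcpy(&x, &xi, 8);
  return x;
}
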
//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.

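That is, something like this sketch (assuming IEEE-754 doubles;
copysign_intload is a hypothetical name):

#include <string.h>

double copysign_intload(double x, double *y) {
  unsigned long long xi, yi;
  memcpy(&xi, &x, 8);
  memcpy(&yi, y, 8);  /* integer load of y's bits; no FP load of *y needed */
  xi = (xi & ~(1ULL << 63)) | (yi & (1ULL << 63));
  memcpy(&x, &xi, 8);
  return x;
}
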
//===---------------------------------------------------------------------===//

%X = weak global int 0

void %foo(int %N) {
        %N = cast int %N to uint
        %tmp.24 = setgt int %N, 0
        br bool %tmp.24, label %no_exit, label %return

no_exit:
        %indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
        %i.0.0 = cast uint %indvar to int
        volatile store int %i.0.0, int* %X
        %indvar.next = add uint %indvar, 1
        %exitcond = seteq uint %indvar.next, %N
        br bool %exitcond, label %return, label %no_exit

return:
        ret void
}

compiles into:

        .text
        .align  4
        .globl  _foo
_foo:
        movl 4(%esp), %eax
        cmpl $1, %eax
        jl LBB_foo_4    # return
LBB_foo_1:      # no_exit.preheader
        xorl %ecx, %ecx
LBB_foo_2:      # no_exit
        movl L_X$non_lazy_ptr, %edx
        movl %ecx, (%edx)
        incl %ecx
        cmpl %eax, %ecx
        jne LBB_foo_2   # no_exit
LBB_foo_3:      # return.loopexit
LBB_foo_4:      # return
        ret

We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
rematerialization is implemented. This can be accomplished with 1) a
target-dependent LICM pass or 2) making the SelectionDAG represent the whole
function.

//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.

//===---------------------------------------------------------------------===//

We are generating far worse code than gcc:

volatile short X, Y;

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

LBB1_1: # entry.bb_crit_edge
        xorl %ecx, %ecx
        xorw %dx, %dx
LBB1_2: # bb
        movl L_X$non_lazy_ptr, %esi
        movw %cx, (%esi)
        movl L_Y$non_lazy_ptr, %esi
        movw %dx, (%esi)
        addw $4, %dx
        incl %ecx
        cmpl %eax, %ecx
        jne LBB1_2      # bb

vs.

        xorl %edx, %edx
        movl L_X$non_lazy_ptr-"L00000000001$pb"(%ebx), %esi
        movl L_Y$non_lazy_ptr-"L00000000001$pb"(%ebx), %ecx
L4:
        movw %dx, (%esi)
        leal 0(,%edx,4), %eax
        movw %ax, (%ecx)
        addl $1, %edx
        cmpl %edx, %edi
        jne L4

This is due to the lack of post-regalloc LICM.

//===---------------------------------------------------------------------===//

Teach the coalescer to coalesce vregs of different register classes, e.g.
FR32/FR64 to VR128.

//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))
    return 3;
  else
    return 4;
}

_test:
        movl 8(%esp), %eax
        movaps (%eax), %xmm0
        movl 4(%esp), %eax
        movaps (%eax), %xmm1
        comiss %xmm0, %xmm1
        setae %al
        movzbl %al, %ecx
        movl $3, %eax
        movl $4, %edx
        cmpl $0, %ecx
        cmove %edx, %eax
        ret

Note that the setae, movzbl, cmpl, and cmove can be replaced with a single
cmovae. There are a number of issues: 1) We are introducing a setcc between
the result of the intrinsic call and the select. 2) The intrinsic is expected
to produce an i32 value, so an any_extend (which becomes a zero_extend) is
added.

We probably need some kind of target DAG combine hook to fix this.

//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case we do worse on PPC.

//===---------------------------------------------------------------------===//

For this:

int test(int a)
{
  return a * 3;
}

We currently emit:
        imull $3, 4(%esp), %eax

Perhaps this is what we really should generate? Is imull three or four
cycles? Note: ICC generates this:
        movl 4(%esp), %eax
        leal (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity. The former is
more "complex" because it folds a load, so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
should always try to match LEA first since the LEA matching code does some
estimate to determine whether the match is profitable.

However, if we care more about code size, then imull is better. It's two bytes
shorter than movl + leal.

On a Pentium M, both variants have the same characteristics with regard
to throughput; however, the multiplication has a latency of four cycles, as
opposed to two cycles for the movl+lea variant.

//===---------------------------------------------------------------------===//

__builtin_ffs codegen is messy.

int ffs_(unsigned X) { return __builtin_ffs(X); }

LLVM produces:
ffs_:
        movl 4(%esp), %ecx
        bsfl %ecx, %eax
        movl $32, %edx
        cmove %edx, %eax
        incl %eax
        xorl %edx, %edx
        testl %ecx, %ecx
        cmove %edx, %eax
        ret

vs. gcc:

_ffs_:
        movl $-1, %edx
        bsfl 4(%esp), %eax
        cmove %edx, %eax
        addl $1, %eax
        ret

Another example of __builtin_ffs (use predsimplify to eliminate a select):

int foo (unsigned long j) {
  if (j)
    return __builtin_ffs (j) - 1;
  else
    return 0;
}

//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.

//===---------------------------------------------------------------------===//

define i32 @foo(i32* %a, i32 %t) {
entry:
        br label %cond_true

cond_true:              ; preds = %cond_true, %entry
        %x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ]          ; <i32> [#uses=3]
        %t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ]    ; <i32> [#uses=1]
        %tmp2 = getelementptr i32* %a, i32 %x.0.0               ; <i32*> [#uses=1]
        %tmp3 = load i32* %tmp2         ; <i32> [#uses=1]
        %tmp5 = add i32 %t_addr.0.0, %x.0.0             ; <i32> [#uses=1]
        %tmp7 = add i32 %tmp5, %tmp3            ; <i32> [#uses=2]
        %tmp9 = add i32 %x.0.0, 1               ; <i32> [#uses=2]
        %tmp = icmp sgt i32 %tmp9, 39           ; <i1> [#uses=1]
        br i1 %tmp, label %bb12, label %cond_true

bb12:           ; preds = %cond_true
        ret i32 %tmp7
}

is pessimized by -loop-reduce and -indvars

//===---------------------------------------------------------------------===//

u32 to float conversion improvement:

float uint32_2_float( unsigned u ) {
  float fl = (int) (u & 0xffff);
  float fh = (int) (u >> 16);
  fh *= 0x1.0p16f;
  return fh + fl;
}

00000000        subl    $0x04,%esp
00000003        movl    0x08(%esp,1),%eax
00000007        movl    %eax,%ecx
00000009        shrl    $0x10,%ecx
0000000c        cvtsi2ss        %ecx,%xmm0
00000010        andl    $0x0000ffff,%eax
00000015        cvtsi2ss        %eax,%xmm1
00000019        mulss   0x00000078,%xmm0
00000021        addss   %xmm1,%xmm0
00000025        movss   %xmm0,(%esp,1)
0000002a        flds    (%esp,1)
0000002d        addl    $0x04,%esp
00000030        ret

//===---------------------------------------------------------------------===//

When using the fastcc ABI, align the stack slot of a double argument on an
8-byte boundary to improve performance.

//===---------------------------------------------------------------------===//

Codegen:

int f(int a, int b) {
  if (a == 4 || a == 6)
    b++;
  return b;
}

as:

or eax, 2
cmp eax, 6
jz label

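The trick, at the source level (a sketch): 4 and 6 differ only in bit 1, so
a == 4 || a == 6 collapses to (a | 2) == 6.

int f_combined(int a, int b) {
  if ((a | 2) == 6)
    b++;
  return b;
}
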
//===---------------------------------------------------------------------===//

GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
simplifications for integer "x cmp y ? a : b". For example, instead of:

int G;
void f(int X, int Y) {
  G = X < 0 ? 14 : 13;
}

compiling to:

_f:
        movl $14, %eax
        movl $13, %ecx
        movl 4(%esp), %edx
        testl %edx, %edx
        cmovl %eax, %ecx
        movl %ecx, _G
        ret

it could be:
_f:
        movl 4(%esp), %eax
        sarl $31, %eax
        notl %eax
        addl $14, %eax
        movl %eax, _G
        ret

etc.

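The branchless computation in gcc's version, written out in C (a sketch;
assumes >> on int is an arithmetic shift): sarl yields x < 0 ? -1 : 0, notl
flips that to 0 or -1, and addl $14 rebases it onto 14 or 13.

int g(int x) {
  return ~(x >> 31) + 14;  /* x < 0 ? 14 : 13 */
}
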
Another is:
int usesbb(unsigned int a, unsigned int b) {
  return (a < b ? -1 : 0);
}
to:
_usesbb:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        sbbl %eax, %eax
        ret

instead of:
_usesbb:
        xorl %eax, %eax
        movl 8(%esp), %ecx
        cmpl %ecx, 4(%esp)
        movl $4294967295, %ecx
        cmovb %ecx, %eax
        ret

//===---------------------------------------------------------------------===//

Currently we don't have elimination of redundant stack manipulations. Consider
the code:

int %main() {
entry:
        call fastcc void %test1( )
        call fastcc void %test2( sbyte* cast (void ()* %test1 to sbyte*) )
        ret int 0
}

declare fastcc void %test1()

declare fastcc void %test2(sbyte*)

This currently compiles to:

        subl $16, %esp
        call _test5
        addl $12, %esp
        subl $16, %esp
        movl $_test5, (%esp)
        call _test6
        addl $12, %esp

The add/sub pair is really unneeded here.

//===---------------------------------------------------------------------===//

Consider the expansion of:

define i32 @test3(i32 %X) {
        %tmp1 = urem i32 %X, 255
        ret i32 %tmp1
}

Currently it compiles to:

...
        movl $2155905153, %ecx
        movl 8(%esp), %esi
        movl %esi, %eax
        mull %ecx
...

This could be "reassociated" into:

        movl $2155905153, %eax
        movl 8(%esp), %ecx
        mull %ecx

to avoid the copy. In fact, the existing two-address stuff would do this
except that mul isn't a commutative 2-addr instruction. I guess this has
to be done at isel time based on the #uses of the mul?

//===---------------------------------------------------------------------===//

Make sure the instruction which starts a loop does not cross a cacheline
boundary. This requires knowing the exact length of each machine instruction.
That is somewhat complicated, but doable. Example 256.bzip2:

In the new trace, the hot loop has an instruction which crosses a cacheline
boundary. In addition to potential cache misses, this can't help decoding as I
imagine there has to be some kind of complicated decoder reset and realignment
to grab the bytes from the next cacheline.

532  532 0x3cfc movb (1809(%esp, %esi), %bl   <<<--- spans 2 64 byte lines
942  942 0x3d03 movl %dh, (1809(%esp, %esi)
937  937 0x3d0a incl %esi
3    3   0x3d0b cmpb %bl, %dl
27   27  0x3d0d jnz 0x000062db <main+11707>

//===---------------------------------------------------------------------===//

In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.

//===---------------------------------------------------------------------===//

This could be a single 16-bit load.

int f(char *p) {
  if ((p[0] == 1) & (p[1] == 2)) return 1;
  return 0;
}

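A sketch of the combined form (assuming unaligned 16-bit loads are fine, as on
x86, and a little-endian layout, so bytes {1, 2} read back as 0x0201):

#include <string.h>

int f_combined(char *p) {
  unsigned short v;
  memcpy(&v, p, 2);    /* one 16-bit load */
  return v == 0x0201;  /* low byte p[0] == 1, high byte p[1] == 2 */
}
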
//===---------------------------------------------------------------------===//

We should inline lrintf and probably other libc functions.

//===---------------------------------------------------------------------===//

Start using the flags more. For example, compile:

int add_zf(int *x, int y, int a, int b) {
  if ((*x += y) == 0)
    return a;
  else
    return b;
}

to:
        addl    %esi, (%rdi)
        movl    %edx, %eax
        cmovne  %ecx, %eax
        ret
instead of:

_add_zf:
        addl (%rdi), %esi
        movl %esi, (%rdi)
        testl %esi, %esi
        cmove %edx, %ecx
        movl %ecx, %eax
        ret

and:

int add_zf(int *x, int y, int a, int b) {
  if ((*x + y) < 0)
    return a;
  else
    return b;
}

to:

add_zf:
        addl    (%rdi), %esi
        movl    %edx, %eax
        cmovns  %ecx, %eax
        ret

instead of:

_add_zf:
        addl (%rdi), %esi
        testl %esi, %esi
        cmovs %edx, %ecx
        movl %ecx, %eax
        ret

//===---------------------------------------------------------------------===//

These two functions have identical effects:

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}

We currently compile them to:

_f:
        movl 4(%esp), %eax
        movl %eax, %ecx
        incl %ecx
        movl 8(%esp), %edx
        cmpl %edx, %ecx
        jne LBB1_2      #UnifiedReturnBlock
LBB1_1: #cond_true
        addl $2, %eax
        ret
LBB1_2: #UnifiedReturnBlock
        movl %ecx, %eax
        ret
_f2:
        movl 4(%esp), %eax
        movl %eax, %ecx
        incl %ecx
        cmpl 8(%esp), %ecx
        sete %cl
        movzbl %cl, %ecx
        leal 1(%ecx,%eax), %eax
        ret

both of which are inferior to GCC's:

_f:
        movl 4(%esp), %edx
        leal 1(%edx), %eax
        addl $2, %edx
        cmpl 8(%esp), %eax
        cmove %edx, %eax
        ret
_f2:
        movl 4(%esp), %eax
        addl $1, %eax
        xorl %edx, %edx
        cmpl 8(%esp), %eax
        sete %dl
        addl %edx, %eax
        ret

//===---------------------------------------------------------------------===//

This code:

void test(int X) {
  if (X) abort();
}

is currently compiled to:

_test:
        subl $12, %esp
        cmpl $0, 16(%esp)
        jne LBB1_1
        addl $12, %esp
        ret
LBB1_1:
        call L_abort$stub

It would be better to produce:

_test:
        subl $12, %esp
        cmpl $0, 16(%esp)
        jne L_abort$stub
        addl $12, %esp
        ret

This can be applied to any no-return function call that takes no arguments etc.
Alternatively, the stack save/restore logic could be shrink-wrapped, producing
something like this:

_test:
        cmpl $0, 4(%esp)
        jne LBB1_1
        ret
LBB1_1:
        subl $12, %esp
        call L_abort$stub

Both are useful in different situations. Finally, it could be shrink-wrapped
and tail called, like this:

_test:
        cmpl $0, 4(%esp)
        jne LBB1_1
        ret
LBB1_1:
        pop %eax   # realign stack.
        call L_abort$stub

Though this probably isn't worth it.

//===---------------------------------------------------------------------===//

We need to teach the codegen to convert two-address INC instructions to LEA
when the flags are dead (likewise dec). For example, on X86-64, compile:

int foo(int A, int B) {
  return A+1;
}

to:

_foo:
        leal    1(%edi), %eax
        ret

instead of:

_foo:
        incl %edi
        movl %edi, %eax
        ret

Another example is:

;; X's live range extends beyond the shift, so the register allocator
;; cannot coalesce it with Y.  Because of this, a copy needs to be
;; emitted before the shift to save the register value before it is
;; clobbered.  However, this copy is not needed if the register
;; allocator turns the shift into an LEA.  This also occurs for ADD.

; Check that the shift gets turned into an LEA.
; RUN: llvm-as < %s | llc -march=x86 -x86-asm-syntax=intel | \
; RUN:   not grep {mov E.X, E.X}

@G = external global i32                ; <i32*> [#uses=3]

define i32 @test1(i32 %X, i32 %Y) {
        %Z = add i32 %X, %Y             ; <i32> [#uses=1]
        volatile store i32 %Y, i32* @G
        volatile store i32 %Z, i32* @G
        ret i32 %X
}

define i32 @test2(i32 %X) {
        %Z = add i32 %X, 1              ; <i32> [#uses=1]
        volatile store i32 %Z, i32* @G
        ret i32 %X
}

//===---------------------------------------------------------------------===//

Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
a neg instead of a sub instruction. Consider:

int test(char X) { return 7-X; }

we currently produce:
_test:
        movl $7, %eax
        movsbl 4(%esp), %ecx
        subl %ecx, %eax
        ret

We would use one fewer register if codegen'd as:

        movsbl 4(%esp), %eax
        neg %eax
        add $7, %eax
        ret

Note that this isn't beneficial if the load can be folded into the sub. In
this case, we want a sub:

int test(int X) { return 7-X; }
_test:
        movl $7, %eax
        subl 4(%esp), %eax
        ret

//===---------------------------------------------------------------------===//

Leaf functions that require one 4-byte spill slot have a prolog like this:

_foo:
        pushl   %esi
        subl    $4, %esp
...
and an epilog like this:
        addl    $4, %esp
        popl    %esi
        ret

It would be smaller, and potentially faster, to push eax on entry and to
pop into a dummy register instead of using addl/subl of esp. Just don't pop
into any return registers :)

//===---------------------------------------------------------------------===//

The X86 backend should fold (branch (or (setcc, setcc))) into multiple
branches. We generate really poor code for:

double testf(double a) {
  return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

For example, the entry BB is:

_testf:
        subl    $20, %esp
        pxor    %xmm0, %xmm0
        movsd   24(%esp), %xmm1
        ucomisd %xmm0, %xmm1
        setnp   %al
        sete    %cl
        testb   %cl, %al
        jne     LBB1_5  # UnifiedReturnBlock
LBB1_1: # cond_true

it would be better to replace the last four instructions with:

        jp LBB1_1
        je LBB1_5
LBB1_1:

We also codegen the inner ?: into a diamond:

        cvtss2sd        LCPI1_0(%rip), %xmm2
        cvtss2sd        LCPI1_1(%rip), %xmm3
        ucomisd %xmm1, %xmm0
        ja      LBB1_3  # cond_true
LBB1_2: # cond_true
        movapd  %xmm3, %xmm2
LBB1_3: # cond_true
        movapd  %xmm2, %xmm0
        ret

We should sink the load into xmm3 into the LBB1_2 block. This should
be pretty easy, and will nuke all the copies.

//===---------------------------------------------------------------------===//

This:
  #include <algorithm>
  inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
  { return std::make_pair(a + b, a + b < a); }
  bool no_overflow(unsigned a, unsigned b)
  { return !full_add(a, b).second; }

should compile to:

_Z11no_overflowjj:
        addl    %edi, %esi
        setae   %al
        ret

(FIXME: that code looks wrong; bool return is normally defined as zext.)

on x86-64, not:

__Z11no_overflowjj:
        addl    %edi, %esi
        cmpl    %edi, %esi
        setae   %al
        movzbl  %al, %eax
        ret

//===---------------------------------------------------------------------===//

Re-materialize MOV32r0 etc. with xor instead of changing them to moves if the
condition register is dead. xor reg, reg is shorter than mov reg, 0.

//===---------------------------------------------------------------------===//

We aren't matching RMW instructions aggressively
enough. Here's a reduced testcase (more in PR1160):

define void @test(i32* %huge_ptr, i32* %target_ptr) {
        %A = load i32* %huge_ptr                ; <i32> [#uses=1]
        %B = load i32* %target_ptr              ; <i32> [#uses=1]
        %C = or i32 %A, %B                      ; <i32> [#uses=1]
        store i32 %C, i32* %target_ptr
        ret void
}

$ llvm-as < t.ll | llc -march=x86-64

_test:
        movl (%rdi), %eax
        orl (%rsi), %eax
        movl %eax, (%rsi)
        ret

That should be something like:

_test:
        movl (%rdi), %eax
        orl %eax, (%rsi)
        ret

//===---------------------------------------------------------------------===//

The following code:

bb114.preheader:                ; preds = %cond_next94
        %tmp231232 = sext i16 %tmp62 to i32             ; <i32> [#uses=1]
        %tmp233 = sub i32 32, %tmp231232                ; <i32> [#uses=1]
        %tmp245246 = sext i16 %tmp65 to i32             ; <i32> [#uses=1]
        %tmp252253 = sext i16 %tmp68 to i32             ; <i32> [#uses=1]
        %tmp254 = sub i32 32, %tmp252253                ; <i32> [#uses=1]
        %tmp553554 = bitcast i16* %tmp37 to i8*         ; <i8*> [#uses=2]
        %tmp583584 = sext i16 %tmp98 to i32             ; <i32> [#uses=1]
        %tmp585 = sub i32 32, %tmp583584                ; <i32> [#uses=1]
        %tmp614615 = sext i16 %tmp101 to i32            ; <i32> [#uses=1]
        %tmp621622 = sext i16 %tmp104 to i32            ; <i32> [#uses=1]
        %tmp623 = sub i32 32, %tmp621622                ; <i32> [#uses=1]
        br label %bb114

produces:

LBB3_5: # bb114.preheader
        movswl  -68(%ebp), %eax
        movl    $32, %ecx
        movl    %ecx, -80(%ebp)
        subl    %eax, -80(%ebp)
        movswl  -52(%ebp), %eax
        movl    %ecx, -84(%ebp)
        subl    %eax, -84(%ebp)
        movswl  -70(%ebp), %eax
        movl    %ecx, -88(%ebp)
        subl    %eax, -88(%ebp)
        movswl  -50(%ebp), %eax
        subl    %eax, %ecx
        movl    %ecx, -76(%ebp)
        movswl  -42(%ebp), %eax
        movl    %eax, -92(%ebp)
        movswl  -66(%ebp), %eax
        movl    %eax, -96(%ebp)
        movw    $0, -98(%ebp)

This appears to be bad because the RA is not folding the store to the stack
slot into the movl. The above instructions could be:
        movl    $32, -80(%ebp)
...
        movl    $32, -84(%ebp)
...
This seems like a cross between remat and spill folding.

This has redundant subtractions of %eax from a stack slot. However, %ecx
doesn't change, so we could simply subtract %eax from %ecx first and then use
%ecx (or vice-versa).

//===---------------------------------------------------------------------===//

This code:

        %tmp659 = icmp slt i16 %tmp654, 0               ; <i1> [#uses=1]
        br i1 %tmp659, label %cond_true662, label %cond_next715

produces this:

        testw   %cx, %cx
        movswl  %cx, %esi
        jns     LBB4_109        # cond_next715

Shark tells us that using %cx in the testw instruction is sub-optimal. It
suggests using the 32-bit register (which is what ICC uses).

//===---------------------------------------------------------------------===//

We compile this:

void compare (long long foo) {
  if (foo < 4294967297LL)
    abort();
}

to:

compare:
        subl    $4, %esp
        cmpl    $0, 8(%esp)
        setne   %al
        movzbw  %al, %ax
        cmpl    $1, 12(%esp)
        setg    %cl
        movzbw  %cl, %cx
        cmove   %ax, %cx
        testb   $1, %cl
        jne     .LBB1_2 # UnifiedReturnBlock
.LBB1_1:        # ifthen
        call    abort
.LBB1_2:        # UnifiedReturnBlock
        addl    $4, %esp
        ret

(also really horrible code on ppc). This is due to the expand code for 64-bit
compares. GCC produces multiple branches, which is much nicer:

compare:
        subl    $12, %esp
        movl    20(%esp), %edx
        movl    16(%esp), %eax
        decl    %edx
        jle     .L7
.L5:
        addl    $12, %esp
        ret
        .p2align 4,,7
.L7:
        jl      .L4
        cmpl    $0, %eax
        .p2align 4,,8
        ja      .L5
.L4:
        .p2align 4,,9
        call    abort

//===---------------------------------------------------------------------===//

Tail call optimization improvements: Tail call optimization currently
pushes all arguments on the top of the stack (their normal place for
non-tail call optimized calls) that source from the caller's arguments
or that source from a virtual register (also possibly sourcing from
caller's arguments).
This is done to prevent overwriting of parameters (see example
below) that might be used later.

example:

int callee(int32, int64);
int caller(int32 arg1, int32 arg2) {
  int64 local = arg2 * 2;
  return callee(arg2, (int64)local);
}

[arg1]          [!arg2 no longer valid since we moved local onto it]
[arg2]      ->  [(int64)
[RETADDR]        local  ]

Moving arg1 onto the stack slot of the callee function would overwrite
arg2 of the caller.

Possible optimizations:

 - Analyse the actual parameters of the callee to see which would
   overwrite a caller parameter which is used by the callee and only
   push them onto the top of the stack.

   int callee (int32 arg1, int32 arg2);
   int caller (int32 arg1, int32 arg2) {
     return callee(arg1,arg2);
   }

   Here we don't need to write any variables to the top of the stack
   since they don't overwrite each other.

   int callee (int32 arg1, int32 arg2);
   int caller (int32 arg1, int32 arg2) {
     return callee(arg2,arg1);
   }

   Here we need to push the arguments because they overwrite each
   other.

//===---------------------------------------------------------------------===//

main ()
{
  int i = 0;
  unsigned long int z = 0;

  do {
    z -= 0x00004000;
    i++;
    if (i > 0x00040000)
      abort ();
  } while (z > 0);
  exit (0);
}

gcc compiles this to:

_main:
        subl    $28, %esp
        xorl    %eax, %eax
        jmp     L2
L3:
        cmpl    $262144, %eax
        je      L10
L2:
        addl    $1, %eax
        cmpl    $262145, %eax
        jne     L3
        call    L_abort$stub
L10:
        movl    $0, (%esp)
        call    L_exit$stub

llvm:

_main:
        subl    $12, %esp
        movl    $1, %eax
        movl    $16384, %ecx
LBB1_1: # bb
        cmpl    $262145, %eax
        jge     LBB1_4  # cond_true
LBB1_2: # cond_next
        incl    %eax
        addl    $4294950912, %ecx
        cmpl    $16384, %ecx
        jne     LBB1_1  # bb
LBB1_3: # bb11
        xorl    %eax, %eax
        addl    $12, %esp
        ret
LBB1_4: # cond_true
        call    L_abort$stub

1. LSR should rewrite the first cmp with induction variable %ecx.
2. DAG combiner should fold
        leal    1(%eax), %edx
        cmpl    $262145, %edx
   =>
        cmpl    $262144, %eax

//===---------------------------------------------------------------------===//

define i64 @test(double %X) {
        %Y = fptosi double %X to i64
        ret i64 %Y
}

compiles to:

_test:
        subl    $20, %esp
        movsd   24(%esp), %xmm0
        movsd   %xmm0, 8(%esp)
        fldl    8(%esp)
        fisttpll        (%esp)
        movl    4(%esp), %edx
        movl    (%esp), %eax
        addl    $20, %esp
        #FP_REG_KILL
        ret

This should just fldl directly from the input stack slot.

//===---------------------------------------------------------------------===//

This code:
int foo (int x) { return (x & 65535) | 255; }

should compile into:

_foo:
        movzwl  4(%esp), %eax
        orl     $255, %eax
        ret

instead of:
_foo:
        movl    $255, %eax
        orl     4(%esp), %eax
        andl    $65535, %eax
        ret

//===---------------------------------------------------------------------===//

We're codegen'ing multiply of long longs inefficiently:

unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) {
  return arg1 * arg2;
}

We compile to (fomit-frame-pointer):

_LLM:
        pushl   %esi
        movl    8(%esp), %ecx
        movl    16(%esp), %esi
        movl    %esi, %eax
        mull    %ecx
        imull   12(%esp), %esi
        addl    %edx, %esi
        imull   20(%esp), %ecx
        movl    %esi, %edx
        addl    %ecx, %edx
        popl    %esi
        ret

This looks like a scheduling deficiency and lack of remat of the load from
the argument area. ICC apparently produces:

        movl    8(%esp), %ecx
        imull   12(%esp), %ecx
        movl    16(%esp), %eax
        imull   4(%esp), %eax
        addl    %eax, %ecx
        movl    4(%esp), %eax
        mull    12(%esp)
        addl    %ecx, %edx
        ret

Note that it remat'd loads from 4(esp) and 12(esp). See this GCC PR:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236

//===---------------------------------------------------------------------===//

We can fold a store into "zeroing a reg". Instead of:

xorl %eax, %eax
movl %eax, 124(%esp)

we should get:

movl $0, 124(%esp)

if the flags of the xor are dead.

Likewise, we isel "x<<1" into "add reg,reg". If reg is spilled, this should
be folded into: shl [mem], 1

//===---------------------------------------------------------------------===//

This testcase misses a read/modify/write opportunity (from PR1425):

void vertical_decompose97iH1(int *b0, int *b1, int *b2, int width){
  int i;
  for(i=0; i<width; i++)
    b1[i] += (1*(b0[i] + b2[i])+0)>>0;
}

We compile it down to:

LBB1_2: # bb
        movl    (%esi,%edi,4), %ebx
        addl    (%ecx,%edi,4), %ebx
        addl    (%edx,%edi,4), %ebx
        movl    %ebx, (%ecx,%edi,4)
        incl    %edi
        cmpl    %eax, %edi
        jne     LBB1_2  # bb

the inner loop should add to the memory location (%ecx,%edi,4), saving
a mov. Something like:

        movl    (%esi,%edi,4), %ebx
        addl    (%edx,%edi,4), %ebx
        addl    %ebx, (%ecx,%edi,4)

Here is another interesting example:

void vertical_compose97iH1(int *b0, int *b1, int *b2, int width){
  int i;
  for(i=0; i<width; i++)
    b1[i] -= (1*(b0[i] + b2[i])+0)>>0;
}

We miss the r/m/w opportunity here by using 2 subs instead of an add+sub[mem]:

LBB9_2: # bb
        movl    (%ecx,%edi,4), %ebx
        subl    (%esi,%edi,4), %ebx
        subl    (%edx,%edi,4), %ebx
        movl    %ebx, (%ecx,%edi,4)
        incl    %edi
        cmpl    %eax, %edi
        jne     LBB9_2  # bb

Additionally, LSR should rewrite the exit condition of these loops to use
a stride-4 IV, which would allow all the scales in the loop to go away.
This would result in smaller code and more efficient microops.

//===---------------------------------------------------------------------===//

In SSE mode, we turn abs and neg into a load from the constant pool plus a xor
or and instruction, for example:

        xorpd   LCPI1_0, %xmm2

However, if xmm2 gets spilled, we end up with really ugly code like this:

        movsd   (%esp), %xmm0
        xorpd   LCPI1_0, %xmm0
        movsd   %xmm0, (%esp)

Since we 'know' that this is a 'neg', we can actually "fold" the spill into
the neg/abs instruction, turning it into an *integer* operation, like this:

        xorl 2147483648, [mem+4]     ## 2147483648 = (1 << 31)

you could also use xorb, but xorl is less likely to lead to a partial register
stall. Here is a contrived testcase:

double a, b, c;
void test(double *P) {
  double X = *P;
  a = X;
  bar();
  X = -X;
  b = X;
  bar();
  c = X;
}

//===---------------------------------------------------------------------===//

Handling llvm.memory.barrier on pre-SSE2 CPUs should generate:

lock ; mov %esp, %esp

//===---------------------------------------------------------------------===//

The code generated on x86 for checking for signed overflow of a multiply done
the obvious way is much longer than it needs to be.

int x(int a, int b) {
  long long prod = (long long)a*b;
  return prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1);
}

See PR2053 for more details.

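An equivalent predicate that is easier to map onto the flags (a sketch,
assuming wrap-around conversion to int): the multiply overflows exactly when
the full product differs from its 32-bit truncation, which is what imull's
overflow flag already reports.

int x2(int a, int b) {
  long long prod = (long long)a*b;
  return (int)prod != prod;
}
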
//===---------------------------------------------------------------------===//

We should investigate using cdq/cltd (effect: edx = sar eax, 31)
more aggressively; it should cost the same as a move+shift on any modern
processor, but it's a lot shorter. Downside is that it puts more
pressure on register allocation because it has fixed operands.

Example:
int abs(int x) {return x < 0 ? -x : x;}

gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.:
abs:
        movl    4(%esp), %eax
        cltd
        xorl    %edx, %eax
        subl    %edx, %eax
        ret

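The identity behind that sequence, in C (a sketch; assumes arithmetic right
shift):

int abs_branchless(int x) {
  int m = x >> 31;     /* the mask cltd computes into %edx: x < 0 ? -1 : 0 */
  return (x ^ m) - m;  /* xor+sub negates x exactly when m == -1 */
}
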
//===---------------------------------------------------------------------===//

Consider:
int test(unsigned long a, unsigned long b) { return -(a < b); }

We currently compile this to:

define i32 @test(i32 %a, i32 %b) nounwind {
        %tmp3 = icmp ult i32 %a, %b             ; <i1> [#uses=1]
        %tmp34 = zext i1 %tmp3 to i32           ; <i32> [#uses=1]
        %tmp5 = sub i32 0, %tmp34               ; <i32> [#uses=1]
        ret i32 %tmp5
}

and

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setb %al
        movzbl %al, %eax
        negl %eax
        ret

Several deficiencies here. First, we should instcombine zext+neg into sext:

define i32 @test2(i32 %a, i32 %b) nounwind {
        %tmp3 = icmp ult i32 %a, %b             ; <i1> [#uses=1]
        %tmp34 = sext i1 %tmp3 to i32           ; <i32> [#uses=1]
        ret i32 %tmp34
}

However, before we can do that, we have to fix the bad codegen that we get for
sext from bool:

_test2:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setb %al
        movzbl %al, %eax
        shll $31, %eax
        sarl $31, %eax
        ret

This code should be at least as good as the code above. Once this is fixed, we
can optimize this specific case even more to:

        movl 8(%esp), %eax
        xorl %ecx, %ecx
        cmpl %eax, 4(%esp)
        sbbl %ecx, %ecx

//===---------------------------------------------------------------------===//

Take the following code (from
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541):

extern unsigned char first_one[65536];
int FirstOnet(unsigned long long arg1)
{
  if (arg1 >> 48)
    return (first_one[arg1 >> 48]);
  return 0;
}

The following code is currently generated:
FirstOnet:
        movl    8(%esp), %eax
        cmpl    $65536, %eax
        movl    4(%esp), %ecx
        jb      .LBB1_2 # UnifiedReturnBlock
.LBB1_1:        # ifthen
        shrl    $16, %eax
        movzbl  first_one(%eax), %eax
        ret
.LBB1_2:        # UnifiedReturnBlock
        xorl    %eax, %eax
        ret

There are a few possible improvements here:
1. We should be able to eliminate the dead load into %ecx.
2. We could change the "movl 8(%esp), %eax" into
   "movzwl 10(%esp), %eax"; this lets us change the cmpl
   into a testl, which is shorter, and eliminate the shift.

We could also in theory eliminate the branch by using a conditional
for the address of the load, but that seems unlikely to be worthwhile
in general.

//===---------------------------------------------------------------------===//

We compile this function:

define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext %d) nounwind {
entry:
        %tmp2 = icmp eq i8 %d, 0                ; <i1> [#uses=1]
        br i1 %tmp2, label %bb7, label %bb

bb:             ; preds = %entry
        %tmp6 = add i32 %b, %a          ; <i32> [#uses=1]
        ret i32 %tmp6

bb7:            ; preds = %entry
        %tmp10 = sub i32 %a, %c         ; <i32> [#uses=1]
        ret i32 %tmp10
}

to:

_foo:
        cmpb    $0, 16(%esp)
        movl    12(%esp), %ecx
        movl    8(%esp), %eax
        movl    4(%esp), %edx
        je      LBB1_2  # bb7
LBB1_1: # bb
        addl    %edx, %eax
        ret
LBB1_2: # bb7
        movl    %edx, %eax
        subl    %ecx, %eax
        ret

The coalescer could coalesce "edx" with "eax" to avoid the movl in LBB1_2
if it commuted the addl in LBB1_1.

//===---------------------------------------------------------------------===//

See rdar://4653682.

From flops:

LBB1_15:        # bb310
        cvtss2sd        LCPI1_0, %xmm1
        addsd   %xmm1, %xmm0
        movsd   176(%esp), %xmm2
        mulsd   %xmm0, %xmm2
        movapd  %xmm2, %xmm3
        mulsd   %xmm3, %xmm3
        movapd  %xmm3, %xmm4
        mulsd   LCPI1_23, %xmm4
        addsd   LCPI1_24, %xmm4
        mulsd   %xmm3, %xmm4
        addsd   LCPI1_25, %xmm4
        mulsd   %xmm3, %xmm4
        addsd   LCPI1_26, %xmm4
        mulsd   %xmm3, %xmm4
        addsd   LCPI1_27, %xmm4
        mulsd   %xmm3, %xmm4
        addsd   LCPI1_28, %xmm4
        mulsd   %xmm3, %xmm4
        addsd   %xmm1, %xmm4
        mulsd   %xmm2, %xmm4
        movsd   152(%esp), %xmm1
        addsd   %xmm4, %xmm1
        movsd   %xmm1, 152(%esp)
        incl    %eax
        cmpl    %eax, %esi
        jge     LBB1_15 # bb310
LBB1_16:        # bb358.loopexit
        movsd   152(%esp), %xmm0
        addsd   %xmm0, %xmm0
        addsd   LCPI1_22, %xmm0
        movsd   %xmm0, 152(%esp)

Rather than spilling the result of the last addsd in the loop, we should have
inserted a copy to split the interval (one for the duration of the loop, one
extending to the fall through). The register pressure in the loop isn't high
enough to warrant the spill.

Also check why xmm7 is not used at all in the function.

//===---------------------------------------------------------------------===//

Legalize loses track of the fact that bools are always zero extended when in
memory. This causes us to compile abort_gzip (from 164.gzip) from:

target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
target triple = "i386-apple-darwin8"
@in_exit.4870.b = internal global i1 false              ; <i1*> [#uses=2]
define fastcc void @abort_gzip() noreturn nounwind {
entry:
        %tmp.b.i = load i1* @in_exit.4870.b             ; <i1> [#uses=1]
        br i1 %tmp.b.i, label %bb.i, label %bb4.i
bb.i:           ; preds = %entry
        tail call void @exit( i32 1 ) noreturn nounwind
        unreachable
bb4.i:          ; preds = %entry
        store i1 true, i1* @in_exit.4870.b
        tail call void @exit( i32 1 ) noreturn nounwind
        unreachable
}
declare void @exit(i32) noreturn nounwind

into:

_abort_gzip:
        subl    $12, %esp
        movb    _in_exit.4870.b, %al
        notb    %al
        testb   $1, %al
        jne     LBB1_2  ## bb4.i
LBB1_1: ## bb.i
        ...

//===---------------------------------------------------------------------===//

We compile:

int test(int x, int y) {
  return x-y-1;
}

into (-m64):

_test:
        decl    %edi
        movl    %edi, %eax
        subl    %esi, %eax
        ret

it would be better to codegen as: x+~y  (notl+addl)

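The rewritten source form, for reference (a sketch; in two's complement
~y == -y-1, so x + ~y == x - y - 1):

int test2(int x, int y) {
  return x + ~y;
}
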
//===---------------------------------------------------------------------===//

This code:

int foo(const char *str,...)
{
 __builtin_va_list a; int x;
 __builtin_va_start(a,str); x = __builtin_va_arg(a,int); __builtin_va_end(a);
 return x;
}

gets compiled into this on x86-64:
        subq    $200, %rsp
        movaps  %xmm7, 160(%rsp)
        movaps  %xmm6, 144(%rsp)
        movaps  %xmm5, 128(%rsp)
        movaps  %xmm4, 112(%rsp)
        movaps  %xmm3, 96(%rsp)
        movaps  %xmm2, 80(%rsp)
        movaps  %xmm1, 64(%rsp)
        movaps  %xmm0, 48(%rsp)
        movq    %r9, 40(%rsp)
        movq    %r8, 32(%rsp)
        movq    %rcx, 24(%rsp)
        movq    %rdx, 16(%rsp)
        movq    %rsi, 8(%rsp)
        leaq    (%rsp), %rax
        movq    %rax, 192(%rsp)
        leaq    208(%rsp), %rax
        movq    %rax, 184(%rsp)
        movl    $48, 180(%rsp)
        movl    $8, 176(%rsp)
        movl    176(%rsp), %eax
        cmpl    $47, %eax
        jbe     .LBB1_3 # bb
.LBB1_1:        # bb3
        movq    184(%rsp), %rcx
        leaq    8(%rcx), %rax
        movq    %rax, 184(%rsp)
.LBB1_2:        # bb4
        movl    (%rcx), %eax
        addq    $200, %rsp
        ret
.LBB1_3:        # bb
        movl    %eax, %ecx
        addl    $8, %eax
        addq    192(%rsp), %rcx
        movl    %eax, 176(%rsp)
        jmp     .LBB1_2 # bb4

gcc 4.3 generates:
        subq    $96, %rsp
.LCFI0:
        leaq    104(%rsp), %rax
        movq    %rsi, -80(%rsp)
        movl    $8, -120(%rsp)
        movq    %rax, -112(%rsp)
        leaq    -88(%rsp), %rax
        movq    %rax, -104(%rsp)
        movl    $8, %eax
        cmpl    $48, %eax
        jb      .L6
        movq    -112(%rsp), %rdx
        movl    (%rdx), %eax
        addq    $96, %rsp
        ret
        .p2align 4,,10
        .p2align 3
.L6:
        mov     %eax, %edx
        addq    -104(%rsp), %rdx
        addl    $8, %eax
        movl    %eax, -120(%rsp)
        movl    (%rdx), %eax
        addq    $96, %rsp
        ret

and it gets compiled into this on x86:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $4, %esp
        leal    12(%ebp), %eax
        movl    %eax, -4(%ebp)
        leal    16(%ebp), %eax
        movl    %eax, -4(%ebp)
        movl    12(%ebp), %eax
        addl    $4, %esp
        popl    %ebp
        ret

gcc 4.3 generates:
        pushl   %ebp
        movl    %esp, %ebp
        movl    12(%ebp), %eax
        popl    %ebp
        ret

//===---------------------------------------------------------------------===//

Teach tblgen not to check bitconvert source type in some cases. This allows us
to consolidate the following patterns in X86InstrMMX.td:

def : Pat<(v2i32 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
                                                  (iPTR 0))))),
          (v2i32 (MMX_MOVDQ2Qrr VR128:$src))>;
def : Pat<(v4i16 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
                                                  (iPTR 0))))),
          (v4i16 (MMX_MOVDQ2Qrr VR128:$src))>;
def : Pat<(v8i8 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
                                                 (iPTR 0))))),
          (v8i8 (MMX_MOVDQ2Qrr VR128:$src))>;

There are other cases in various td files.

//===---------------------------------------------------------------------===//

Take something like the following on x86-32:
unsigned a(unsigned long long x, unsigned y) {return x % y;}

We currently generate a libcall, but we really shouldn't: the expansion is
shorter and likely faster than the libcall. The expected code is something
like the following:

        movl    12(%ebp), %eax
        movl    16(%ebp), %ecx
        xorl    %edx, %edx
        divl    %ecx
        movl    8(%ebp), %eax
        divl    %ecx
        movl    %edx, %eax
        ret

A similar code sequence works for division.
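
The division counterpart that should get the same two-divide expansion
(keeping the quotient pieces rather than the remainder) would be, e.g.:

unsigned b(unsigned long long x, unsigned y) {return x / y;}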

//===---------------------------------------------------------------------===//
1818//===---------------------------------------------------------------------===//