//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

Missing features:
  - Support for SSE4: http://www.intel.com/software/penryn
    http://softwarecommunity.intel.com/isn/Downloads/Intel%20SSE4%20Programming%20Reference.pdf
  - Support for 3DNow!
  - Weird ABIs?

11//===---------------------------------------------------------------------===//
12
CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move. The X86
backend knows how to three-addressify this shift, but it appears the register
allocator isn't even asking it to do so in this case. We should investigate
why this isn't happening; it could have a significant impact on other important
cases for X86 as well.
18
19//===---------------------------------------------------------------------===//
20
21This should be one DIV/IDIV instruction, not a libcall:
22
23unsigned test(unsigned long long X, unsigned Y) {
24 return X/Y;
25}
26
27This can be done trivially with a custom legalizer. What about overflow
28though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
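To make the overflow question concrete, here is a hedged illustration (the
function name is made up): a 32-bit DIV divides EDX:EAX by a 32-bit operand
and faults with #DE if the quotient does not fit in 32 bits, so the custom
legalizer has to pick semantics for that case.

unsigned div_overflow_case(void) {
  unsigned long long x = 0x100000000ULL;  /* quotient would be 2^32 */
  unsigned y = 1;
  return (unsigned)(x / y);  /* C truncates to 0; a raw 32-bit DIV would fault */
}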
29
30//===---------------------------------------------------------------------===//
31
32Improvements to the multiply -> shift/add algorithm:
33http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
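As a hedged illustration (not taken from the linked patch), multiplies by
small constants should decompose into LEA/shift/add sequences instead of
imull:

int mul9(int x)  { return x * 9;  }  /* e.g. leal (%eax,%eax,8), %eax */
int mul40(int x) { return x * 40; }  /* e.g. leal (%eax,%eax,4), %eax; shll $3, %eax */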
34
35//===---------------------------------------------------------------------===//
36
37Improve code like this (occurs fairly frequently, e.g. in LLVM):
38long long foo(int x) { return 1LL << x; }
39
40http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
41http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
42http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html
43
44Another useful one would be ~0ULL >> X and ~0ULL << X.
45
46One better solution for 1LL << x is:
47 xorl %eax, %eax
48 xorl %edx, %edx
49 testb $32, %cl
50 sete %al
51 setne %dl
52 sall %cl, %eax
53 sall %cl, %edx
54
55But that requires good 8-bit subreg support.
56
5764-bit shifts (in general) expand to really bad code. Instead of using
58cmovs, we should expand to a conditional branch like GCC produces.
59
60//===---------------------------------------------------------------------===//
61
62Compile this:
63_Bool f(_Bool a) { return a!=1; }
64
65into:
66 movzbl %dil, %eax
67 xorl $1, %eax
68 ret
69
70//===---------------------------------------------------------------------===//
71
72Some isel ideas:
73
1. Dynamic programming based approach when compile time is not an
   issue.
762. Code duplication (addressing mode) during isel.
773. Other ideas from "Register-Sensitive Selection, Duplication, and
78 Sequencing of Instructions".
794. Scheduling for reduced register pressure. E.g. "Minimum Register
80 Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
81 and other related papers.
82 http://citeseer.ist.psu.edu/govindarajan01minimum.html
83
84//===---------------------------------------------------------------------===//
85
86Should we promote i16 to i32 to avoid partial register update stalls?
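A hedged example of what the question is about (the function is made up): if
this is selected as a 16-bit addw, it writes only %ax, and a later read of
%eax can stall on partial-register-update hardware; promoting to i32 would use
a full 32-bit addl.

unsigned short add16(unsigned short a, unsigned short b) {
  return a + b;  /* a 16-bit add writes only %ax; a 32-bit add writes all of %eax */
}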
87
88//===---------------------------------------------------------------------===//
89
Leave any_extend as a pseudo instruction and hint to the register
allocator. Delay codegen until post register allocation.
Note: any_extend is now turned into an INSERT_SUBREG. We still need to teach
the coalescer how to deal with it though.

95//===---------------------------------------------------------------------===//
96
97Count leading zeros and count trailing zeros:
98
99int clz(int X) { return __builtin_clz(X); }
100int ctz(int X) { return __builtin_ctz(X); }
101
102$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
103clz:
104 bsr %eax, DWORD PTR [%esp+4]
105 xor %eax, 31
106 ret
107ctz:
108 bsf %eax, DWORD PTR [%esp+4]
109 ret
110
However, check that these are defined for 0 and 32. Our intrinsics are, GCC's
aren't.
113
114Another example (use predsimplify to eliminate a select):
115
116int foo (unsigned long j) {
117 if (j)
118 return __builtin_ffs (j) - 1;
119 else
120 return 0;
121}
122
123//===---------------------------------------------------------------------===//
124
It appears icc uses push for parameter passing. Need to investigate.
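A hedged sketch of the difference being described (callee/caller are made-up
names): a push-based sequence adjusts %esp and stores in one instruction,
while the mov-to-preallocated-slot style needs the slots set up separately.

void callee(int a, int b);  /* hypothetical */
void caller(void) {
  callee(1, 2);  /* push $2; push $1; call   vs.
                    movl $2, 4(%esp); movl $1, (%esp); call */
}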
126
127//===---------------------------------------------------------------------===//
128
129Only use inc/neg/not instructions on processors where they are faster than
130add/sub/xor. They are slower on the P4 due to only updating some processor
131flags.
132
133//===---------------------------------------------------------------------===//
134
The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
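A hedged example of the pattern in question (the function name is made up):
here the loaded value feeds the LHS of the compare, so (cmp reg, (load p))
does not match directly; canonicalizing the load to the RHS and inverting the
condition (slt -> sgt) would let the load fold into the cmp.

int cmp_load_lhs(int *p, int x) {
  return *p < x;  /* same as x > *p, which puts the load on the RHS */
}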
140
141//===---------------------------------------------------------------------===//
142
143How about intrinsics? An example is:
144 *res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));
145
146compiles to
147 pmuludq (%eax), %xmm0
148 movl 8(%esp), %eax
149 movdqa (%eax), %xmm1
150 pmulhuw %xmm0, %xmm1
151
The transformation probably requires an X86-specific pass or a DAG combiner
target-specific hook.
154
155//===---------------------------------------------------------------------===//
156
157In many cases, LLVM generates code like this:
158
159_test:
160 movl 8(%esp), %eax
161 cmpl %eax, 4(%esp)
162 setl %al
163 movzbl %al, %eax
164 ret
165
166on some processors (which ones?), it is more efficient to do this:
167
168_test:
169 movl 8(%esp), %ebx
170 xor %eax, %eax
171 cmpl %ebx, 4(%esp)
172 setl %al
173 ret
174
175Doing this correctly is tricky though, as the xor clobbers the flags.
176
177//===---------------------------------------------------------------------===//
178
179We should generate bts/btr/etc instructions on targets where they are cheap or
180when codesize is important. e.g., for:
181
182void setbit(int *target, int bit) {
183 *target |= (1 << bit);
184}
185void clearbit(int *target, int bit) {
186 *target &= ~(1 << bit);
187}
188
189//===---------------------------------------------------------------------===//
190
191Instead of the following for memset char*, 1, 10:
192
193 movl $16843009, 4(%edx)
194 movl $16843009, (%edx)
195 movw $257, 8(%edx)
196
197It might be better to generate
198
199 movl $16843009, %eax
200 movl %eax, 4(%edx)
201 movl %eax, (%edx)
	movw %ax, 8(%edx)
203
204when we can spare a register. It reduces code size.
205
206//===---------------------------------------------------------------------===//
207
208Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
209get this:
210
211int %test1(int %X) {
212 %Y = div int %X, 8
213 ret int %Y
214}
215
216_test1:
217 movl 4(%esp), %eax
218 movl %eax, %ecx
219 sarl $31, %ecx
220 shrl $29, %ecx
221 addl %ecx, %eax
222 sarl $3, %eax
223 ret
224
225GCC knows several different ways to codegen it, one of which is this:
226
227_test1:
228 movl 4(%esp), %eax
229 cmpl $-1, %eax
230 leal 7(%eax), %ecx
231 cmovle %ecx, %eax
232 sarl $3, %eax
233 ret
234
235which is probably slower, but it's interesting at least :)
236
237//===---------------------------------------------------------------------===//
238
239The first BB of this code:
240
241declare bool %foo()
242int %bar() {
243 %V = call bool %foo()
244 br bool %V, label %T, label %F
245T:
246 ret int 1
247F:
248 call bool %foo()
249 ret int 12
250}
251
252compiles to:
253
254_bar:
255 subl $12, %esp
256 call L_foo$stub
257 xorb $1, %al
258 testb %al, %al
259 jne LBB_bar_2 # F
260
It would be better to emit "cmpb $1, %al" than an xor and test.
262
263//===---------------------------------------------------------------------===//
264
We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl. We should leave these as libcalls for everything over a much lower
threshold, since libc is hand tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc.)
269
270//===---------------------------------------------------------------------===//
271
272Optimize this into something reasonable:
273 x * copysign(1.0, y) * copysign(1.0, z)
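A hedged sketch of one reasonable lowering (assumes IEEE-754 doubles; the
function name is made up): multiplying by copysign(1.0, y) and copysign(1.0, z)
only ever flips x's sign, so the two multiplies can become a single XOR of
sign bits.

#include <stdint.h>
#include <string.h>

double signflip(double x, double y, double z) {
  uint64_t xb, yb, zb;
  memcpy(&xb, &x, sizeof xb);
  memcpy(&yb, &y, sizeof yb);
  memcpy(&zb, &z, sizeof zb);
  xb ^= (yb ^ zb) & 0x8000000000000000ULL;  /* combine both sign flips */
  memcpy(&x, &xb, sizeof xb);
  return x;
}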
274
275//===---------------------------------------------------------------------===//
276
277Optimize copysign(x, *y) to use an integer load from y.
278
279//===---------------------------------------------------------------------===//
280
281%X = weak global int 0
282
283void %foo(int %N) {
284 %N = cast int %N to uint
285 %tmp.24 = setgt int %N, 0
286 br bool %tmp.24, label %no_exit, label %return
287
288no_exit:
289 %indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
290 %i.0.0 = cast uint %indvar to int
291 volatile store int %i.0.0, int* %X
292 %indvar.next = add uint %indvar, 1
293 %exitcond = seteq uint %indvar.next, %N
294 br bool %exitcond, label %return, label %no_exit
295
296return:
297 ret void
298}
299
300compiles into:
301
302 .text
303 .align 4
304 .globl _foo
305_foo:
306 movl 4(%esp), %eax
307 cmpl $1, %eax
308 jl LBB_foo_4 # return
309LBB_foo_1: # no_exit.preheader
310 xorl %ecx, %ecx
311LBB_foo_2: # no_exit
312 movl L_X$non_lazy_ptr, %edx
313 movl %ecx, (%edx)
314 incl %ecx
315 cmpl %eax, %ecx
316 jne LBB_foo_2 # no_exit
317LBB_foo_3: # return.loopexit
318LBB_foo_4: # return
319 ret
320
We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
rematerialization is implemented. This can be accomplished with 1) a target
dependent LICM pass or 2) making SelectionDAG represent the whole function.
324
325//===---------------------------------------------------------------------===//
326
327The following tests perform worse with LSR:
328
329lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.
330
331//===---------------------------------------------------------------------===//
332
333We are generating far worse code than gcc:
334
335volatile short X, Y;
336
337void foo(int N) {
338 int i;
339 for (i = 0; i < N; i++) { X = i; Y = i*4; }
340}
341
LBB1_1:	# entry.bb_crit_edge
343 xorl %ecx, %ecx
344 xorw %dx, %dx
345LBB1_2: # bb
346 movl L_X$non_lazy_ptr, %esi
347 movw %cx, (%esi)
348 movl L_Y$non_lazy_ptr, %esi
349 movw %dx, (%esi)
350 addw $4, %dx
351 incl %ecx
352 cmpl %eax, %ecx
353 jne LBB1_2 # bb

vs.
356
357 xorl %edx, %edx
358 movl L_X$non_lazy_ptr-"L00000000001$pb"(%ebx), %esi
359 movl L_Y$non_lazy_ptr-"L00000000001$pb"(%ebx), %ecx
360L4:
361 movw %dx, (%esi)
362 leal 0(,%edx,4), %eax
363 movw %ax, (%ecx)
364 addl $1, %edx
365 cmpl %edx, %edi
366 jne L4
367
This is due to the lack of post regalloc LICM.

370//===---------------------------------------------------------------------===//
371
Teach the coalescer to coalesce vregs of different register classes, e.g.
FR32/FR64 to VR128.
374
375//===---------------------------------------------------------------------===//
376
377mov $reg, 48(%esp)
378...
379leal 48(%esp), %eax
380mov %eax, (%esp)
381call _foo
382
Obviously it would have been better for the first mov (or any op) to store
directly to %esp[0] if there are no other uses.
385
386//===---------------------------------------------------------------------===//
387
388Adding to the list of cmp / test poor codegen issues:
389
390int test(__m128 *A, __m128 *B) {
391 if (_mm_comige_ss(*A, *B))
392 return 3;
393 else
394 return 4;
395}
396
397_test:
398 movl 8(%esp), %eax
399 movaps (%eax), %xmm0
400 movl 4(%esp), %eax
401 movaps (%eax), %xmm1
402 comiss %xmm0, %xmm1
403 setae %al
404 movzbl %al, %ecx
405 movl $3, %eax
406 movl $4, %edx
407 cmpl $0, %ecx
408 cmove %edx, %eax
409 ret
410
Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There
are a number of issues. 1) We are introducing a setcc between the result of the
intrinsic call and select. 2) The intrinsic is expected to produce an i32 value
so an any_extend (which becomes a zero_extend) is added.
415
416We probably need some kind of target DAG combine hook to fix this.
417
418//===---------------------------------------------------------------------===//
419
420We generate significantly worse code for this than GCC:
421http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
422http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701
423
424There is also one case we do worse on PPC.
425
426//===---------------------------------------------------------------------===//
427
428If shorter, we should use things like:
429movzwl %ax, %eax
430instead of:
431andl $65535, %EAX
432
433The former can also be used when the two-addressy nature of the 'and' would
434require a copy to be inserted (in X86InstrInfo::convertToThreeAddress).
435
436//===---------------------------------------------------------------------===//
437
Consider this:
439
440typedef struct pair { float A, B; } pair;
441void pairtest(pair P, float *FP) {
442 *FP = P.A+P.B;
443}
444
445We currently generate this code with llvmgcc4:
446
447_pairtest:
448 movl 8(%esp), %eax
449 movl 4(%esp), %ecx
450 movd %eax, %xmm0
451 movd %ecx, %xmm1
452 addss %xmm0, %xmm1
453 movl 12(%esp), %eax
454 movss %xmm1, (%eax)
455 ret
456
457we should be able to generate:
458_pairtest:
459 movss 4(%esp), %xmm0
460 movl 12(%esp), %eax
461 addss 8(%esp), %xmm0
462 movss %xmm0, (%eax)
463 ret
464
465The issue is that llvmgcc4 is forcing the struct to memory, then passing it as
466integer chunks. It does this so that structs like {short,short} are passed in
467a single 32-bit integer stack slot. We should handle the safe cases above much
468nicer, while still handling the hard cases.
469
While true in general, in this specific case we could do better by promoting
load int + bitcast to float -> load float. This basically needs alignment info;
the code is already implemented (but disabled) in dag combine.
473
474//===---------------------------------------------------------------------===//
475
476Another instruction selector deficiency:
477
478void %bar() {
479 %tmp = load int (int)** %foo
480 %tmp = tail call int %tmp( int 3 )
481 ret void
482}
483
484_bar:
485 subl $12, %esp
486 movl L_foo$non_lazy_ptr, %eax
487 movl (%eax), %eax
488 call *%eax
489 addl $12, %esp
490 ret
491
492The current isel scheme will not allow the load to be folded in the call since
493the load's chain result is read by the callseq_start.
494
495//===---------------------------------------------------------------------===//
496
For this:
498
499int test(int a)
500{
501 return a * 3;
502}
503
We currently emit:
	imull $3, 4(%esp), %eax

Perhaps this is what we should really generate. Is imull three or four
cycles? Note: ICC generates this:
509 movl 4(%esp), %eax
510 leal (%eax,%eax,2), %eax
511
512The current instruction priority is based on pattern complexity. The former is
513more "complex" because it folds a load so the latter will not be emitted.
514
515Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
516should always try to match LEA first since the LEA matching code does some
517estimate to determine whether the match is profitable.
518
519However, if we care more about code size, then imull is better. It's two bytes
520shorter than movl + leal.
521
522//===---------------------------------------------------------------------===//
523
Implement CTTZ, CTLZ with bsf and bsr. GCC produces:
525
526int ctz_(unsigned X) { return __builtin_ctz(X); }
527int clz_(unsigned X) { return __builtin_clz(X); }
528int ffs_(unsigned X) { return __builtin_ffs(X); }
529
530_ctz_:
531 bsfl 4(%esp), %eax
532 ret
533_clz_:
534 bsrl 4(%esp), %eax
535 xorl $31, %eax
536 ret
537_ffs_:
538 movl $-1, %edx
539 bsfl 4(%esp), %eax
540 cmove %edx, %eax
541 addl $1, %eax
542 ret

//===---------------------------------------------------------------------===//
545
It appears gcc places string data with linkonce linkage in
547.section __TEXT,__const_coal,coalesced instead of
548.section __DATA,__const_coal,coalesced.
549Take a look at darwin.h, there are other Darwin assembler directives that we
550do not make use of.
551
552//===---------------------------------------------------------------------===//
553
554int %foo(int* %a, int %t) {
555entry:
556 br label %cond_true
557
558cond_true: ; preds = %cond_true, %entry
559 %x.0.0 = phi int [ 0, %entry ], [ %tmp9, %cond_true ]
560 %t_addr.0.0 = phi int [ %t, %entry ], [ %tmp7, %cond_true ]
561 %tmp2 = getelementptr int* %a, int %x.0.0
562 %tmp3 = load int* %tmp2 ; <int> [#uses=1]
563 %tmp5 = add int %t_addr.0.0, %x.0.0 ; <int> [#uses=1]
564 %tmp7 = add int %tmp5, %tmp3 ; <int> [#uses=2]
565 %tmp9 = add int %x.0.0, 1 ; <int> [#uses=2]
566 %tmp = setgt int %tmp9, 39 ; <bool> [#uses=1]
567 br bool %tmp, label %bb12, label %cond_true
568
569bb12: ; preds = %cond_true
570 ret int %tmp7
571}
572
573is pessimized by -loop-reduce and -indvars
574
575//===---------------------------------------------------------------------===//
576
577u32 to float conversion improvement:
578
579float uint32_2_float( unsigned u ) {
580 float fl = (int) (u & 0xffff);
581 float fh = (int) (u >> 16);
582 fh *= 0x1.0p16f;
583 return fh + fl;
584}
585
58600000000 subl $0x04,%esp
58700000003 movl 0x08(%esp,1),%eax
58800000007 movl %eax,%ecx
58900000009 shrl $0x10,%ecx
5900000000c cvtsi2ss %ecx,%xmm0
59100000010 andl $0x0000ffff,%eax
59200000015 cvtsi2ss %eax,%xmm1
59300000019 mulss 0x00000078,%xmm0
59400000021 addss %xmm1,%xmm0
59500000025 movss %xmm0,(%esp,1)
5960000002a flds (%esp,1)
5970000002d addl $0x04,%esp
59800000030 ret
599
600//===---------------------------------------------------------------------===//
601
When using the fastcc ABI, align the stack slot of a double argument on an
8-byte boundary to improve performance.
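A hedged illustration (made-up function; assumes the double ends up in a stack
slot once the register arguments are used up): an 8-byte-aligned slot lets the
double be read with a single cheap aligned load.

double fastcc_arg(int a, int b, int c, double d) {
  return d + 1.0;  /* a 4-byte-aligned slot makes the 8-byte load of d slower */
}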
604
605//===---------------------------------------------------------------------===//
606
607Codegen:
608
609int f(int a, int b) {
610 if (a == 4 || a == 6)
611 b++;
612 return b;
613}
614
615
616as:
617
618or eax, 2
619cmp eax, 6
620jz label
621
622//===---------------------------------------------------------------------===//
623
624GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
625simplifications for integer "x cmp y ? a : b". For example, instead of:
626
627int G;
628void f(int X, int Y) {
629 G = X < 0 ? 14 : 13;
630}
631
632compiling to:
633
634_f:
635 movl $14, %eax
636 movl $13, %ecx
637 movl 4(%esp), %edx
638 testl %edx, %edx
639 cmovl %eax, %ecx
640 movl %ecx, _G
641 ret
642
643it could be:
644_f:
645 movl 4(%esp), %eax
646 sarl $31, %eax
647 notl %eax
648 addl $14, %eax
649 movl %eax, _G
650 ret
651
652etc.
653
Another is:
655int usesbb(unsigned int a, unsigned int b) {
656 return (a < b ? -1 : 0);
657}
658to:
659_usesbb:
660 movl 8(%esp), %eax
661 cmpl %eax, 4(%esp)
662 sbbl %eax, %eax
663 ret
664
665instead of:
666_usesbb:
667 xorl %eax, %eax
668 movl 8(%esp), %ecx
669 cmpl %ecx, 4(%esp)
670 movl $4294967295, %ecx
671 cmovb %ecx, %eax
672 ret
673
//===---------------------------------------------------------------------===//
675
676Currently we don't have elimination of redundant stack manipulations. Consider
677the code:
678
679int %main() {
680entry:
681 call fastcc void %test1( )
682 call fastcc void %test2( sbyte* cast (void ()* %test1 to sbyte*) )
683 ret int 0
684}
685
686declare fastcc void %test1()
687
688declare fastcc void %test2(sbyte*)
689
690
691This currently compiles to:
692
693 subl $16, %esp
694 call _test5
695 addl $12, %esp
696 subl $16, %esp
697 movl $_test5, (%esp)
698 call _test6
699 addl $12, %esp
700
The add/sub pair is really unneeded here.
702
703//===---------------------------------------------------------------------===//
704
Consider the expansion of:
706
707uint %test3(uint %X) {
708 %tmp1 = rem uint %X, 255
709 ret uint %tmp1
710}
711
712Currently it compiles to:
713
714...
715 movl $2155905153, %ecx
716 movl 8(%esp), %esi
717 movl %esi, %eax
718 mull %ecx
719...
720
721This could be "reassociated" into:
722
723 movl $2155905153, %eax
724 movl 8(%esp), %ecx
725 mull %ecx
726
727to avoid the copy. In fact, the existing two-address stuff would do this
728except that mul isn't a commutative 2-addr instruction. I guess this has
729to be done at isel time based on the #uses to mul?
730
731//===---------------------------------------------------------------------===//
732
Make sure the instruction which starts a loop does not cross a cacheline
boundary. This requires knowing the exact length of each machine instruction.
735That is somewhat complicated, but doable. Example 256.bzip2:
736
737In the new trace, the hot loop has an instruction which crosses a cacheline
738boundary. In addition to potential cache misses, this can't help decoding as I
739imagine there has to be some kind of complicated decoder reset and realignment
740to grab the bytes from the next cacheline.
741
742532 532 0x3cfc movb (1809(%esp, %esi), %bl <<<--- spans 2 64 byte lines
743942 942 0x3d03 movl %dh, (1809(%esp, %esi)
744937 937 0x3d0a incl %esi
7453 3 0x3d0b cmpb %bl, %dl
74627 27 0x3d0d jnz 0x000062db <main+11707>
747
748//===---------------------------------------------------------------------===//
749
750In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.
751
752//===---------------------------------------------------------------------===//
753
754This could be a single 16-bit load.
755
756int f(char *p) {
757 if ((p[0] == 1) & (p[1] == 2)) return 1;
758 return 0;
759}
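A hedged sketch of the combined form (assumes a little-endian target; reading
both bytes is safe here since both are already accessed):

#include <string.h>

int f_combined(char *p) {
  unsigned short v;
  memcpy(&v, p, sizeof v);  /* one 16-bit load */
  return v == 0x0201;       /* p[0]==1 and p[1]==2, little-endian */
}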
760
761//===---------------------------------------------------------------------===//
762
763We should inline lrintf and probably other libc functions.
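For instance (a hedged example, not an exhaustive list): with SSE, lrintf is a
single cvtss2si, since both honor the current MXCSR rounding mode, so a call
like this need not go through libc.

#include <math.h>

long round_to_long(float x) {
  return lrintf(x);  /* want: cvtss2si %xmm0, %eax (or %rax) instead of a call */
}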
764
765//===---------------------------------------------------------------------===//
766
767Start using the flags more. For example, compile:
768
769int add_zf(int *x, int y, int a, int b) {
770 if ((*x += y) == 0)
771 return a;
772 else
773 return b;
774}
775
776to:
777 addl %esi, (%rdi)
778 movl %edx, %eax
779 cmovne %ecx, %eax
780 ret
781instead of:
782
783_add_zf:
784 addl (%rdi), %esi
785 movl %esi, (%rdi)
786 testl %esi, %esi
787 cmove %edx, %ecx
788 movl %ecx, %eax
789 ret
790
791and:
792
793int add_zf(int *x, int y, int a, int b) {
794 if ((*x + y) < 0)
795 return a;
796 else
797 return b;
798}
799
800to:
801
802add_zf:
803 addl (%rdi), %esi
804 movl %edx, %eax
805 cmovns %ecx, %eax
806 ret
807
808instead of:
809
810_add_zf:
811 addl (%rdi), %esi
812 testl %esi, %esi
813 cmovs %edx, %ecx
814 movl %ecx, %eax
815 ret
816
817//===---------------------------------------------------------------------===//
818
These two functions have identical effects:
820
821unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
822unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
823
824We currently compile them to:
825
826_f:
827 movl 4(%esp), %eax
828 movl %eax, %ecx
829 incl %ecx
830 movl 8(%esp), %edx
831 cmpl %edx, %ecx
832 jne LBB1_2 #UnifiedReturnBlock
833LBB1_1: #cond_true
834 addl $2, %eax
835 ret
836LBB1_2: #UnifiedReturnBlock
837 movl %ecx, %eax
838 ret
839_f2:
840 movl 4(%esp), %eax
841 movl %eax, %ecx
842 incl %ecx
843 cmpl 8(%esp), %ecx
844 sete %cl
845 movzbl %cl, %ecx
846 leal 1(%ecx,%eax), %eax
847 ret
848
849both of which are inferior to GCC's:
850
851_f:
852 movl 4(%esp), %edx
853 leal 1(%edx), %eax
854 addl $2, %edx
855 cmpl 8(%esp), %eax
856 cmove %edx, %eax
857 ret
858_f2:
859 movl 4(%esp), %eax
860 addl $1, %eax
861 xorl %edx, %edx
862 cmpl 8(%esp), %eax
863 sete %dl
864 addl %edx, %eax
865 ret
866
867//===---------------------------------------------------------------------===//
868
869This code:
870
871void test(int X) {
872 if (X) abort();
873}
874
875is currently compiled to:
876
877_test:
878 subl $12, %esp
879 cmpl $0, 16(%esp)
880 jne LBB1_1
881 addl $12, %esp
882 ret
883LBB1_1:
884 call L_abort$stub
885
886It would be better to produce:
887
888_test:
889 subl $12, %esp
890 cmpl $0, 16(%esp)
891 jne L_abort$stub
892 addl $12, %esp
893 ret
894
895This can be applied to any no-return function call that takes no arguments etc.
896Alternatively, the stack save/restore logic could be shrink-wrapped, producing
897something like this:
898
899_test:
900 cmpl $0, 4(%esp)
901 jne LBB1_1
902 ret
903LBB1_1:
904 subl $12, %esp
905 call L_abort$stub
906
907Both are useful in different situations. Finally, it could be shrink-wrapped
908and tail called, like this:
909
910_test:
911 cmpl $0, 4(%esp)
912 jne LBB1_1
913 ret
914LBB1_1:
915 pop %eax # realign stack.
916 call L_abort$stub
917
918Though this probably isn't worth it.
919
920//===---------------------------------------------------------------------===//
921
922We need to teach the codegen to convert two-address INC instructions to LEA
when the flags are dead (likewise dec). For example, on X86-64, compile:

925int foo(int A, int B) {
926 return A+1;
927}
928
929to:
930
931_foo:
932 leal 1(%edi), %eax
933 ret
934
935instead of:
936
937_foo:
938 incl %edi
939 movl %edi, %eax
940 ret
941
942Another example is:
943
944;; X's live range extends beyond the shift, so the register allocator
945;; cannot coalesce it with Y. Because of this, a copy needs to be
946;; emitted before the shift to save the register value before it is
947;; clobbered. However, this copy is not needed if the register
948;; allocator turns the shift into an LEA. This also occurs for ADD.
949
950; Check that the shift gets turned into an LEA.
951; RUN: llvm-upgrade < %s | llvm-as | llc -march=x86 -x86-asm-syntax=intel | \
952; RUN: not grep {mov E.X, E.X}
953
954%G = external global int
955
956int %test1(int %X, int %Y) {
957 %Z = add int %X, %Y
958 volatile store int %Y, int* %G
959 volatile store int %Z, int* %G
960 ret int %X
961}
962
963int %test2(int %X) {
964 %Z = add int %X, 1 ;; inc
965 volatile store int %Z, int* %G
966 ret int %X
967}
968
969//===---------------------------------------------------------------------===//
970
Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
972a neg instead of a sub instruction. Consider:
973
974int test(char X) { return 7-X; }
975
976we currently produce:
977_test:
978 movl $7, %eax
979 movsbl 4(%esp), %ecx
980 subl %ecx, %eax
981 ret
982
983We would use one fewer register if codegen'd as:
984
985 movsbl 4(%esp), %eax
986 neg %eax
987 add $7, %eax
988 ret
989
990Note that this isn't beneficial if the load can be folded into the sub. In
991this case, we want a sub:
992
993int test(int X) { return 7-X; }
994_test:
995 movl $7, %eax
996 subl 4(%esp), %eax
997 ret
998
999//===---------------------------------------------------------------------===//
1000
1001For code like:
1002phi (undef, x)
1003
1004We get an implicit def on the undef side. If the phi is spilled, we then get:
1005implicitdef xmm1
1006store xmm1 -> stack
1007
1008It should be possible to teach the x86 backend to "fold" the store into the
1009implicitdef, which just deletes the implicit def.
1010
1011These instructions should go away:
1012#IMPLICIT_DEF %xmm1
1013movaps %xmm1, 192(%esp)
1014movaps %xmm1, 224(%esp)
1015movaps %xmm1, 176(%esp)

//===---------------------------------------------------------------------===//
1018
This is a "commutable two-address" register coalescing deficiency:
1020
1021define <4 x float> @test1(<4 x float> %V) {
1022entry:
  %tmp8 = shufflevector <4 x float> %V, <4 x float> undef,
                        <4 x i32> < i32 3, i32 2, i32 1, i32 0 >
  %add = add <4 x float> %tmp8, %V
  ret <4 x float> %add
1027}
1028
1029this codegens to:
1030
1031_test1:
1032 pshufd $27, %xmm0, %xmm1
1033 addps %xmm0, %xmm1
1034 movaps %xmm1, %xmm0
1035 ret
1036
1037instead of:
1038
1039_test1:
1040 pshufd $27, %xmm0, %xmm1
1041 addps %xmm1, %xmm0
1042 ret
1043
//===---------------------------------------------------------------------===//
1045
1046Leaf functions that require one 4-byte spill slot have a prolog like this:
1047
1048_foo:
1049 pushl %esi
1050 subl $4, %esp
1051...
1052and an epilog like this:
1053 addl $4, %esp
1054 popl %esi
1055 ret
1056
1057It would be smaller, and potentially faster, to push eax on entry and to
1058pop into a dummy register instead of using addl/subl of esp. Just don't pop
1059into any return registers :)
1060
//===---------------------------------------------------------------------===//

1063The X86 backend should fold (branch (or (setcc, setcc))) into multiple
1064branches. We generate really poor code for:
1065
1066double testf(double a) {
1067 return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
1068}
1069
1070For example, the entry BB is:
1071
1072_testf:
1073 subl $20, %esp
1074 pxor %xmm0, %xmm0
1075 movsd 24(%esp), %xmm1
1076 ucomisd %xmm0, %xmm1
1077 setnp %al
1078 sete %cl
1079 testb %cl, %al
1080 jne LBB1_5 # UnifiedReturnBlock
1081LBB1_1: # cond_true
1082
1083
1084it would be better to replace the last four instructions with:
1085
1086 jp LBB1_1
1087 je LBB1_5
1088LBB1_1:
1089
1090We also codegen the inner ?: into a diamond:
1091
1092 cvtss2sd LCPI1_0(%rip), %xmm2
1093 cvtss2sd LCPI1_1(%rip), %xmm3
1094 ucomisd %xmm1, %xmm0
1095 ja LBB1_3 # cond_true
1096LBB1_2: # cond_true
1097 movapd %xmm3, %xmm2
1098LBB1_3: # cond_true
1099 movapd %xmm2, %xmm0
1100 ret
1101
1102We should sink the load into xmm3 into the LBB1_2 block. This should
1103be pretty easy, and will nuke all the copies.
1104
//===---------------------------------------------------------------------===//

1107This:
1108 #include <algorithm>
1109 inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
1110 { return std::make_pair(a + b, a + b < a); }
1111 bool no_overflow(unsigned a, unsigned b)
1112 { return !full_add(a, b).second; }
1113
1114Should compile to:
1115
1116
1117 _Z11no_overflowjj:
1118 addl %edi, %esi
1119 setae %al
1120 ret
1121
1122on x86-64, not:
1123
1124__Z11no_overflowjj:
1125 addl %edi, %esi
1126 cmpl %edi, %esi
1127 setae %al
1128 movzbl %al, %eax
1129 ret
1130
1131
//===---------------------------------------------------------------------===//

1134Re-materialize MOV32r0 etc. with xor instead of changing them to moves if the
condition register is dead. xor reg, reg is shorter than mov reg, #0.

1137//===---------------------------------------------------------------------===//
1138
1139We aren't matching RMW instructions aggressively
1140enough. Here's a reduced testcase (more in PR1160):
1141
1142define void @test(i32* %huge_ptr, i32* %target_ptr) {
1143 %A = load i32* %huge_ptr ; <i32> [#uses=1]
1144 %B = load i32* %target_ptr ; <i32> [#uses=1]
1145 %C = or i32 %A, %B ; <i32> [#uses=1]
1146 store i32 %C, i32* %target_ptr
1147 ret void
1148}
1149
1150$ llvm-as < t.ll | llc -march=x86-64
1151
1152_test:
1153 movl (%rdi), %eax
1154 orl (%rsi), %eax
1155 movl %eax, (%rsi)
1156 ret
1157
1158That should be something like:
1159
1160_test:
1161 movl (%rdi), %eax
1162 orl %eax, (%rsi)
1163 ret
1164
1165//===---------------------------------------------------------------------===//
1166
The following code:
1168
bb114.preheader:		; preds = %cond_next94
1170 %tmp231232 = sext i16 %tmp62 to i32 ; <i32> [#uses=1]
1171 %tmp233 = sub i32 32, %tmp231232 ; <i32> [#uses=1]
1172 %tmp245246 = sext i16 %tmp65 to i32 ; <i32> [#uses=1]
1173 %tmp252253 = sext i16 %tmp68 to i32 ; <i32> [#uses=1]
1174 %tmp254 = sub i32 32, %tmp252253 ; <i32> [#uses=1]
1175 %tmp553554 = bitcast i16* %tmp37 to i8* ; <i8*> [#uses=2]
1176 %tmp583584 = sext i16 %tmp98 to i32 ; <i32> [#uses=1]
1177 %tmp585 = sub i32 32, %tmp583584 ; <i32> [#uses=1]
1178 %tmp614615 = sext i16 %tmp101 to i32 ; <i32> [#uses=1]
1179 %tmp621622 = sext i16 %tmp104 to i32 ; <i32> [#uses=1]
1180 %tmp623 = sub i32 32, %tmp621622 ; <i32> [#uses=1]
1181 br label %bb114
1182
1183produces:
1184
LBB3_5:	# bb114.preheader
1186 movswl -68(%ebp), %eax
1187 movl $32, %ecx
1188 movl %ecx, -80(%ebp)
1189 subl %eax, -80(%ebp)
1190 movswl -52(%ebp), %eax
1191 movl %ecx, -84(%ebp)
1192 subl %eax, -84(%ebp)
1193 movswl -70(%ebp), %eax
1194 movl %ecx, -88(%ebp)
1195 subl %eax, -88(%ebp)
1196 movswl -50(%ebp), %eax
1197 subl %eax, %ecx
1198 movl %ecx, -76(%ebp)
1199 movswl -42(%ebp), %eax
1200 movl %eax, -92(%ebp)
1201 movswl -66(%ebp), %eax
1202 movl %eax, -96(%ebp)
1203 movw $0, -98(%ebp)
1204
This appears to be bad because the RA is not folding the store to the stack
1206slot into the movl. The above instructions could be:
1207 movl $32, -80(%ebp)
1208...
1209 movl $32, -84(%ebp)
1210...
1211This seems like a cross between remat and spill folding.
1212
This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't
change, so we could simply subtract %eax from %ecx first and then use %ecx (or
vice-versa).
1216
1217//===---------------------------------------------------------------------===//
1218
For this code:
1220
1221cond_next603: ; preds = %bb493, %cond_true336, %cond_next599
1222 %v.21050.1 = phi i32 [ %v.21050.0, %cond_next599 ], [ %tmp344, %cond_true336 ], [ %v.2, %bb493 ] ; <i32> [#uses=1]
1223 %maxz.21051.1 = phi i32 [ %maxz.21051.0, %cond_next599 ], [ 0, %cond_true336 ], [ %maxz.2, %bb493 ] ; <i32> [#uses=2]
1224 %cnt.01055.1 = phi i32 [ %cnt.01055.0, %cond_next599 ], [ 0, %cond_true336 ], [ %cnt.0, %bb493 ] ; <i32> [#uses=2]
1225 %byteptr.9 = phi i8* [ %byteptr.12, %cond_next599 ], [ %byteptr.0, %cond_true336 ], [ %byteptr.10, %bb493 ] ; <i8*> [#uses=9]
1226 %bitptr.6 = phi i32 [ %tmp5571104.1, %cond_next599 ], [ %tmp4921049, %cond_true336 ], [ %bitptr.7, %bb493 ] ; <i32> [#uses=4]
1227 %source.5 = phi i32 [ %tmp602, %cond_next599 ], [ %source.0, %cond_true336 ], [ %source.6, %bb493 ] ; <i32> [#uses=7]
1228 %tmp606 = getelementptr %struct.const_tables* @tables, i32 0, i32 0, i32 %cnt.01055.1 ; <i8*> [#uses=1]
1229 %tmp607 = load i8* %tmp606, align 1 ; <i8> [#uses=1]
1230
1231We produce this:
1232
1233LBB4_70: # cond_next603
1234 movl -20(%ebp), %esi
1235 movl L_tables$non_lazy_ptr-"L4$pb"(%esi), %esi
1236
1237However, ICC caches this information before the loop and produces this:
1238
1239 movl 88(%esp), %eax #481.12
1240
//===---------------------------------------------------------------------===//

1243This code:
1244
1245 %tmp659 = icmp slt i16 %tmp654, 0 ; <i1> [#uses=1]
1246 br i1 %tmp659, label %cond_true662, label %cond_next715
1247
1248produces this:
1249
1250 testw %cx, %cx
1251 movswl %cx, %esi
1252 jns LBB4_109 # cond_next715
1253
1254Shark tells us that using %cx in the testw instruction is sub-optimal. It
1255suggests using the 32-bit register (which is what ICC uses).
1256
//===---------------------------------------------------------------------===//

1259rdar://5506677 - We compile this:
1260
1261define i32 @foo(double %x) {
1262 %x14 = bitcast double %x to i64 ; <i64> [#uses=1]
1263 %tmp713 = trunc i64 %x14 to i32 ; <i32> [#uses=1]
1264 %tmp8 = and i32 %tmp713, 2147483647 ; <i32> [#uses=1]
1265 ret i32 %tmp8
1266}
1267
1268to:
1269
1270_foo:
1271 subl $12, %esp
1272 fldl 16(%esp)
1273 fstpl (%esp)
1274 movl $2147483647, %eax
1275 andl (%esp), %eax
1276 addl $12, %esp
1277 #FP_REG_KILL
1278 ret
1279
1280It would be much better to eliminate the fldl/fstpl by folding the bitcast
1281into the load SDNode. That would give us:
1282
1283_foo:
1284 movl $2147483647, %eax
1285 andl 4(%esp), %eax
1286 ret
1287
1288//===---------------------------------------------------------------------===//
1289
We compile this:
1291
1292void compare (long long foo) {
1293 if (foo < 4294967297LL)
1294 abort();
1295}
1296
1297to:
1298
1299_compare:
1300 subl $12, %esp
1301 cmpl $0, 16(%esp)
1302 setne %al
1303 movzbw %al, %ax
1304 cmpl $1, 20(%esp)
1305 setg %cl
1306 movzbw %cl, %cx
1307 cmove %ax, %cx
1308 movw %cx, %ax
1309 testb $1, %al
1310 je LBB1_2 # cond_true
1311
1312(also really horrible code on ppc). This is due to the expand code for 64-bit
1313compares. GCC produces multiple branches, which is much nicer:
1314
1315_compare:
1316 pushl %ebp
1317 movl %esp, %ebp
1318 subl $8, %esp
1319 movl 8(%ebp), %eax
1320 movl 12(%ebp), %edx
1321 subl $1, %edx
1322 jg L5
1323L7:
1324 jl L4
1325 cmpl $0, %eax
1326 jbe L4
1327L5:
1328
1329//===---------------------------------------------------------------------===//

Tail call optimization improvements: Tail call optimization currently
pushes onto the top of the stack (their normal place for non-tail-call
optimized calls) all arguments that source from the caller's arguments
or from a virtual register (which may itself source from the caller's
arguments). This is done to prevent overwriting of parameters (see the
example below) that might be used later.

example:

int callee(int32, int64);
int caller(int32 arg1, int32 arg2) {
  int64 local = arg2 * 2;
  return callee(arg2, (int64)local);
}

[arg1]          [!arg2 no longer valid since we moved local onto it]
[arg2]      ->  [(int64)
[RETADDR]        local  ]

Moving arg1 onto the stack slot of the callee would overwrite
arg2 of the caller.

Possible optimizations:

 - Analyse the actual parameters of the callee to see which would
   overwrite a caller parameter which is used by the callee, and only
   push those onto the top of the stack.

   int callee (int32 arg1, int32 arg2);
   int caller (int32 arg1, int32 arg2) {
     return callee(arg1,arg2);
   }

   Here we don't need to write any variables to the top of the stack
   since they don't overwrite each other.

   int callee (int32 arg1, int32 arg2);
   int caller (int32 arg1, int32 arg2) {
     return callee(arg2,arg1);
   }

   Here we need to push the arguments because they overwrite each
   other.

//===---------------------------------------------------------------------===//

1379main ()
1380{
1381 int i = 0;
1382 unsigned long int z = 0;
1383
1384 do {
1385 z -= 0x00004000;
1386 i++;
1387 if (i > 0x00040000)
1388 abort ();
1389 } while (z > 0);
1390 exit (0);
1391}
1392
1393gcc compiles this to:
1394
1395_main:
1396 subl $28, %esp
1397 xorl %eax, %eax
1398 jmp L2
1399L3:
1400 cmpl $262144, %eax
1401 je L10
1402L2:
1403 addl $1, %eax
1404 cmpl $262145, %eax
1405 jne L3
1406 call L_abort$stub
1407L10:
1408 movl $0, (%esp)
1409 call L_exit$stub
1410
1411llvm:
1412
1413_main:
1414 subl $12, %esp
1415 movl $1, %eax
1416 movl $16384, %ecx
1417LBB1_1: # bb
1418 cmpl $262145, %eax
1419 jge LBB1_4 # cond_true
1420LBB1_2: # cond_next
1421 incl %eax
1422 addl $4294950912, %ecx
1423 cmpl $16384, %ecx
1424 jne LBB1_1 # bb
1425LBB1_3: # bb11
1426 xorl %eax, %eax
1427 addl $12, %esp
1428 ret
1429LBB1_4: # cond_true
1430 call L_abort$stub
1431
14321. LSR should rewrite the first cmp with induction variable %ecx.
14332. DAG combiner should fold
1434 leal 1(%eax), %edx
1435 cmpl $262145, %edx
1436 =>
1437 cmpl $262144, %eax
1438
//===---------------------------------------------------------------------===//

1441define i64 @test(double %X) {
1442 %Y = fptosi double %X to i64
1443 ret i64 %Y
1444}
1445
1446compiles to:
1447
1448_test:
1449 subl $20, %esp
1450 movsd 24(%esp), %xmm0
1451 movsd %xmm0, 8(%esp)
1452 fldl 8(%esp)
1453 fisttpll (%esp)
1454 movl 4(%esp), %edx
1455 movl (%esp), %eax
1456 addl $20, %esp
1457 #FP_REG_KILL
1458 ret
1459
This should just fldl directly from the input stack slot.

1462//===---------------------------------------------------------------------===//
1463
1464This code:
1465int foo (int x) { return (x & 65535) | 255; }
1466
1467Should compile into:
1468
1469_foo:
1470 movzwl 4(%esp), %eax
1471 orb $-1, %al ;; 'orl 255' is also fine :)
1472 ret
1473
1474instead of:
1475_foo:
1476 movl $255, %eax
1477 orl 4(%esp), %eax
1478 andl $65535, %eax
1479 ret
1480
//===---------------------------------------------------------------------===//
1482
1483We're missing an obvious fold of a load into imul:
1484
1485int test(long a, long b) { return a * b; }
1486
1487LLVM produces:
1488_test:
1489 movl 4(%esp), %ecx
1490 movl 8(%esp), %eax
1491 imull %ecx, %eax
1492 ret
1493
1494vs:
1495_test:
1496 movl 8(%esp), %eax
1497 imull 4(%esp), %eax
1498 ret
1499
1500//===---------------------------------------------------------------------===//
1501
We can fold a store into "zeroing a reg". Instead of:
1503
1504xorl %eax, %eax
1505movl %eax, 124(%esp)
1506
1507we should get:
1508
1509movl $0, 124(%esp)
1510
1511if the flags of the xor are dead.
1512
//===---------------------------------------------------------------------===//

1515This testcase misses a read/modify/write opportunity (from PR1425):
1516
1517void vertical_decompose97iH1(int *b0, int *b1, int *b2, int width){
1518 int i;
1519 for(i=0; i<width; i++)
1520 b1[i] += (1*(b0[i] + b2[i])+0)>>0;
1521}
1522
1523We compile it down to:
1524
1525LBB1_2: # bb
1526 movl (%esi,%edi,4), %ebx
1527 addl (%ecx,%edi,4), %ebx
1528 addl (%edx,%edi,4), %ebx
1529 movl %ebx, (%ecx,%edi,4)
1530 incl %edi
1531 cmpl %eax, %edi
1532 jne LBB1_2 # bb
1533
1534the inner loop should add to the memory location (%ecx,%edi,4), saving
1535a mov. Something like:
1536
1537 movl (%esi,%edi,4), %ebx
1538 addl (%edx,%edi,4), %ebx
1539 addl %ebx, (%ecx,%edi,4)
1540
Here is another interesting example:
1542
1543void vertical_compose97iH1(int *b0, int *b1, int *b2, int width){
1544 int i;
1545 for(i=0; i<width; i++)
1546 b1[i] -= (1*(b0[i] + b2[i])+0)>>0;
1547}
1548
1549We miss the r/m/w opportunity here by using 2 subs instead of an add+sub[mem]:
1550
1551LBB9_2: # bb
1552 movl (%ecx,%edi,4), %ebx
1553 subl (%esi,%edi,4), %ebx
1554 subl (%edx,%edi,4), %ebx
1555 movl %ebx, (%ecx,%edi,4)
1556 incl %edi
1557 cmpl %eax, %edi
1558 jne LBB9_2 # bb
1559
Additionally, LSR should rewrite the exit condition of these loops to use
a stride-4 IV, which would allow all the scales in the loop to go away.
This would result in smaller code and more efficient microops.
1563
1564//===---------------------------------------------------------------------===//

In SSE mode, we turn abs and neg into a load from the constant pool plus a xor
or and instruction, for example:

	xorpd	LCPI1_0, %xmm2

However, if xmm2 gets spilled, we end up with really ugly code like this:

	movsd	(%esp), %xmm0
	xorpd	LCPI1_0, %xmm0
	movsd	%xmm0, (%esp)

Since we 'know' that this is a 'neg', we can actually "fold" the spill into
the neg/abs instruction, turning it into an *integer* operation, like this:

	xorl 2147483648, [mem+4]     ## 2147483648 = (1 << 31)

you could also use xorb, but xorl is less likely to lead to a partial register
stall.  Here is a contrived testcase:
1584
1585double a, b, c;
1586void test(double *P) {
1587 double X = *P;
1588 a = X;
1589 bar();
1590 X = -X;
1591 b = X;
1592 bar();
1593 c = X;
1594}

//===---------------------------------------------------------------------===//