//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

Add MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (combination of MUL and MULH[SU] into one node). Add this to
X86, and make the DAG combiner produce it when needed. This will eliminate one
imul from the code generated for:

long long test(long long X, long long Y) { return X*Y; }

by using the EAX result from the mul. We should add a similar node for
DIVREM.

Another case is:

long long test(int X, int Y) { return (long long)X*Y; }

... which should only be one imul instruction.

//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
        return X/Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224

//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
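
As a point of reference, here is a minimal illustration (made up, not taken
from the patch above) of the kind of decomposition involved, for a multiply by
the constant 9:

unsigned mul9(unsigned x) {
  /* x*9 == x*8 + x: one shift and one add instead of an imul */
  return (x << 3) + x;
}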

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.

//===---------------------------------------------------------------------===//

Compile this:
_Bool f(_Bool a) { return a!=1; }

into:
        movzbl  %dil, %eax
        xorl    $1, %eax
        ret

//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure. E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?
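
For illustration only (assumed source, not from a test case), 16-bit arithmetic
like the following writes just the low half of a register, which can stall a
later full-width read on some processors:

short add16(short a, short b) {
  return a + b;   /* the 16-bit add may be emitted as an update of %ax only */
}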

//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and hint to the register
allocator. Delay codegen until post-register allocation.

//===---------------------------------------------------------------------===//

Count leading zeros and count trailing zeros:

int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }

$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
clz:
        bsr %eax, DWORD PTR [%esp+4]
        xor %eax, 31
        ret
ctz:
        bsf %eax, DWORD PTR [%esp+4]
        ret

However, check that these are defined for 0 and 32. Our intrinsics are; GCC's
aren't.

//===---------------------------------------------------------------------===//

Use push/pop instructions in prolog/epilog sequences instead of stores off
ESP (certain code size win, perf win on some [which?] processors).
Also, it appears icc uses push for parameter passing. Need to investigate.

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 because they update only some of the
processor flags.

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The DAG combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
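
A hypothetical example (assumed, not from a test case) of source that produces
the (cmp reg, (load p)) pattern; whether the load folds currently depends on
which side of the compare it ends up on:

int cmp_load(int x, int *p) {
  return x > *p;   /* compare of a register against a loaded value */
}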

//===---------------------------------------------------------------------===//

How about intrinsics? An example is:
  *res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));

compiles to:
        pmuludq (%eax), %xmm0
        movl 8(%esp), %eax
        movdqa (%eax), %xmm1
        pmulhuw %xmm0, %xmm1

The transformation probably requires an X86-specific pass or a DAG combiner
target-specific hook.

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setl %al
        movzbl %al, %eax
        ret

On some processors (which ones?), it is more efficient to do this:

_test:
        movl 8(%esp), %ebx
        xor %eax, %eax
        cmpl %ebx, 4(%esp)
        setl %al
        ret

Doing this correctly is tricky though, as the xor clobbers the flags.

//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:

void setbit(int *target, int bit) {
    *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
    *target &= ~(1 << bit);
}

//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

        movl $16843009, 4(%edx)
        movl $16843009, (%edx)
        movw $257, 8(%edx)

It might be better to generate:

        movl $16843009, %eax
        movl %eax, 4(%edx)
        movl %eax, (%edx)
        movw %ax, 8(%edx)

when we can spare a register. It reduces code size.
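
For reference, a sketch of source (assumed, not from a test case) that produces
the ten-byte memset above; 16843009 is 0x01010101, the dword splat of the byte
0x01:

#include <string.h>

void init10(char *p) {
  memset(p, 1, 10);   /* two dword stores plus one word store of the pattern */
}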

//===---------------------------------------------------------------------===//

Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
get this:

int %test1(int %X) {
        %Y = div int %X, 8
        ret int %Y
}

_test1:
        movl 4(%esp), %eax
        movl %eax, %ecx
        sarl $31, %ecx
        shrl $29, %ecx
        addl %ecx, %eax
        sarl $3, %eax
        ret

GCC knows several different ways to codegen it, one of which is this:

_test1:
        movl    4(%esp), %eax
        cmpl    $-1, %eax
        leal    7(%eax), %ecx
        cmovle  %ecx, %eax
        sarl    $3, %eax
        ret

which is probably slower, but it's interesting at least :)

//===---------------------------------------------------------------------===//

Should generate min/max for stuff like:

void minf(float a, float b, float *X) {
  *X = a <= b ? a : b;
}

Make use of floating point min / max instructions. Perhaps introduce ISD::FMIN
and ISD::FMAX node types?

//===---------------------------------------------------------------------===//

The first BB of this code:

declare bool %foo()
int %bar() {
        %V = call bool %foo()
        br bool %V, label %T, label %F
T:
        ret int 1
F:
        call bool %foo()
        ret int 12
}

compiles to:

_bar:
        subl $12, %esp
        call L_foo$stub
        xorb $1, %al
        testb %al, %al
        jne LBB_bar_2   # F

It would be better to emit "cmp %al, 1" than an xor and test.

//===---------------------------------------------------------------------===//

Enable X86InstrInfo::convertToThreeAddress().

//===---------------------------------------------------------------------===//

We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl. We should leave these as libcalls for everything over a much lower
threshold, since libc is hand tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc.).
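
As a hypothetical example, a call like the following should stay a libcall
rather than being expanded inline:

#include <string.h>

void big_copy(char *dst, const char *src) {
  memcpy(dst, src, 2 * 1024 * 1024);   /* large copy: libc's tuned copy wins */
}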

//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
 x * copysign(1.0, y) * copysign(1.0, z)

//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.
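
A sketch of the idea in C, assuming IEEE single precision and using memcpy for
the bit reinterpretation (this illustrates the transformation, not the intended
final codegen):

#include <string.h>

float copysign_via_int(float x, const float *y) {
  unsigned xi, yi;
  memcpy(&xi, &x, 4);   /* reinterpret the bits of x */
  memcpy(&yi, y, 4);    /* integer load of *y; only its sign bit is needed */
  xi = (xi & 0x7fffffffU) | (yi & 0x80000000U);
  memcpy(&x, &xi, 4);
  return x;
}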

//===---------------------------------------------------------------------===//

%X = weak global int 0

void %foo(int %N) {
        %N = cast int %N to uint
        %tmp.24 = setgt int %N, 0
        br bool %tmp.24, label %no_exit, label %return

no_exit:
        %indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
        %i.0.0 = cast uint %indvar to int
        volatile store int %i.0.0, int* %X
        %indvar.next = add uint %indvar, 1
        %exitcond = seteq uint %indvar.next, %N
        br bool %exitcond, label %return, label %no_exit

return:
        ret void
}

compiles into:

        .text
        .align  4
        .globl  _foo
_foo:
        movl 4(%esp), %eax
        cmpl $1, %eax
        jl LBB_foo_4    # return
LBB_foo_1:      # no_exit.preheader
        xorl %ecx, %ecx
LBB_foo_2:      # no_exit
        movl L_X$non_lazy_ptr, %edx
        movl %ecx, (%edx)
        incl %ecx
        cmpl %eax, %ecx
        jne LBB_foo_2   # no_exit
LBB_foo_3:      # return.loopexit
LBB_foo_4:      # return
        ret

We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
rematerialization is implemented. This can be accomplished with 1) a
target-dependent LICM pass or 2) making the SelectionDAG represent the whole
function.

//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.

//===---------------------------------------------------------------------===//

Teach the coalescer to coalesce vregs of different register classes, e.g. FR32 /
FR64 to VR128.

//===---------------------------------------------------------------------===//

mov $reg, 48(%esp)
...
leal 48(%esp), %eax
mov %eax, (%esp)
call _foo

Obviously it would have been better for the first mov (or any op) to store
directly to (%esp) if there are no other uses.

//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))
    return 3;
  else
    return 4;
}

_test:
        movl 8(%esp), %eax
        movaps (%eax), %xmm0
        movl 4(%esp), %eax
        movaps (%eax), %xmm1
        comiss %xmm0, %xmm1
        setae %al
        movzbl %al, %ecx
        movl $3, %eax
        movl $4, %edx
        cmpl $0, %ecx
        cmove %edx, %eax
        ret

Note that the setae, movzbl, cmpl, cmove can be replaced with a single cmovae.
There are a number of issues. 1) We are introducing a setcc between the result
of the intrinsic call and the select. 2) The intrinsic is expected to produce
an i32 value so an any_extend (which becomes a zero_extend) is added.

We probably need some kind of target DAG combine hook to fix this.

//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case where we do worse on PPC.

//===---------------------------------------------------------------------===//

If shorter, we should use things like:
movzwl %ax, %eax
instead of:
andl $65535, %eax

The former can also be used when the two-address nature of the 'and' would
require a copy to be inserted (in X86InstrInfo::convertToThreeAddress).

//===---------------------------------------------------------------------===//

Bad codegen:

char foo(int x) { return x; }

_foo:
        movl 4(%esp), %eax
        shll $24, %eax
        sarl $24, %eax
        ret

SIGN_EXTEND_INREG can be implemented as (sext (trunc)) to take advantage of
sub-registers.

//===---------------------------------------------------------------------===//

Consider this:

typedef struct pair { float A, B; } pair;
void pairtest(pair P, float *FP) {
        *FP = P.A+P.B;
}

We currently generate this code with llvmgcc4:

_pairtest:
        subl $12, %esp
        movl 20(%esp), %eax
        movl %eax, 4(%esp)
        movl 16(%esp), %eax
        movl %eax, (%esp)
        movss (%esp), %xmm0
        addss 4(%esp), %xmm0
        movl 24(%esp), %eax
        movss %xmm0, (%eax)
        addl $12, %esp
        ret

We should be able to generate:
_pairtest:
        movss 4(%esp), %xmm0
        movl 12(%esp), %eax
        addss 8(%esp), %xmm0
        movss %xmm0, (%eax)
        ret

The issue is that llvmgcc4 is forcing the struct to memory, then passing it as
integer chunks. It does this so that structs like {short,short} are passed in
a single 32-bit integer stack slot. We should handle the safe cases above much
more nicely, while still handling the hard cases.

//===---------------------------------------------------------------------===//

Another instruction selector deficiency:

void %bar() {
        %tmp = load int (int)** %foo
        %tmp = tail call int %tmp( int 3 )
        ret void
}

_bar:
        subl $12, %esp
        movl L_foo$non_lazy_ptr, %eax
        movl (%eax), %eax
        call *%eax
        addl $12, %esp
        ret

The current isel scheme will not allow the load to be folded into the call
since the load's chain result is read by the callseq_start.

//===---------------------------------------------------------------------===//

Don't forget to find a way to squash noop truncates in the JIT environment.

//===---------------------------------------------------------------------===//

Implement anyext in the same manner as truncate so that they can be
eliminated.

//===---------------------------------------------------------------------===//

How about implementing truncate / anyext as a property of a machine instruction
operand? i.e. print as a 32-bit super-class register / 16-bit sub-class
register. Do this for the cases where a truncate / anyext is guaranteed to be
eliminated. For IA32 that is truncate from 32 to 16 and anyext from 16 to 32.

//===---------------------------------------------------------------------===//

For this:

int test(int a)
{
  return a * 3;
}

We currently emit:
        imull $3, 4(%esp), %eax

Perhaps this is what we really should generate? Is imull three or four
cycles? Note: ICC generates this:
        movl 4(%esp), %eax
        leal (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity. The former is
more "complex" because it folds a load, so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
should always try to match LEA first since the LEA matching code does some
estimate to determine whether the match is profitable.

However, if we care more about code size, then imull is better. It's two bytes
shorter than movl + leal.

//===---------------------------------------------------------------------===//

Implement CTTZ, CTLZ with bsf and bsr.

//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.

//===---------------------------------------------------------------------===//

We should handle __attribute__ ((__visibility__ ("hidden"))).
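
For reference, the attribute as it appears in source (standard GCC syntax; the
declarations are made up for illustration):

int hidden_counter __attribute__ ((__visibility__ ("hidden")));
void hidden_helper(void) __attribute__ ((__visibility__ ("hidden")));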

//===---------------------------------------------------------------------===//

int %foo(int* %a, int %t) {
entry:
        br label %cond_true

cond_true:              ; preds = %cond_true, %entry
        %x.0.0 = phi int [ 0, %entry ], [ %tmp9, %cond_true ]          ; <int> [#uses=3]
        %t_addr.0.0 = phi int [ %t, %entry ], [ %tmp7, %cond_true ]    ; <int> [#uses=1]
        %tmp2 = getelementptr int* %a, int %x.0.0               ; <int*> [#uses=1]
        %tmp3 = load int* %tmp2         ; <int> [#uses=1]
        %tmp5 = add int %t_addr.0.0, %x.0.0             ; <int> [#uses=1]
        %tmp7 = add int %tmp5, %tmp3            ; <int> [#uses=2]
        %tmp9 = add int %x.0.0, 1               ; <int> [#uses=2]
        %tmp = setgt int %tmp9, 39              ; <bool> [#uses=1]
        br bool %tmp, label %bb12, label %cond_true

bb12:           ; preds = %cond_true
        ret int %tmp7
}

is pessimized by -loop-reduce and -indvars.

//===---------------------------------------------------------------------===//

Use cpuid to auto-detect CPU features such as SSE, SSE2, and SSE3.
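
A minimal sketch of the detection (assumed code, GCC-style inline asm), using
the standard CPUID leaf-1 feature bits: EDX bit 25 = SSE, EDX bit 26 = SSE2,
ECX bit 0 = SSE3. A real implementation would also need to preserve %ebx when
compiling PIC:

static void detect_sse(int *sse, int *sse2, int *sse3) {
  unsigned eax, ebx, ecx, edx;
  __asm__ __volatile__("cpuid"
                       : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                       : "a"(1));
  *sse  = (edx >> 25) & 1;   /* SSE  */
  *sse2 = (edx >> 26) & 1;   /* SSE2 */
  *sse3 = ecx & 1;           /* SSE3 */
}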

//===---------------------------------------------------------------------===//

u32 to float conversion improvement:

float uint32_2_float( unsigned u ) {
  float fl = (int) (u & 0xffff);
  float fh = (int) (u >> 16);
  fh *= 0x1.0p16f;
  return fh + fl;
}

00000000        subl    $0x04,%esp
00000003        movl    0x08(%esp,1),%eax
00000007        movl    %eax,%ecx
00000009        shrl    $0x10,%ecx
0000000c        cvtsi2ss        %ecx,%xmm0
00000010        andl    $0x0000ffff,%eax
00000015        cvtsi2ss        %eax,%xmm1
00000019        mulss   0x00000078,%xmm0
00000021        addss   %xmm1,%xmm0
00000025        movss   %xmm0,(%esp,1)
0000002a        flds    (%esp,1)
0000002d        addl    $0x04,%esp
00000030        ret

//===---------------------------------------------------------------------===//

When using the fastcc ABI, align the stack slot of a double argument on an
8-byte boundary to improve performance.

//===---------------------------------------------------------------------===//

Codegen:

int f(int a, int b) {
  if (a == 4 || a == 6)
    b++;
  return b;
}

as:

or eax, 2
cmp eax, 6
jz label