//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

Add MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (combination of MUL and MULH[SU] into one node). Add this to
X86, and make the dag combiner produce it when needed. This will eliminate one
imul from the code generated for:

long long test(long long X, long long Y) { return X*Y; }

by using the EAX result from the mul. We should add a similar node for
DIVREM.

Another case is:

long long test(int X, int Y) { return (long long)X*Y; }

... which should only be one imul instruction.

This can be done with a custom expander, but it would be nice to move this to
generic code.

//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
  return X/Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224

//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.

One better solution for 1LL << x is:
 xorl %eax, %eax
 xorl %edx, %edx
 testb $32, %cl
 sete %al
 setne %dl
 sall %cl, %eax
 sall %cl, %edx

But that requires good 8-bit subreg support.

64-bit shifts (in general) expand to really bad code. Instead of using
cmovs, we should expand to a conditional branch like GCC produces.

//===---------------------------------------------------------------------===//

Compile this:
_Bool f(_Bool a) { return a!=1; }

into:
 movzbl %dil, %eax
 xorl $1, %eax
 ret

//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure. E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?
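
For illustration only (a made-up example, not from the original note), the kind
of code where this matters: a 16-bit add writes only the low half of the
register, and a later full-width read can stall on processors that track
partial-register writes:

int inc16(unsigned short *p) {
  unsigned short t = *p + 1;   /* 16-bit op writes only the low half (%ax) */
  return t;                    /* later 32-bit read of %eax may incur a partial register stall */
}

Doing the arithmetic at i32 and only narrowing where a 16-bit value is actually
stored would avoid the partial write.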

//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and hint to the register
allocator. Delay codegen until post register allocation.

//===---------------------------------------------------------------------===//

Count leading zeros and count trailing zeros:

int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }

$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
clz:
 bsr %eax, DWORD PTR [%esp+4]
 xor %eax, 31
 ret
ctz:
 bsf %eax, DWORD PTR [%esp+4]
 ret

However, check that these are defined for 0 and 32. Our intrinsics are, GCC's
aren't.

Another example (use predsimplify to eliminate a select):

int foo (unsigned long j) {
  if (j)
    return __builtin_ffs (j) - 1;
  else
    return 0;
}

//===---------------------------------------------------------------------===//

Use push/pop instructions in prolog/epilog sequences instead of stores off
ESP (certain code size win, perf win on some [which?] processors).
Also, it appears icc uses push for parameter passing. Need to investigate.

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 due to only updating some processor
flags.
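
A trivial illustration (hypothetical, not from a test case) of where the choice
arises; inc does not update CF, so on the P4 a later flags consumer can pay a
partial-flags merge penalty:

int bump(int x) {
  return x + 1;   /* may be emitted as either "incl" or "addl $1" */
}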

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
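
As an illustration (hypothetical example, not from the original note), a case
where the load naturally appears on the LHS of the compare:

int cmp_load_lhs(int *p, int x) {
  return *p < x;   /* (setlt (load p), x); folding the load as the cmp's memory
                      operand requires swapping the operands and using setgt */
}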

//===---------------------------------------------------------------------===//

How about intrinsics? An example is:
 *res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));

compiles to:
 pmuludq (%eax), %xmm0
 movl 8(%esp), %eax
 movdqa (%eax), %xmm1
 pmulhuw %xmm0, %xmm1

The transformation probably requires an X86-specific pass or a DAG combiner
target-specific hook.

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
 movl 8(%esp), %eax
 cmpl %eax, 4(%esp)
 setl %al
 movzbl %al, %eax
 ret

On some processors (which ones?), it is more efficient to do this:

_test:
 movl 8(%esp), %ebx
 xor %eax, %eax
 cmpl %ebx, 4(%esp)
 setl %al
 ret

Doing this correctly is tricky though, as the xor clobbers the flags.

//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:

void setbit(int *target, int bit) {
  *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
  *target &= ~(1 << bit);
}

//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

 movl $16843009, 4(%edx)
 movl $16843009, (%edx)
 movw $257, 8(%edx)

It might be better to generate

 movl $16843009, %eax
 movl %eax, 4(%edx)
 movl %eax, (%edx)
 movw %ax, 8(%edx)

when we can spare a register. It reduces code size.
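
For reference, a sketch of the source this refers to (assumed, not spelled out
in the note); 16843009 is 0x01010101 and 257 is 0x0101:

#include <string.h>
void fill10(char *p) {
  memset(p, 1, 10);   /* 10 bytes of 0x01: two 4-byte stores plus one 2-byte store */
}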

//===---------------------------------------------------------------------===//

Evaluate the best way to codegen sdiv X, (2^C). For X/8, we currently
get this:

int %test1(int %X) {
 %Y = div int %X, 8
 ret int %Y
}

_test1:
 movl 4(%esp), %eax
 movl %eax, %ecx
 sarl $31, %ecx
 shrl $29, %ecx
 addl %ecx, %eax
 sarl $3, %eax
 ret

GCC knows several different ways to codegen it, one of which is this:

_test1:
 movl 4(%esp), %eax
 cmpl $-1, %eax
 leal 7(%eax), %ecx
 cmovle %ecx, %eax
 sarl $3, %eax
 ret

which is probably slower, but it's interesting at least :)

//===---------------------------------------------------------------------===//

The first BB of this code:

declare bool %foo()
int %bar() {
 %V = call bool %foo()
 br bool %V, label %T, label %F
T:
 ret int 1
F:
 call bool %foo()
 ret int 12
}

compiles to:

_bar:
 subl $12, %esp
 call L_foo$stub
 xorb $1, %al
 testb %al, %al
 jne LBB_bar_2 # F

It would be better to emit "cmp %al, 1" than a xor and test.

//===---------------------------------------------------------------------===//

Enable X86InstrInfo::convertToThreeAddress().

//===---------------------------------------------------------------------===//

We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and rep/movsl.
We should leave these as libcalls for everything over a much lower threshold,
since libc is hand tuned for medium and large mem ops (avoiding RFO for large
stores, TLB preheating, etc).
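
For example (the size and the exact threshold here are illustrative, not from
the note):

#include <string.h>
void blit(char *dst, const char *src) {
  memcpy(dst, src, 1 << 20);   /* 1MB copy: better left to libc's tuned memcpy */
}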

//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
 x * copysign(1.0, y) * copysign(1.0, z)
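
Written out as a function (illustrative only):

#include <math.h>
double xsign(double x, double y, double z) {
  return x * copysign(1.0, y) * copysign(1.0, z);   /* only the signs of y and z matter */
}

The result is just x with its sign bit xored with the sign bits of y and z, so
ideally this reduces to a couple of sign-bit operations rather than two multiplies.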

//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.
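
A sketch of the integer form we would like to get (hand-written here for
illustration; assumes IEEE-754 doubles):

#include <string.h>
double copysign_intload(double x, double *y) {
  unsigned long long xbits, ybits;
  memcpy(&xbits, &x, 8);
  memcpy(&ybits, y, 8);   /* integer load of *y; only the sign bit is needed */
  xbits = (xbits & ~(1ULL << 63)) | (ybits & (1ULL << 63));
  memcpy(&x, &xbits, 8);
  return x;
}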

//===---------------------------------------------------------------------===//

%X = weak global int 0

void %foo(int %N) {
 %N = cast int %N to uint
 %tmp.24 = setgt int %N, 0
 br bool %tmp.24, label %no_exit, label %return

no_exit:
 %indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
 %i.0.0 = cast uint %indvar to int
 volatile store int %i.0.0, int* %X
 %indvar.next = add uint %indvar, 1
 %exitcond = seteq uint %indvar.next, %N
 br bool %exitcond, label %return, label %no_exit

return:
 ret void
}

compiles into:

 .text
 .align 4
 .globl _foo
_foo:
 movl 4(%esp), %eax
 cmpl $1, %eax
 jl LBB_foo_4 # return
LBB_foo_1: # no_exit.preheader
 xorl %ecx, %ecx
LBB_foo_2: # no_exit
 movl L_X$non_lazy_ptr, %edx
 movl %ecx, (%edx)
 incl %ecx
 cmpl %eax, %ecx
 jne LBB_foo_2 # no_exit
LBB_foo_3: # return.loopexit
LBB_foo_4: # return
 ret

We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
rematerialization is implemented. This can be accomplished with 1) a
target-dependent LICM pass or 2) making the SelectionDAG represent the whole
function.

//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.

//===---------------------------------------------------------------------===//

We are generating far worse code than gcc:

volatile short X, Y;

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

LBB1_1: #bb.preheader
 xorl %ecx, %ecx
 xorw %dx, %dx
LBB1_2: #bb
 movl L_X$non_lazy_ptr, %esi
 movw %dx, (%esi)
 movw %dx, %si
 shlw $2, %si
 movl L_Y$non_lazy_ptr, %edi
 movw %si, (%edi)
 incl %ecx
 incw %dx
 cmpl %eax, %ecx
 jne LBB1_2 #bb

vs.

 xorl %edx, %edx
 movl L_X$non_lazy_ptr-"L00000000001$pb"(%ebx), %esi
 movl L_Y$non_lazy_ptr-"L00000000001$pb"(%ebx), %ecx
L4:
 movw %dx, (%esi)
 leal 0(,%edx,4), %eax
 movw %ax, (%ecx)
 addl $1, %edx
 cmpl %edx, %edi
 jne L4

There are 3 issues:

1. Lack of post-regalloc LICM.
2. Poor sub-regclass support, which leads to an inability to promote the 16-bit
   arithmetic op to 32-bit and make use of leal.
3. LSR is unable to reuse the IV for a different type (i16 vs. i32) even though
   the cast would be free.

//===---------------------------------------------------------------------===//

Teach the coalescer to coalesce vregs of different register classes. e.g. FR32 /
FR64 to VR128.

//===---------------------------------------------------------------------===//

mov $reg, 48(%esp)
...
leal 48(%esp), %eax
mov %eax, (%esp)
call _foo

Obviously it would have been better for the first mov (or any op) to store
directly to %esp[0] if there are no other uses.

//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))
    return 3;
  else
    return 4;
}

_test:
 movl 8(%esp), %eax
 movaps (%eax), %xmm0
 movl 4(%esp), %eax
 movaps (%eax), %xmm1
 comiss %xmm0, %xmm1
 setae %al
 movzbl %al, %ecx
 movl $3, %eax
 movl $4, %edx
 cmpl $0, %ecx
 cmove %edx, %eax
 ret

Note that the setae, movzbl, cmpl, and cmove can be replaced with a single
cmovae. There are a number of issues. 1) We are introducing a setcc between the
result of the intrinsic call and the select. 2) The intrinsic is expected to
produce an i32 value, so an any_extend (which becomes a zero_extend) is added.

We probably need some kind of target DAG combine hook to fix this.

//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case we do worse on PPC.

//===---------------------------------------------------------------------===//

If shorter, we should use things like:
movzwl %ax, %eax
instead of:
andl $65535, %EAX

The former can also be used when the two-addressy nature of the 'and' would
require a copy to be inserted (in X86InstrInfo::convertToThreeAddress).
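
A small example (made up, not from the note) where the choice comes up:

unsigned low16(unsigned x) {
  return x & 65535;   /* can be "andl $65535, %eax" or "movzwl %ax, %eax" */
}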

//===---------------------------------------------------------------------===//

Bad codegen:

char foo(int x) { return x; }

_foo:
 movl 4(%esp), %eax
 shll $24, %eax
 sarl $24, %eax
 ret

SIGN_EXTEND_INREG can be implemented as (sext (trunc)) to take advantage of
sub-registers.

//===---------------------------------------------------------------------===//

Consider this:

typedef struct pair { float A, B; } pair;
void pairtest(pair P, float *FP) {
  *FP = P.A+P.B;
}

We currently generate this code with llvmgcc4:

_pairtest:
 movl 8(%esp), %eax
 movl 4(%esp), %ecx
 movd %eax, %xmm0
 movd %ecx, %xmm1
 addss %xmm0, %xmm1
 movl 12(%esp), %eax
 movss %xmm1, (%eax)
 ret

We should be able to generate:
_pairtest:
 movss 4(%esp), %xmm0
 movl 12(%esp), %eax
 addss 8(%esp), %xmm0
 movss %xmm0, (%eax)
 ret

The issue is that llvmgcc4 is forcing the struct to memory, then passing it as
integer chunks. It does this so that structs like {short,short} are passed in
a single 32-bit integer stack slot. We should handle the safe cases above much
more nicely, while still handling the hard cases.

While true in general, in this specific case we could do better by promoting
load int + bitcast to float -> load float. This basically needs alignment info;
the code is already implemented (but disabled) in dag combine.

//===---------------------------------------------------------------------===//

Another instruction selector deficiency:

void %bar() {
 %tmp = load int (int)** %foo
 %tmp = tail call int %tmp( int 3 )
 ret void
}

_bar:
 subl $12, %esp
 movl L_foo$non_lazy_ptr, %eax
 movl (%eax), %eax
 call *%eax
 addl $12, %esp
 ret

The current isel scheme will not allow the load to be folded in the call since
the load's chain result is read by the callseq_start.

//===---------------------------------------------------------------------===//

Don't forget to find a way to squash noop truncates in the JIT environment.

//===---------------------------------------------------------------------===//

Implement anyext in the same manner as truncate; that would allow them to be
eliminated.

//===---------------------------------------------------------------------===//

How about implementing truncate / anyext as a property of machine instruction
operand? i.e. Print as 32-bit super-class register / 16-bit sub-class register.
Do this for the cases where a truncate / anyext is guaranteed to be eliminated.
For IA32 that is truncate from 32 to 16 and anyext from 16 to 32.

//===---------------------------------------------------------------------===//

For this:

int test(int a)
{
  return a * 3;
}

We currently emit:
 imull $3, 4(%esp), %eax

Perhaps this is what we really should generate? Is imull three or four
cycles? Note: ICC generates this:
 movl 4(%esp), %eax
 leal (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity. The former is
more "complex" because it folds a load so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
should always try to match LEA first since the LEA matching code does some
estimate to determine whether the match is profitable.

However, if we care more about code size, then imull is better. It's two bytes
shorter than movl + leal.

//===---------------------------------------------------------------------===//

Implement CTTZ, CTLZ with bsf and bsr.

//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.

//===---------------------------------------------------------------------===//

int %foo(int* %a, int %t) {
entry:
 br label %cond_true

cond_true: ; preds = %cond_true, %entry
 %x.0.0 = phi int [ 0, %entry ], [ %tmp9, %cond_true ]
 %t_addr.0.0 = phi int [ %t, %entry ], [ %tmp7, %cond_true ]
 %tmp2 = getelementptr int* %a, int %x.0.0
 %tmp3 = load int* %tmp2 ; <int> [#uses=1]
 %tmp5 = add int %t_addr.0.0, %x.0.0 ; <int> [#uses=1]
 %tmp7 = add int %tmp5, %tmp3 ; <int> [#uses=2]
 %tmp9 = add int %x.0.0, 1 ; <int> [#uses=2]
 %tmp = setgt int %tmp9, 39 ; <bool> [#uses=1]
 br bool %tmp, label %bb12, label %cond_true

bb12: ; preds = %cond_true
 ret int %tmp7
}

is pessimized by -loop-reduce and -indvars

//===---------------------------------------------------------------------===//

u32 to float conversion improvement:

float uint32_2_float( unsigned u ) {
  float fl = (int) (u & 0xffff);
  float fh = (int) (u >> 16);
  fh *= 0x1.0p16f;
  return fh + fl;
}

00000000 subl $0x04,%esp
00000003 movl 0x08(%esp,1),%eax
00000007 movl %eax,%ecx
00000009 shrl $0x10,%ecx
0000000c cvtsi2ss %ecx,%xmm0
00000010 andl $0x0000ffff,%eax
00000015 cvtsi2ss %eax,%xmm1
00000019 mulss 0x00000078,%xmm0
00000021 addss %xmm1,%xmm0
00000025 movss %xmm0,(%esp,1)
0000002a flds (%esp,1)
0000002d addl $0x04,%esp
00000030 ret

//===---------------------------------------------------------------------===//

When using the fastcc ABI, align the stack slots of arguments of type double on
an 8-byte boundary to improve performance.

//===---------------------------------------------------------------------===//

Codegen:

int f(int a, int b) {
  if (a == 4 || a == 6)
    b++;
  return b;
}


as:

or eax, 2
cmp eax, 6
jz label

//===---------------------------------------------------------------------===//

GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
simplifications for integer "x cmp y ? a : b". For example, instead of:

int G;
void f(int X, int Y) {
  G = X < 0 ? 14 : 13;
}

compiling to:

_f:
 movl $14, %eax
 movl $13, %ecx
 movl 4(%esp), %edx
 testl %edx, %edx
 cmovl %eax, %ecx
 movl %ecx, _G
 ret

it could be:
_f:
 movl 4(%esp), %eax
 sarl $31, %eax
 notl %eax
 addl $14, %eax
 movl %eax, _G
 ret

etc.

//===---------------------------------------------------------------------===//

Currently we don't have elimination of redundant stack manipulations. Consider
the code:

int %main() {
entry:
 call fastcc void %test1( )
 call fastcc void %test2( sbyte* cast (void ()* %test1 to sbyte*) )
 ret int 0
}

declare fastcc void %test1()

declare fastcc void %test2(sbyte*)


This currently compiles to:

 subl $16, %esp
 call _test5
 addl $12, %esp
 subl $16, %esp
 movl $_test5, (%esp)
 call _test6
 addl $12, %esp

The add/sub pair is really unneeded here.

//===---------------------------------------------------------------------===//

We currently compile sign_extend_inreg into two shifts:

long foo(long X) {
  return (long)(signed char)X;
}

becomes:

_foo:
 movl 4(%esp), %eax
 shll $24, %eax
 sarl $24, %eax
 ret

This could be:

_foo:
 movsbl 4(%esp),%eax
 ret

//===---------------------------------------------------------------------===//

Consider the expansion of:

uint %test3(uint %X) {
 %tmp1 = rem uint %X, 255
 ret uint %tmp1
}

Currently it compiles to:

...
 movl $2155905153, %ecx
 movl 8(%esp), %esi
 movl %esi, %eax
 mull %ecx
...

This could be "reassociated" into:

 movl $2155905153, %eax
 movl 8(%esp), %ecx
 mull %ecx

to avoid the copy. In fact, the existing two-address stuff would do this
except that mul isn't a commutative 2-addr instruction. I guess this has
to be done at isel time based on the #uses to mul?

//===---------------------------------------------------------------------===//

Make sure the instruction which starts a loop does not cross a cacheline
boundary. This requires knowing the exact length of each machine instruction.
That is somewhat complicated, but doable. Example 256.bzip2:

In the new trace, the hot loop has an instruction which crosses a cacheline
boundary. In addition to potential cache misses, this can't help decoding as I
imagine there has to be some kind of complicated decoder reset and realignment
to grab the bytes from the next cacheline.

532 532 0x3cfc movb (1809(%esp, %esi), %bl <<<--- spans 2 64 byte lines
942 942 0x3d03 movl %dh, (1809(%esp, %esi)
937 937 0x3d0a incl %esi
3 3 0x3d0b cmpb %bl, %dl
27 27 0x3d0d jnz 0x000062db <main+11707>

//===---------------------------------------------------------------------===//

In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.

//===---------------------------------------------------------------------===//

This could be a single 16-bit load.

int f(char *p) {
  if ((p[0] == 1) & (p[1] == 2)) return 1;
  return 0;
}

//===---------------------------------------------------------------------===//

We should inline lrintf and probably other libc functions.
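
For example (illustrative only):

#include <math.h>
long round_to_long(float x) {
  return lrintf(x);   /* could be a single cvtss2si (SSE) instead of a libcall;
                         cvtss2si honors the current rounding mode, as lrintf requires */
}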

//===---------------------------------------------------------------------===//

Start using the flags more. For example, compile:

int add_zf(int *x, int y, int a, int b) {
  if ((*x += y) == 0)
    return a;
  else
    return b;
}

to:
 addl %esi, (%rdi)
 movl %edx, %eax
 cmovne %ecx, %eax
 ret
instead of:

_add_zf:
 addl (%rdi), %esi
 movl %esi, (%rdi)
 testl %esi, %esi
 cmove %edx, %ecx
 movl %ecx, %eax
 ret

and:

int add_zf(int *x, int y, int a, int b) {
  if ((*x + y) < 0)
    return a;
  else
    return b;
}

to:

add_zf:
 addl (%rdi), %esi
 movl %edx, %eax
 cmovns %ecx, %eax
 ret

instead of:

_add_zf:
 addl (%rdi), %esi
 testl %esi, %esi
 cmovs %edx, %ecx
 movl %ecx, %eax
 ret

//===---------------------------------------------------------------------===//

This:
#include <math.h>
int foo(double X) { return isnan(X); }

compiles to (-m64):

_foo:
 pxor %xmm1, %xmm1
 ucomisd %xmm1, %xmm0
 setp %al
 movzbl %al, %eax
 ret

The pxor is not needed; we could compare the value against itself.

//===---------------------------------------------------------------------===//

These two functions have identical effects:

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}

We currently compile them to:

_f:
 movl 4(%esp), %eax
 movl %eax, %ecx
 incl %ecx
 movl 8(%esp), %edx
 cmpl %edx, %ecx
 jne LBB1_2 #UnifiedReturnBlock
LBB1_1: #cond_true
 addl $2, %eax
 ret
LBB1_2: #UnifiedReturnBlock
 movl %ecx, %eax
 ret
_f2:
 movl 4(%esp), %eax
 movl %eax, %ecx
 incl %ecx
 cmpl 8(%esp), %ecx
 sete %cl
 movzbl %cl, %ecx
 leal 1(%ecx,%eax), %eax
 ret

both of which are inferior to GCC's:

_f:
 movl 4(%esp), %edx
 leal 1(%edx), %eax
 addl $2, %edx
 cmpl 8(%esp), %eax
 cmove %edx, %eax
 ret
_f2:
 movl 4(%esp), %eax
 addl $1, %eax
 xorl %edx, %edx
 cmpl 8(%esp), %eax
 sete %dl
 addl %edx, %eax
 ret

//===---------------------------------------------------------------------===//

This code:

void test(int X) {
  if (X) abort();
}

is currently compiled to:

_test:
 subl $12, %esp
 cmpl $0, 16(%esp)
 jne LBB1_1
 addl $12, %esp
 ret
LBB1_1:
 call L_abort$stub

It would be better to produce:

_test:
 subl $12, %esp
 cmpl $0, 16(%esp)
 jne L_abort$stub
 addl $12, %esp
 ret

This can be applied to any no-return function call that takes no arguments etc.
Alternatively, the stack save/restore logic could be shrink-wrapped, producing
something like this:

_test:
 cmpl $0, 4(%esp)
 jne LBB1_1
 ret
LBB1_1:
 subl $12, %esp
 call L_abort$stub

Both are useful in different situations. Finally, it could be shrink-wrapped
and tail called, like this:

_test:
 cmpl $0, 4(%esp)
 jne LBB1_1
 ret
LBB1_1:
 pop %eax # realign stack.
 call L_abort$stub

Though this probably isn't worth it.

//===---------------------------------------------------------------------===//

We need to teach the codegen to convert two-address INC instructions to LEA
when the flags are dead. For example, on X86-64, compile:

int foo(int A, int B) {
  return A+1;
}

to:

_foo:
 leal 1(%edi), %eax
 ret

instead of:

_foo:
 incl %edi
 movl %edi, %eax
 ret

//===---------------------------------------------------------------------===//

We use push/pop of stack space around calls in situations where we don't have to.
Call to f below produces:
 subl $16, %esp <<<<<
 movl %eax, (%esp)
 call L_f$stub
 addl $16, %esp <<<<<
The stack push/pop can be moved into the prolog/epilog. It does this because
it's building the frame pointer, but this should not be sufficient; only the
use of alloca should cause it to do this.
(There are other issues shown by this code, but this is one.)

typedef struct _range_t {
  float fbias;
  float fscale;
  int ibias;
  int iscale;
  int ishift;
  unsigned char lut[];
} range_t;

struct _decode_t {
  int type:4;
  int unit:4;
  int alpha:8;
  int N:8;
  int bpc:8;
  int bpp:16;
  int skip:8;
  int swap:8;
  const range_t*const*range;
};

typedef struct _decode_t decode_t;

extern int f(const decode_t* decode);

int decode_byte (const decode_t* decode) {
  if (decode->swap != 0)
    return f(decode);
  return 0;
}