//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

Add MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (combination of MUL and MULH[SU] into one node). Add this to
X86, and make the DAG combiner produce it when needed. This will eliminate one
imul from the code generated for:

long long test(long long X, long long Y) { return X*Y; }

by using the EAX result from the mul. We should add a similar node for
DIVREM.

Another case is:

long long test(int X, int Y) { return (long long)X*Y; }

... which should only be one imul instruction.

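For reference, the widening case can be a single one-operand imul (a sketch,
assuming the usual cdecl argument layout):

_test:
        movl 4(%esp), %eax
        imull 8(%esp)           # EDX:EAX = X * Y
        ret
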
//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
        return X/Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224

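A sketch of the single-instruction form (assuming the usual cdecl argument
layout; note that divl faults if the quotient does not fit in 32 bits, which
is exactly the overflow question above):

_test:
        movl 4(%esp), %eax      # low half of X
        movl 8(%esp), %edx      # high half of X
        divl 12(%esp)           # EAX = X / Y, EDX = X % Y
        ret
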
//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.

//===---------------------------------------------------------------------===//

Compile this:
_Bool f(_Bool a) { return a!=1; }

into:
        movzbl  %dil, %eax
        xorl    $1, %eax
        ret

//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure. E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?

//===---------------------------------------------------------------------===//

Leave any_extend as pseudo instruction and hint to register
allocator. Delay codegen until post register allocation.

//===---------------------------------------------------------------------===//

Model X86 EFLAGS as a real register to avoid redundant cmp / test. e.g.

        cmpl $1, %eax
        setg %al
        testb %al, %al          # unnecessary
        jne .BB7
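
With EFLAGS modeled, the same block could reuse the flags set by the cmp
(keeping the setg only if its value is needed elsewhere), e.g.:

        cmpl $1, %eax
        setg %al
        jg .BB7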

//===---------------------------------------------------------------------===//

Count leading zeros and count trailing zeros:

int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }

$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
clz:
        bsr     %eax, DWORD PTR [%esp+4]
        xor     %eax, 31
        ret
ctz:
        bsf     %eax, DWORD PTR [%esp+4]
        ret

However, check that these are defined for 0 and 32. Our intrinsics are; GCC's
aren't.

//===---------------------------------------------------------------------===//

Use push/pop instructions in prolog/epilog sequences instead of stores off
ESP (certain code size win, perf win on some [which?] processors).
Also, it appears icc uses push for parameter passing. Need to investigate.
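
For example, for pushing a constant argument (byte counts assume the short
imm8 push encoding; push also folds the ESP adjustment):

        pushl $42               # 2 bytes
vs.
        movl $42, (%esp)        # 7 bytes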

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 due to only updating some processor
flags.

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The DAG combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.

How about intrinsics? An example is:
  *res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));

compiles to
        pmuludq (%eax), %xmm0
        movl 8(%esp), %eax
        movdqa (%eax), %xmm1
        pmulhuw %xmm0, %xmm1

The transformation probably requires an X86-specific pass or a DAG combiner
target-specific hook.

//===---------------------------------------------------------------------===//

The DAG Isel doesn't fold the loads into the adds in this testcase. The
pattern selector does. This is because the chain value of the load gets
selected first, and the loads aren't checking to see if they are only used by
an add.

.ll:

int %test(int* %x, int* %y, int* %z) {
        %X = load int* %x
        %Y = load int* %y
        %Z = load int* %z
        %a = add int %X, %Y
        %b = add int %a, %Z
        ret int %b
}

dag isel:

_test:
        movl 4(%esp), %eax
        movl (%eax), %eax
        movl 8(%esp), %ecx
        movl (%ecx), %ecx
        addl %ecx, %eax
        movl 12(%esp), %ecx
        movl (%ecx), %ecx
        addl %ecx, %eax
        ret

pattern isel:

_test:
        movl 12(%esp), %ecx
        movl 4(%esp), %edx
        movl 8(%esp), %eax
        movl (%eax), %eax
        addl (%edx), %eax
        addl (%ecx), %eax
        ret

This is bad for register pressure, though the dag isel is producing a
better schedule. :)

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setl %al
        movzbl %al, %eax
        ret

On some processors (which ones?), it is more efficient to do this:

_test:
        movl 8(%esp), %ebx
        xor %eax, %eax
        cmpl %ebx, 4(%esp)
        setl %al
        ret

Doing this correctly is tricky though, as the xor clobbers the flags.

//===---------------------------------------------------------------------===//

We should generate 'test' instead of 'cmp' in various cases, e.g.:

bool %test(int %X) {
        %Y = shl int %X, ubyte 1
        %C = seteq int %Y, 0
        ret bool %C
}
bool %test(int %X) {
        %Y = and int %X, 8
        %C = seteq int %Y, 0
        ret bool %C
}

This may just be a matter of using 'test' to write bigger patterns for X86cmp.

An important case is comparison against zero:

if (X == 0) ...

instead of:

        cmpl $0, %eax
        je LBB4_2 # cond_next

use:
        test %eax, %eax
        jz LBB4_2

which is smaller.

//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:

void setbit(int *target, int bit) {
   *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
   *target &= ~(1 << bit);
}

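A sketch of the kind of code this could give (ignoring out-of-range bit
indices, which are undefined in the C code anyway):

setbit:
        movl 4(%esp), %eax
        movl 8(%esp), %ecx
        btsl %ecx, (%eax)
        ret
clearbit:
        movl 4(%esp), %eax
        movl 8(%esp), %ecx
        btrl %ecx, (%eax)
        ret
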
//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

        movl $16843009, 4(%edx)
        movl $16843009, (%edx)
        movw $257, 8(%edx)

It might be better to generate

        movl $16843009, %eax
        movl %eax, 4(%edx)
        movl %eax, (%edx)
        movw %ax, 8(%edx)

when we can spare a register. It reduces code size.

//===---------------------------------------------------------------------===//

Evaluate the best way to codegen sdiv X, (2^C). For X/8, we currently
get this:

int %test1(int %X) {
        %Y = div int %X, 8
        ret int %Y
}

_test1:
        movl 4(%esp), %eax
        movl %eax, %ecx
        sarl $31, %ecx
        shrl $29, %ecx
        addl %ecx, %eax
        sarl $3, %eax
        ret

GCC knows several different ways to codegen it, one of which is this:

_test1:
        movl 4(%esp), %eax
        cmpl $-1, %eax
        leal 7(%eax), %ecx
        cmovle %ecx, %eax
        sarl $3, %eax
        ret

which is probably slower, but it's interesting at least :)

//===---------------------------------------------------------------------===//

Should generate min/max for stuff like:

void minf(float a, float b, float *X) {
  *X = a <= b ? a : b;
}

Make use of floating point min / max instructions. Perhaps introduce ISD::FMIN
and ISD::FMAX node types?

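A sketch of what minss could give us for minf (ignoring the NaN and
signed-zero subtleties that make this transformation tricky in general):

_minf:
        movss 4(%esp), %xmm0
        minss 8(%esp), %xmm0
        movl 12(%esp), %eax
        movss %xmm0, (%eax)
        ret
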
//===---------------------------------------------------------------------===//

The first BB of this code:

declare bool %foo()
int %bar() {
        %V = call bool %foo()
        br bool %V, label %T, label %F
T:
        ret int 1
F:
        call bool %foo()
        ret int 12
}

compiles to:

_bar:
        subl $12, %esp
        call L_foo$stub
        xorb $1, %al
        testb %al, %al
        jne LBB_bar_2   # F

It would be better to emit "cmp %al, 1" than an xor and test.
336
Evan Cheng53f280a2006-02-25 10:04:07 +0000337//===---------------------------------------------------------------------===//
Chris Lattner205065a2006-02-23 05:17:43 +0000338
Evan Cheng53f280a2006-02-25 10:04:07 +0000339Enable X86InstrInfo::convertToThreeAddress().
Evan Chengaafc1412006-02-28 23:38:49 +0000340
341//===---------------------------------------------------------------------===//
342
343Investigate whether it is better to codegen the following
344
345 %tmp.1 = mul int %x, 9
346to
347
348 movl 4(%esp), %eax
349 leal (%eax,%eax,8), %eax
350
351as opposed to what llc is currently generating:
352
353 imull $9, 4(%esp), %eax
354
355Currently the load folding imull has a higher complexity than the LEA32 pattern.
Evan Chengf42f5162006-03-04 07:49:50 +0000356
357//===---------------------------------------------------------------------===//
358
We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl. We should leave these as libcalls for everything over a much lower
threshold, since libc is hand tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc.).

//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
 x * copysign(1.0, y) * copysign(1.0, z)

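One reasonable lowering is to do the whole thing in the integer unit, since
the result is just x with its sign flipped iff exactly one of y and z is
negative. A sketch of the core computation, assuming float arguments on the
stack and ignoring NaN corner cases (moving the result bits back into an FP
register is omitted):

        movl 4(%esp), %eax              # bits of x
        movl 8(%esp), %ecx              # bits of y
        xorl 12(%esp), %ecx             # sign bit set iff y, z signs differ
        andl $0x80000000, %ecx          # keep only the sign bit
        xorl %ecx, %eax                 # flip x's sign accordingly
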
//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.

//===---------------------------------------------------------------------===//

%X = weak global int 0

void %foo(int %N) {
        %N = cast int %N to uint
        %tmp.24 = setgt int %N, 0
        br bool %tmp.24, label %no_exit, label %return

no_exit:
        %indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
        %i.0.0 = cast uint %indvar to int
        volatile store int %i.0.0, int* %X
        %indvar.next = add uint %indvar, 1
        %exitcond = seteq uint %indvar.next, %N
        br bool %exitcond, label %return, label %no_exit

return:
        ret void
}

compiles into:

        .text
        .align  4
        .globl  _foo
_foo:
        movl 4(%esp), %eax
        cmpl $1, %eax
        jl LBB_foo_4    # return
LBB_foo_1:      # no_exit.preheader
        xorl %ecx, %ecx
LBB_foo_2:      # no_exit
        movl L_X$non_lazy_ptr, %edx
        movl %ecx, (%edx)
        incl %ecx
        cmpl %eax, %ecx
        jne LBB_foo_2   # no_exit
LBB_foo_3:      # return.loopexit
LBB_foo_4:      # return
        ret

We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
rematerialization is implemented. This can be accomplished with 1) a target
dependent LICM pass or 2) making the SelectionDAG represent the whole function.

//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.

//===---------------------------------------------------------------------===//

Teach the coalescer to coalesce vregs of different register classes. e.g. FR32 /
FR64 to VR128.

//===---------------------------------------------------------------------===//

mov $reg, 48(%esp)
...
leal 48(%esp), %eax
mov %eax, (%esp)
call _foo

Obviously it would have been better for the first mov (or any op) to store
directly to %esp[0] if there are no other uses.

//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))
    return 3;
  else
    return 4;
}

_test:
        movl 8(%esp), %eax
        movaps (%eax), %xmm0
        movl 4(%esp), %eax
        movaps (%eax), %xmm1
        comiss %xmm0, %xmm1
        setae %al
        movzbl %al, %ecx
        movl $3, %eax
        movl $4, %edx
        cmpl $0, %ecx
        cmove %edx, %eax
        ret

Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There
are a number of issues. 1) We are introducing a setcc between the result of the
intrinsic call and the select. 2) The intrinsic is expected to produce an i32
value so an any_extend (which becomes a zero_extend) is added.

We probably need some kind of target DAG combine hook to fix this.

//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case we do worse on PPC.

//===---------------------------------------------------------------------===//

If shorter, we should use things like:
movzwl %ax, %eax
instead of:
andl $65535, %EAX

The former can also be used when the two-addressy nature of the 'and' would
require a copy to be inserted (in X86InstrInfo::convertToThreeAddress).

//===---------------------------------------------------------------------===//

This code generates ugly code, probably due to costs being off or something:

void %test(float* %P, <4 x float>* %P2 ) {
  %xFloat0.688 = load float* %P
  %loadVector37.712 = load <4 x float>* %P2
  %inFloat3.713 = insertelement <4 x float> %loadVector37.712, float 0.000000e+00, uint 3
  store <4 x float> %inFloat3.713, <4 x float>* %P2
  ret void
}

Generates:

_test:
        pxor %xmm0, %xmm0
        movd %xmm0, %eax        ;; EAX = 0!
        movl 8(%esp), %ecx
        movaps (%ecx), %xmm0
        pinsrw $6, %eax, %xmm0
        shrl $16, %eax          ;; EAX = 0 again!
        pinsrw $7, %eax, %xmm0
        movaps %xmm0, (%ecx)
        ret

It would be better to generate:

_test:
        movl 8(%esp), %ecx
        movaps (%ecx), %xmm0
        xor %eax, %eax
        pinsrw $6, %eax, %xmm0
        pinsrw $7, %eax, %xmm0
        movaps %xmm0, (%ecx)
        ret

or use pxor (to make a zero vector) and shuffle (to insert it).

//===---------------------------------------------------------------------===//

Bad codegen:

char foo(int x) { return x; }

_foo:
        movl 4(%esp), %eax
        shll $24, %eax
        sarl $24, %eax
        ret

SIGN_EXTEND_INREG can be implemented as (sext (trunc)) to take advantage of
sub-registers.

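For the example above, the preferable code is presumably just a sign-extending
load (matching the sign extension the shll/sarl pair performs today):

_foo:
        movsbl 4(%esp), %eax
        ret
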
//===---------------------------------------------------------------------===//

Consider this:

typedef struct pair { float A, B; } pair;
void pairtest(pair P, float *FP) {
        *FP = P.A+P.B;
}

We currently generate this code with llvmgcc4:

_pairtest:
        subl $12, %esp
        movl 20(%esp), %eax
        movl %eax, 4(%esp)
        movl 16(%esp), %eax
        movl %eax, (%esp)
        movss (%esp), %xmm0
        addss 4(%esp), %xmm0
        movl 24(%esp), %eax
        movss %xmm0, (%eax)
        addl $12, %esp
        ret

we should be able to generate:
_pairtest:
        movss 4(%esp), %xmm0
        movl 12(%esp), %eax
        addss 8(%esp), %xmm0
        movss %xmm0, (%eax)
        ret

The issue is that llvmgcc4 is forcing the struct to memory, then passing it as
integer chunks. It does this so that structs like {short,short} are passed in
a single 32-bit integer stack slot. We should handle the safe cases above much
nicer, while still handling the hard cases.

//===---------------------------------------------------------------------===//

Some ideas for instruction selection code simplification:

1. A pre-pass to determine which chain producing node can or cannot be folded.
   The generated isel code would then use the information.
2. The same pre-pass can force ordering of TokenFactor operands to allow
   load / store folding.
3. During isel, instead of recursively going up the chain operand chain, mark
   the chain operand as available and put it in some work list. Select other
   nodes in the normal manner. The chain operands are selected after all other
   nodes are selected. Uses of chain nodes are modified after instruction
   selection is completed.

//===---------------------------------------------------------------------===//

Another instruction selector deficiency:

void %bar() {
        %tmp = load int (int)** %foo
        %tmp = tail call int %tmp( int 3 )
        ret void
}

_bar:
        subl $12, %esp
        movl L_foo$non_lazy_ptr, %eax
        movl (%eax), %eax
        call *%eax
        addl $12, %esp
        ret

The current isel scheme will not allow the load to be folded in the call since
the load's chain result is read by the callseq_start.

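If the load could be folded, the body would shrink to something like:

_bar:
        subl $12, %esp
        movl L_foo$non_lazy_ptr, %eax
        call *(%eax)
        addl $12, %esp
        ret
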
//===---------------------------------------------------------------------===//

Don't forget to find a way to squash noop truncates in the JIT environment.

//===---------------------------------------------------------------------===//

Implement anyext in the same manner as truncate, so that anyext nodes can
likewise be eliminated.

//===---------------------------------------------------------------------===//

How about implementing truncate / anyext as a property of machine instruction
operand? i.e. Print as 32-bit super-class register / 16-bit sub-class register.
Do this for the cases where a truncate / anyext is guaranteed to be eliminated.
For IA32 that is truncate from 32 to 16 and anyext from 16 to 32.

//===---------------------------------------------------------------------===//

For this:

int test(int a)
{
  return a * 3;
}

We currently emit:
        imull $3, 4(%esp), %eax

Is this really what we should generate? Is imull three or four
cycles? Note: ICC generates this:
        movl    4(%esp), %eax
        leal    (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity. The former is
more "complex" because it folds a load so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
should always try to match LEA first since the LEA matching code does some
estimate to determine whether the match is profitable.

However, if we care more about code size, then imull is better. It's two bytes
shorter than movl + leal.

//===---------------------------------------------------------------------===//

Implement CTTZ, CTLZ with bsf and bsr.

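This mirrors what GCC emits in the clz/ctz example earlier in this file; a
sketch (bsf/bsr leave the destination undefined when the input is zero):

        bsfl 4(%esp), %eax              # cttz
        bsrl 4(%esp), %eax              # index of the highest set bit
        xorl $31, %eax                  # ... giving ctlz for a 32-bit value
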
//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.

//===---------------------------------------------------------------------===//

We should handle __attribute__ ((__visibility__ ("hidden"))).

//===---------------------------------------------------------------------===//

Consider:
int foo(int *a, int t) {
  int x;
  for (x=0; x<40; ++x)
    t = t + a[x] + x;
  return t;
}

We generate:
LBB1_1: #cond_true
        movl %ecx, %esi
        movl (%edx,%eax,4), %edi
        movl %esi, %ecx
        addl %edi, %ecx
        addl %eax, %ecx
        incl %eax
        cmpl $40, %eax
        jne LBB1_1 #cond_true

GCC generates:

L2:
        addl (%ecx,%edx,4), %eax
        addl %edx, %eax
        addl $1, %edx
        cmpl $40, %edx
        jne L2

Smells like a register coalescing/reassociation issue.

//===---------------------------------------------------------------------===//

Use cpuid to auto-detect CPU features such as SSE, SSE2, and SSE3.
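
A sketch of the relevant feature-bit tests (CPUID function 1; bit positions are
from the IA-32 manuals, and a real implementation must first verify that the
cpuid instruction itself is available):

        movl $1, %eax
        cpuid
        testl $0x02000000, %edx         # EDX bit 25: SSE
        testl $0x04000000, %edx         # EDX bit 26: SSE2
        testl $0x00000001, %ecx         # ECX bit 0:  SSE3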