//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform
* Implement __builtin_trap (ISD::TRAP) as 'tw 31, 0, 0' aka 'trap'.
* lmw/stmw pass a la arm load store optimizer for prolog/epilog

===-------------------------------------------------------------------------===

Support 'update' load/store instructions.  These are cracked on the G5, but are
still a codesize win.

With preinc enabled, this:

long *%test4(long *%X, long *%dest) {
        %Y = getelementptr long* %X, int 4
        %A = load long* %Y
        store long %A, long* %dest
        ret long* %Y
}

compiles to:

_test4:
        mr r2, r3
        lwzu r5, 32(r2)
        lwz r3, 36(r3)
        stw r5, 0(r4)
        stw r3, 4(r4)
        mr r3, r2
        blr

with -sched=list-burr, I get:

_test4:
        lwz r2, 36(r3)
        lwzu r5, 32(r3)
        stw r2, 4(r4)
        stw r5, 0(r4)
        blr

===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        li r6, 0
        b LBB1_84       ;bb432.i
LBB1_83:        ;bb420.i
        lbzx r8, r5, r7
        addi r6, r7, 1
        stbx r8, r4, r7
LBB1_84:        ;bb432.i
        mr r7, r6
        cmplwi cr0, r7, 143
        bne cr0, LBB1_83        ;bb420.i

The CBE manages to produce:

        li r0, 143
        mtctr r0
loop:
        lbzx r2, r2, r11
        stbx r0, r2, r9
        addi r2, r2, 1
        bdz later
        b loop

This could be much better (bdnz instead of bdz) but it still beats us.  If we
produced this with bdnz, the loop would be a single dispatch group.

===-------------------------------------------------------------------------===

Compile:

void foo(int *P) {
  if (P)  *P = 0;
}

into:

_foo:
        cmpwi cr0,r3,0
        beqlr cr0
        li r0,0
        stw r0,0(r3)
        blr

This is effectively a simple form of predication.

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

Here's another example (the sgn function):
double testf(double a) {
       return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

it produces a BB like this:
LBB1_1: ; cond_true
        lis r2, ha16(LCPI1_0)
        lfs f0, lo16(LCPI1_0)(r2)
        lis r2, ha16(LCPI1_1)
        lis r3, ha16(LCPI1_2)
        lfs f2, lo16(LCPI1_2)(r3)
        lfs f3, lo16(LCPI1_1)(r2)
        fsub f0, f0, f1
        fsel f1, f0, f2, f3
        blr

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
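
As a sketch of the transformation at the source level (the merged struct and
its name are hypothetical; an IPO pass would build it from the module's
internal globals):

/* before: each global needs its own ha16/lo16 (or GOT) address computation */
static int counter;
static int limit;

/* after: one base address that can be CSE'd; members are constant offsets */
static struct { int counter; int limit; } MergedGlobals;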

===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implement divide as multiply-by-reciprocal when the
reciprocal has more than one use.  Itanium will want this too.
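
For reference, one refinement step of the iteration in C (a sketch; on PPC the
initial estimate x0 would come from fres or frsqrte):

/* One Newton-Raphson step for the reciprocal of d: given x0 ~= 1/d, the
   error roughly squares, so each step doubles the number of correct bits. */
float refine_recip(float d, float x0) {
  return x0 * (2.0f - d * x0);
}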

===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240        ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy.  We currently generate:

_f1:
        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31
        or r3, r2, r2
        blr

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction.  The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two addr instr.

===-------------------------------------------------------------------------===

Compile offsets from allocas:

int *%test() {
        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1
        ret int* %Y
}

into a single add, not two:

_test:
        addi r2, r1, -8
        addi r3, r2, 4
        blr

--> important for C++.

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

  fp = &bar;
  for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.
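
A compilable version of that sketch (bar stands in for any external or
linkonce callee):

extern void bar(void);

void hoisted(int n) {
  void (*fp)(void) = &bar;  /* address of bar computed once, before the loop */
  int i;
  for (i = 0; i < n; ++i)
    fp();                   /* indirect call; no per-iteration stub */
}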

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.
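
A sketch of what the shuffle buys (register assignments assume the 32-bit
Darwin ABI described above, where each double burns two GPR argument slots):

/* before: a->r3, x->f1 (burns r4/r5), y->f2 (burns r6/r7),
   z->f3 (burns r8/r9), b->r10, c->memory */
void before(int a, double x, double y, double z, int b, int c);

/* after: a->r3, b->r4, c->r5, x->f1, y->f2, z->f3; nothing in memory */
void after(int a, int b, int c, double x, double y, double z);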

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.
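
An illustration of the problem (register assignments assume the 32-bit Darwin
ABI; the struct's bytes travel in GPRs even though its only field is a double):

struct S { double d; };
void takes_struct(struct S s);  /* s.d arrives split across r3/r4 */
void takes_double(double d);    /* d arrives in f1 */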

===-------------------------------------------------------------------------===

Compile this:

int foo(int a) {
  int b = (a < 8);
  if (b) {
    return b * 3;     // ignore the fact that this is always 3.
  } else {
    return 2;
  }
}

into something not this:

_foo:
1)      cmpwi cr7, r3, 8
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
1)      cmpwi cr0, r3, 7
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
LBB1_1: ; then
        rlwinm r2, r2, 0, 31, 31
        mulli r3, r2, 3
        blr
LBB1_2: ; UnifiedReturnBlock
        li r3, 2
        blr

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists.  In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it.  That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops.  A simple example:

int foo(int a, int b) { return (a < b) << 4; }

compiles into:

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
        slwi r3, r2, 4
        blr

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

So that

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

Becomes

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {
  signed char t = 0;
  if (b)  t = *a;
  if (c)  *a = t;
}

===-------------------------------------------------------------------------===

This:
int test(unsigned *P) { return *P >> 24; }

Should compile to:

_test:
        lbz r3,0(r3)
        blr

not:

_test:
        lwz r2, 0(r3)
        srwi r3, r2, 24
        blr

===-------------------------------------------------------------------------===

On the G5, logical CR operations are more expensive in their three-address
form: ops that read/write the same register are half as expensive as those
that read from two registers that are different from their destination.

We should model this with two separate instructions.  The isel should generate
the "two address" form of the instructions.  When the register allocator
detects that it needs to insert a copy due to the two-addressness of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress.  At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer, since it uses the branchless identity
abs(x) = (x ^ (x >> 31)) - (x >> 31).

This theoretically may help improve twolf slightly (used in dimbox.c:142?).

===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
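
A sketch of what that strength reduction looks like at the source level
(maintain the quotient and remainder incrementally; assumes X >= 1):

int foo_reduced(int N, int ***W, int **TK, int X) {
  int t, i;
  int q = 0, r = 0;             /* invariant: q == t / X, r == t % X */

  for (t = 0; t < N; ++t) {
    for (i = 0; i < 4; ++i)
      W[q][i][r] = TK[i][t];
    if (++r == X) {             /* carry the remainder into the quotient */
      r = 0;
      ++q;
    }
  }
  return 5;
}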

===-------------------------------------------------------------------------===

float foo(float X) { return (int)(X); }

Currently produces:

_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        extsw r2, r2
        std r2, -16(r1)
        lfd f0, -16(r1)
        fcfid f0, f0
        frsp f1, f0
        blr

We could use a target dag combine to turn the lwz/extsw into an lwa when the
lwz has a single use.  Since LWA is cracked anyway, this would be a codesize
win only.

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}

===-------------------------------------------------------------------------===

Complete the 'signed i32 to FP conversion using 64-bit registers'
transformation, which is good for PI.  See PPCISelLowering.cpp, this comment:

  // FIXME: disable this lowered code.  This generates 64-bit register values,
  // and we don't model the fact that the top part is clobbered by calls.  We
  // need to flag these together so that the value isn't live across a call.
  //setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);

Also, if the registers are spilled to the stack, we have to ensure that all
64 bits of them are saved/restored, otherwise we will miscompile the code.  It
sounds like we need to get the 64-bit register classes going.

===-------------------------------------------------------------------------===

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
entry:
        %tmp = bitcast %struct.B* %b to i32*            ; <uint*> [#uses=1]
        %tmp = load i32* %tmp                           ; <uint> [#uses=1]
        %tmp3 = bitcast %struct.B* %b to i32*           ; <uint*> [#uses=1]
        %tmp4 = load i32* %tmp3                         ; <uint> [#uses=1]
        %tmp8 = bitcast %struct.B* %b to i32*           ; <uint*> [#uses=2]
        %tmp9 = load i32* %tmp8                         ; <uint> [#uses=1]
        %tmp4.mask17 = shl i32 %tmp4, i8 1              ; <uint> [#uses=1]
        %tmp1415 = and i32 %tmp4.mask17, 2147483648     ; <uint> [#uses=1]
        %tmp.masked = and i32 %tmp, 2147483648          ; <uint> [#uses=1]
        %tmp11 = or i32 %tmp1415, %tmp.masked           ; <uint> [#uses=1]
        %tmp12 = and i32 %tmp9, 2147483647              ; <uint> [#uses=1]
        %tmp13 = or i32 %tmp12, %tmp11                  ; <uint> [#uses=1]
        store i32 %tmp13, i32* %tmp8
        ret void
}

We emit:

_foo:
        lwz r2, 0(r3)
        slwi r4, r2, 1
        or r4, r4, r2
        rlwimi r2, r4, 0, 0, 0
        stw r2, 0(r3)
        blr

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

_foo:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We compile:

unsigned test6(unsigned x) {
  return ((x & 0x00FF0000) >> 16) | ((x & 0x000000FF) << 16);
}

into:

_test6:
        lis r2, 255
        rlwinm r3, r3, 16, 0, 31
        ori r2, r2, 255
        and r3, r3, r2
        blr

GCC gets it down to:

_test6:
        rlwinm r0,r3,16,8,15
        rlwinm r3,r3,16,24,31
        or r3,r3,r0
        blr

===-------------------------------------------------------------------------===

Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR register.
This ends up producing code like this:

_foo:
.LBB_foo_0:     ; entry
        mflr r11
***     stw r11, 8(r1)
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
***     lwz r11, 8(r1)
        mtlr r11
        blr

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instrs): spilling it to a GPR is quite enough.

Implementing this will require some codegen improvements.  Nate writes:

"So basically what we need to support the "no stack frame save and restore" is a
generalization of the LR optimization to "callee-save regs".

Currently, we have LR marked as a callee-save reg.  The register allocator sees
that it's callee save, and spills it directly to the stack.

Ideally, something like this would happen:

LR would be in a separate register class from the GPRs.  The class of LR would
be marked "unspillable".  When the register allocator came across an
unspillable reg, it would ask "what is the best class to copy this into that I
*can* spill".  If it gets a class back, which it will in this case (the gprs),
it grabs a free register of that class.  If it is then later necessary to spill
that reg, so be it."

===-------------------------------------------------------------------------===

We compile this:
int test(_Bool X) {
  return X ? 524288 : 0;
}

to:
_test:
        cmplwi cr0, r3, 0
        lis r2, 8
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

instead of:
_test:
        addic r2,r3,-1
        subfe r0,r2,r3
        slwi r3,r0,19
        blr

This sort of thing occurs a lot due to globalopt.

===-------------------------------------------------------------------------===

We currently compile 32-bit bswap:

declare i32 @llvm.bswap.i32(i32 %A)
define i32 @test(i32 %A) {
        %B = call i32 @llvm.bswap.i32(i32 %A)
        ret i32 %B
}

to:

_test:
        rlwinm r2, r3, 24, 16, 23
        slwi r4, r3, 24
        rlwimi r2, r3, 8, 24, 31
        rlwimi r4, r3, 8, 8, 15
        rlwimi r4, r2, 0, 16, 31
        mr r3, r4
        blr

it would be more efficient to produce:

_foo:   mr r0,r3
        rlwinm r3,r3,8,0xffffffff
        rlwimi r3,r0,24,0,7
        rlwimi r3,r0,24,16,23
        blr

===-------------------------------------------------------------------------===

test/CodeGen/PowerPC/2007-03-24-cntlzd.ll compiles to:

__ZNK4llvm5APInt17countLeadingZerosEv:
        ld r2, 0(r3)
        cntlzd r2, r2
        or r2, r2, r2     <<-- silly.
        addi r3, r2, -64
        blr

The dead or is a 'truncate' from 64- to 32-bits.

===-------------------------------------------------------------------------===

We generate horrible ppc code for this:

#define N 2000000
double a[N],c[N];
void simpleloop() {
   int j;
   for (j=0; j<N; j++)
     c[j] = a[j];
}

LBB1_1: ;bb
        lfdx f0, r3, r4
        addi r5, r5, 1                 ;; Extra IV for the exit value compare.
        stfdx f0, r2, r4
        addi r4, r4, 8

        xoris r6, r5, 30               ;; This is due to a large immediate.
        cmplwi cr0, r6, 33920
        bne cr0, LBB1_1

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

__Z11no_overflowjj:
        add r4,r3,r4
        subfc r3,r3,r4
        li r3,0
        adde r3,r3,r3
        blr

(or better) not:

__Z11no_overflowjj:
        add r2, r4, r3
        cmplw cr7, r2, r3
        mfcr r2
        rlwinm r2, r2, 29, 31, 31
        xori r3, r2, 1
        blr

//===---------------------------------------------------------------------===//

We compile some FP comparisons into an mfcr with two rlwinms and an or.  For
example:
#include <math.h>
int test(double x, double y) { return islessequal(x, y);}
int test2(double x, double y) {  return islessgreater(x, y);}
int test3(double x, double y) {  return !islessequal(x, y);}

Compiles into (all three are similar, but the bits differ):

_test:
        fcmpu cr7, f1, f2
        mfcr r2
        rlwinm r3, r2, 29, 31, 31
        rlwinm r2, r2, 31, 31, 31
        or r3, r2, r3
        blr

GCC compiles this into:

_test:
        fcmpu cr7,f1,f2
        cror 30,28,30
        mfcr r3
        rlwinm r3,r3,31,1
        blr

which is more efficient and can use mfocr.  See PR642 for some more context.

//===---------------------------------------------------------------------===//
739//===---------------------------------------------------------------------===//