//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform

===-------------------------------------------------------------------------===

Support 'update' load/store instructions.  These are cracked on the G5, but are
still a codesize win.

With preinc enabled, this:

long *%test4(long *%X, long *%dest) {
        %Y = getelementptr long* %X, int 4
        %A = load long* %Y
        store long %A, long* %dest
        ret long* %Y
}

compiles to:

_test4:
        mr r2, r3
        lwzu r5, 32(r2)
        lwz r3, 36(r3)
        stw r5, 0(r4)
        stw r3, 4(r4)
        mr r3, r2
        blr

with -sched=list-burr, I get:

_test4:
        lwz r2, 36(r3)
        lwzu r5, 32(r3)
        stw r2, 4(r4)
        stw r5, 0(r4)
        blr

===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        li r6, 0
        b LBB1_84       ;bb432.i
LBB1_83:        ;bb420.i
        lbzx r8, r5, r7
        addi r6, r7, 1
        stbx r8, r4, r7
LBB1_84:        ;bb432.i
        mr r7, r6
        cmplwi cr0, r7, 143
        bne cr0, LBB1_83        ;bb420.i

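Judging from the indexing and the 143-iteration trip count, the source loop is
presumably just a small byte copy, i.e. something like this sketch (function
and variable names invented):

void copy143(unsigned char *dst, const unsigned char *src) {
  int t;
  for (t = 0; t != 143; ++t)   /* the 143 comes from the cmplwi above */
    dst[t] = src[t];
}
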
The CBE manages to produce:

        li r0, 143
        mtctr r0
loop:
        lbzx r2, r2, r11
        stbx r0, r2, r9
        addi r2, r2, 1
        bdz later
        b loop

This could be much better (bdnz instead of bdz) but it still beats us.  If we
produced this with bdnz, the loop would be a single dispatch group.

===-------------------------------------------------------------------------===

Compile:

void foo(int *P) {
  if (P)  *P = 0;
}

into:

_foo:
        cmpwi cr0,r3,0
        beqlr cr0
        li r0,0
        stw r0,0(r3)
        blr

This is effectively a simple form of predication.

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

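In C terms, the idea is roughly this sketch (the compiler would do the
equivalent on the constant pool itself, not on the source):

double X2(double Y) {
  /* one pool object, addressed once */
  static const double CP[4] = { 1.23, 4.512, 2.34, 14.38 };
  const double *p = CP;                /* single address materialization */
  return (Y*p[0] + p[1])*p[2] + p[3];  /* the rest are immediate offsets */
}
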
===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
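
For example (a sketch of the transformation, not code from the note):

static int a, b, c;                /* three separate PIC address computations */

would become

static struct { int a, b, c; } g;  /* accesses become g.a, g.b, g.c */

so the single address of g can be materialized once and CSE'd across all three
accesses.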

===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implement divide as multiply by reciprocal when it has
more than one use.  Itanium will want this too.
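
A minimal sketch of the refinement math in C, assuming a reciprocal estimate
instruction (fres_estimate below stands in for the PPC fres instruction; the
name is illustrative, not a real API):

extern float fres_estimate(float);  /* stand-in for fres, ~5 bits correct */

float recip(float d) {
  float x = fres_estimate(d);
  x = x * (2.0f - d * x);   /* Newton-Raphson step: ~10 bits */
  x = x * (2.0f - d * x);   /* ~20 bits */
  x = x * (2.0f - d * x);   /* ~24 bits: full single precision */
  return x;
}

Each iteration roughly doubles the number of correct bits, so a few
multiply-adds replace a full divide when the reciprocal has multiple uses.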

===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240        ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy.  We make this currently:

_f1:
        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31
        or r3, r2, r2
        blr

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction.  The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two addr instr.

===-------------------------------------------------------------------------===

Compile offsets from allocas:

int *%test() {
        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1
        ret int* %Y
}

into a single add, not two:

_test:
        addi r2, r1, -8
        addi r3, r2, 4
        blr

--> important for C++.

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

Have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

  fp = &bar;
  for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

the Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10. That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5. The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

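A sketch of that shuffle on the example signature (register assignments follow
the ABI rules just described):

/* before: the double burns the bytes for r4/r5, so Z lands in r6 */
void foo (int X, double Y, int Z);   /* X->r3, Y->f1, Z->r6 */
/* after (only legal for functions we can internalize): */
void foo2(int X, int Z, double Y);   /* X->r3, Z->r4, Y->f1 */
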
===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

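An illustrative case (assumed from the ABI description above, not taken from
the linked thread):

struct S { double d; };
void f(struct S s);   /* s.d arrives split across r3/r4, not in an FPR */
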
===-------------------------------------------------------------------------===

Compile this:

int foo(int a) {
  int b = (a < 8);
  if (b) {
    return b * 3;     // ignore the fact that this is always 3.
  } else {
    return 2;
  }
}

into something not this:

_foo:
1)      cmpwi cr7, r3, 8
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
1)      cmpwi cr0, r3, 7
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
LBB1_1: ; then
        rlwinm r2, r2, 0, 31, 31
        mulli r3, r2, 3
        blr
LBB1_2: ; UnifiedReturnBlock
        li r3, 2
        blr

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists.  In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

The legalizer should lower this:

bool %test(ulong %x) {
  %tmp = setlt ulong %x, 4294967296
  ret bool %tmp
}

into "if x.high == 0", not:

_test:
        cntlzw r2, r3
        xori r3, r3, 1
        cmplwi cr0, r3, 0
        srwi r2, r2, 5
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

noticed in 2005-05-11-Popcount-ffs-fls.c.

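In C terms, the desired lowering is just a compare of the high word (a sketch):

int test(unsigned long long x) {
  return (unsigned)(x >> 32) == 0;   /* x < 2^32  <=>  x.high == 0 */
}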

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it.  That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops.  A simple example:

int foo(int a, int b) { return (a < b) << 4; }

compiles into:

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
        slwi r3, r2, 4
        blr

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

So that

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

Becomes

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {
  signed char t = 0;
  if (b) t = *a;
  if (c) *a = t;
}

===-------------------------------------------------------------------------===

This:
int test(unsigned *P) { return *P >> 24; }

Should compile to:

_test:
        lbz r3,0(r3)
        blr

not:

_test:
        lwz r2, 0(r3)
        srwi r3, r2, 24
        blr

===-------------------------------------------------------------------------===

On the G5, logical CR operations are more expensive in their three
address form: ops that read/write the same register are half as expensive as
those that read from two registers that are different from their destination.

We should model this with two separate instructions.  The isel should generate
the "two address" form of the instructions.  When the register allocator
detects that it needs to insert a copy due to the two-addressness of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress.  At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).

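The GCC sequence is the classic branchless abs(a-b) identity; in C:

void g2(int a, int b, int *P) {
  int d = a - b;
  int m = d >> 31;     /* arithmetic shift: m is -1 if d < 0, else 0 */
  *P = (d ^ m) - m;    /* negates d exactly when it was negative */
}
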
===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf

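A sketch of that strength reduction (assuming X > 0): carry the quotient and
remainder across iterations instead of recomputing them each trip.

int foo2(int N, int ***W, int **TK, int X) {
  int t, i, q = 0, r = 0;          /* q == t / X, r == t % X */

  for (t = 0; t < N; ++t) {
    for (i = 0; i < 4; ++i)
      W[q][i][r] = TK[i][t];
    if (++r == X) { r = 0; ++q; }  /* step the div/rem pair */
  }

  return 5;
}
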
===-------------------------------------------------------------------------===

float foo(float X) { return (int)(X); }

Currently produces:

_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        extsw r2, r2
        std r2, -16(r1)
        lfd f0, -16(r1)
        fcfid f0, f0
        frsp f1, f0
        blr

We could use a target dag combine to turn the lwz/extsw into an lwa when the
lwz has a single use.  Since LWA is cracked anyway, this would be a codesize
win only.

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}

===-------------------------------------------------------------------------===

Complete the signed i32 to FP conversion code using 64-bit registers
transformation, good for PI.  See PPCISelLowering.cpp, this comment:

  // FIXME: disable this lowered code.  This generates 64-bit register values,
  // and we don't model the fact that the top part is clobbered by calls.  We
  // need to flag these together so that the value isn't live across a call.
  //setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);

Also, if the registers are spilled to the stack, we have to ensure that all
64-bits of them are save/restored, otherwise we will miscompile the code.  It
sounds like we need to get the 64-bit register classes going.

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
entry:
        %tmp0 = bitcast %struct.B* %b to i32*           ; <i32*> [#uses=1]
        %tmp = load i32* %tmp0                          ; <i32> [#uses=1]
        %tmp3 = bitcast %struct.B* %b to i32*           ; <i32*> [#uses=1]
        %tmp4 = load i32* %tmp3                         ; <i32> [#uses=1]
        %tmp8 = bitcast %struct.B* %b to i32*           ; <i32*> [#uses=2]
        %tmp9 = load i32* %tmp8                         ; <i32> [#uses=1]
        %tmp4.mask17 = shl i32 %tmp4, i8 1              ; <i32> [#uses=1]
        %tmp1415 = and i32 %tmp4.mask17, 2147483648     ; <i32> [#uses=1]
        %tmp.masked = and i32 %tmp, 2147483648          ; <i32> [#uses=1]
        %tmp11 = or i32 %tmp1415, %tmp.masked           ; <i32> [#uses=1]
        %tmp12 = and i32 %tmp9, 2147483647              ; <i32> [#uses=1]
        %tmp13 = or i32 %tmp12, %tmp11                  ; <i32> [#uses=1]
        store i32 %tmp13, i32* %tmp8
        ret void
}

We emit:

_bar:
        lwz r2, 0(r3)
        slwi r4, r2, 1
        or r4, r4, r2
        rlwimi r2, r4, 0, 0, 0
        stw r2, 0(r3)
        blr

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

_bar:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We compile:

unsigned test6(unsigned x) {
  return ((x & 0x00FF0000) >> 16) | ((x & 0x000000FF) << 16);
}

into:

_test6:
        lis r2, 255
        rlwinm r3, r3, 16, 0, 31
        ori r2, r2, 255
        and r3, r3, r2
        blr

GCC gets it down to:

_test6:
        rlwinm r0,r3,16,8,15
        rlwinm r3,r3,16,24,31
        or r3,r3,r0
        blr

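In C terms (a reading of the GCC sequence; rlwinm rotates left, then ANDs with
the mask named by its MB..ME operands):

unsigned test6_gcc(unsigned x) {
  unsigned rot = (x << 16) | (x >> 16);
  unsigned hi  = rot & 0x00FF0000;  /* rlwinm r0,r3,16,8,15  */
  unsigned lo  = rot & 0x000000FF;  /* rlwinm r3,r3,16,24,31 */
  return hi | lo;                   /* or r3,r3,r0           */
}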

===-------------------------------------------------------------------------===

Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR
register.  This ends up producing code like this:

_foo:
.LBB_foo_0:     ; entry
        mflr r11
***     stw r11, 8(r1)
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
***     lwz r11, 8(r1)
        mtlr r11
        blr

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instrs): spilling it to a GPR is quite enough.

Implementing this will require some codegen improvements.  Nate writes:

582"So basically what we need to support the "no stack frame save and restore" is a
583generalization of the LR optimization to "callee-save regs".
584
585Currently, we have LR marked as a callee-save reg. The register allocator sees
586that it's callee save, and spills it directly to the stack.
587
588Ideally, something like this would happen:
589
590LR would be in a separate register class from the GPRs. The class of LR would be
591marked "unspillable". When the register allocator came across an unspillable
592reg, it would ask "what is the best class to copy this into that I *can* spill"
593If it gets a class back, which it will in this case (the gprs), it grabs a free
594register of that class. If it is then later necessary to spill that reg, so be
595it.
596
===-------------------------------------------------------------------------===

We compile this:
int test(_Bool X) {
  return X ? 524288 : 0;
}

to:
_test:
        cmplwi cr0, r3, 0
        lis r2, 8
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

instead of:
_test:
        addic r2,r3,-1
        subfe r0,r2,r3
        slwi r3,r0,19
        blr

(The addic/subfe pair materializes X as 0 or 1 in r0 without a branch, and the
slwi then scales it to 524288.)

This sort of thing occurs a lot due to globalopt.

===-------------------------------------------------------------------------===