//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform

===-------------------------------------------------------------------------===

Support 'update' load/store instructions.  These are cracked on the G5, but are
still a codesize win.

===-------------------------------------------------------------------------===

Teach the .td file to pattern match PPC::BR_COND to the appropriate bc variant,
so we don't have to always run the branch selector for small functions.

===-------------------------------------------------------------------------===

* Codegen this:

   void test2(int X) {
     if (X == 0x12345678) bar();
   }

    as:

       xoris r0,r3,0x1234
       cmplwi cr0,r0,0x5678
       beq cr0,L6

    not:

        lis r2, 4660
        ori r2, r2, 22136
        cmpw cr0, r3, r2
        bne .LBB_test2_2

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implement divide as multiply by reciprocal when it has
more than one use.  Itanium will want this too.

===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240        ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy.  We make this currently:

_f1:
        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31
        or r3, r2, r2
        blr

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction.  The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two-addr
instruction.

===-------------------------------------------------------------------------===

Compile offsets from allocas:

int *%test() {
        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1
        ret int* %Y
}

into a single add, not two:

_test:
        addi r2, r1, -8
        addi r3, r2, 4
        blr

--> important for C++.

===-------------------------------------------------------------------------===

int test3(int a, int b) { return (a < 0) ? a : 0; }

should be branch free code.  LLVM is turning it into < 1 because of the RHS.

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

Have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

     fp = &bar;
     for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

PowerPC i1/setcc stuff (depends on subreg stuff):

Check out the PPC code we get for 'compare' in this testcase:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19672

oof.  on top of not doing the logical crnand instead of (mfcr, mfcr,
invert, invert, or), we then have to compare it against zero instead of
using the value already in a CR!

that should be something like
        cmpw cr7, r8, r5
        cmpw cr0, r7, r3
        crnand cr0, cr0, cr7
        bne cr0, LBB_compare_4

instead of
        cmpw cr7, r8, r5
        cmpw cr0, r7, r3
        mfcr r7, 1
        mcrf cr7, cr0
        mfcr r8, 1
        rlwinm r7, r7, 30, 31, 31
        rlwinm r8, r8, 30, 31, 31
        xori r7, r7, 1
        xori r8, r8, 1
        addi r2, r2, 1
        or r7, r8, r7
        cmpwi cr0, r7, 0
        bne cr0, LBB_compare_4  ; loopexit

FreeBench/mason has a basic block that looks like this:

        %tmp.130 = seteq int %p.0__, 5          ; <bool> [#uses=1]
        %tmp.134 = seteq int %p.1__, 6          ; <bool> [#uses=1]
        %tmp.139 = seteq int %p.2__, 12         ; <bool> [#uses=1]
        %tmp.144 = seteq int %p.3__, 13         ; <bool> [#uses=1]
        %tmp.149 = seteq int %p.4__, 14         ; <bool> [#uses=1]
        %tmp.154 = seteq int %p.5__, 15         ; <bool> [#uses=1]
        %bothcond = and bool %tmp.134, %tmp.130         ; <bool> [#uses=1]
        %bothcond123 = and bool %bothcond, %tmp.139     ; <bool>
        %bothcond124 = and bool %bothcond123, %tmp.144  ; <bool>
        %bothcond125 = and bool %bothcond124, %tmp.149  ; <bool>
        %bothcond126 = and bool %bothcond125, %tmp.154  ; <bool>
        br bool %bothcond126, label %shortcirc_next.5, label %else.0

This is a particularly important case where handling CRs better will help.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64-bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BIT_CONVERT to LLVM, then do an interprocedural
transformation that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Compile this:

int foo(int a) {
  int b = (a < 8);
  if (b) {
    return b * 3;     // ignore the fact that this is always 3.
  } else {
    return 2;
  }
}

into something not this:

_foo:
1)      cmpwi cr7, r3, 8
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
1)      cmpwi cr0, r3, 7
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
LBB1_1: ; then
        rlwinm r2, r2, 0, 31, 31
        mulli r3, r2, 3
        blr
LBB1_2: ; UnifiedReturnBlock
        li r3, 2
        blr

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists.  In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

The legalizer should lower this:

bool %test(ulong %x) {
  %tmp = setlt ulong %x, 4294967296
  ret bool %tmp
}

into "if x.high == 0", not:

_test:
        addi r2, r3, -1
        cntlzw r2, r2
        cntlzw r3, r3
        srwi r2, r2, 5
        srwi r4, r3, 5
        li r3, 0
        cmpwi cr0, r2, 0
        bne cr0, LBB1_2 ;
LBB1_1:
        or r3, r4, r4
LBB1_2:
        blr

noticed in 2005-05-11-Popcount-ffs-fls.c.

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it.  That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops.  A simple example:

int foo(int a, int b) { return (a < b) << 4; }

compiles into:

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
        slwi r3, r2, 4
        blr

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses, so that
this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

which currently compiles to:

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

instead compiles to:

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {
  signed char t = 0;
  if (b)  t = *a;
  if (c)  *a = t;
}

===-------------------------------------------------------------------------===

This:
int test(unsigned *P) { return *P >> 24; }

Should compile to:

_test:
        lbz r3,0(r3)
        blr

not:

_test:
        lwz r2, 0(r3)
        srwi r3, r2, 24
        blr

===-------------------------------------------------------------------------===

On the G5, logical CR operations are more expensive in their three-address
form: ops that read/write the same register are half as expensive as those
that read from two registers that are different from their destination.

We should model this with two separate instructions.  The isel should generate
the "two address" form of the instructions.  When the register allocator
detects that it needs to insert a copy due to the two-address nature of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress.  At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).

===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf

===-------------------------------------------------------------------------===

float foo(float X) { return (int)(X); }

Currently produces:

_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        extsw r2, r2
        std r2, -16(r1)
        lfd f0, -16(r1)
        fcfid f0, f0
        frsp f1, f0
        blr

We could use a target dag combine to turn the lwz/extsw into an lwa when the
lwz has a single use.  Since LWA is cracked anyway, this would be a codesize
win only.

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}

===-------------------------------------------------------------------------===

Complete the signed i32 to FP conversion code using 64-bit registers
transformation, good for PI.  See PPCISelLowering.cpp, this comment:

     // FIXME: disable this lowered code.  This generates 64-bit register values,
     // and we don't model the fact that the top part is clobbered by calls.  We
     // need to flag these together so that the value isn't live across a call.
     //setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);

Also, if the registers are spilled to the stack, we have to ensure that all
64 bits of them are saved/restored, otherwise we will miscompile the code.  It
sounds like we need to get the 64-bit register classes going.

===-------------------------------------------------------------------------===

%struct.B = type { ubyte, [3 x ubyte] }

void %foo(%struct.B* %b) {
entry:
        %tmp = cast %struct.B* %b to uint*              ; <uint*> [#uses=1]
        %tmp = load uint* %tmp                          ; <uint> [#uses=1]
        %tmp3 = cast %struct.B* %b to uint*             ; <uint*> [#uses=1]
        %tmp4 = load uint* %tmp3                        ; <uint> [#uses=1]
        %tmp8 = cast %struct.B* %b to uint*             ; <uint*> [#uses=2]
        %tmp9 = load uint* %tmp8                        ; <uint> [#uses=1]
        %tmp4.mask17 = shl uint %tmp4, ubyte 1          ; <uint> [#uses=1]
        %tmp1415 = and uint %tmp4.mask17, 2147483648    ; <uint> [#uses=1]
        %tmp.masked = and uint %tmp, 2147483648         ; <uint> [#uses=1]
        %tmp11 = or uint %tmp1415, %tmp.masked          ; <uint> [#uses=1]
        %tmp12 = and uint %tmp9, 2147483647             ; <uint> [#uses=1]
        %tmp13 = or uint %tmp12, %tmp11                 ; <uint> [#uses=1]
        store uint %tmp13, uint* %tmp8
        ret void
}

We emit:

_foo:
        lwz r2, 0(r3)
        slwi r4, r2, 1
        or r4, r4, r2
        rlwimi r2, r4, 0, 0, 0
        stw r2, 0(r3)
        blr

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

_foo:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

On PPC64, this results in a truncate followed by a truncstore.  These should
be folded together.

unsigned short G;
void foo(unsigned long H) { G = H; }