//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform
* implement powerpc-64 for darwin

===-------------------------------------------------------------------------===

Support 'update' load/store instructions. These are cracked on the G5, but are
still a codesize win.
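
For reference, the kind of loop that benefits (hypothetical example): each
iteration does a load plus a pointer bump, which an update-form load such as
lwzu performs as a single instruction.

```c
/* Each trip does a load and a pointer update; an update-form load (lwzu)
   folds both into one instruction, shrinking the loop body. */
int sum_walk(const int *p, int n) {
    int s = 0;
    while (n--) {
        s += *p;
        ++p;
    }
    return s;
}
```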

===-------------------------------------------------------------------------===

Teach the .td file to pattern match PPC::BR_COND to the appropriate bc variant,
so we don't always have to run the branch selector for small functions.

===-------------------------------------------------------------------------===

Codegen this:

   void test2(int X) {
     if (X == 0x12345678) bar();
   }

as:

   xoris r0,r3,0x1234
   cmplwi cr0,r0,0x5678
   beq cr0,L6

not:

        lis r2, 4660
        ori r2, r2, 22136
        cmpw cr0, r3, r2
        bne .LBB_test2_2

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start. For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's. This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
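
A source-level sketch of the transform (hypothetical names; the real pass
would rewrite the IR, not the C):

```c
/* Before: three separate globals, each needing its own PIC/GOT address
   computation at every use. */
static int xa, xb, xc;

/* After: one struct whose base address is computed once and CSE'd;
   members are reached at constant offsets from that base. */
static struct { int a, b, c; } X;

int sum_separate(void) { xa = 1; xb = 2; xc = 3; return xa + xb + xc; }
int sum_squished(void) { X.a = 1; X.b = 2; X.c = 3; return X.a + X.b + X.c; }
```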

===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for refining the reciprocal estimate
instructions to the required accuracy, and implement divide as multiply by
reciprocal when it has more than one use. Itanium will want this too.
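
The refinement step itself, as a hedged sketch (the real estimate would come
from fres/frsqrte; here a crude software guess stands in):

```c
/* One Newton-Raphson iteration for r ~ 1/x: r' = r * (2 - x*r).
   Each step roughly squares the relative error, so a low-precision
   hardware estimate converges to full accuracy in a few iterations. */
double refine_recip(double x, double r) {
    return r * (2.0 - x * r);
}
```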

===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240        ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy. We currently produce:

_f1:
        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31
        or r3, r2, r2
        blr

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction. The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two addr instr.

===-------------------------------------------------------------------------===

Compile offsets from allocas:

int *%test() {
        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1
        ret int* %Y
}

into a single add, not two:

_test:
        addi r2, r1, -8
        addi r3, r2, 4
        blr

--> important for C++.

===-------------------------------------------------------------------------===

int test3(int a, int b) { return (a < 0) ? a : 0; }

should be branch-free code. LLVM is turning it into < 1 because of the RHS.
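
For reference, the branch-free form (a sketch; it relies on arithmetic right
shift of negative values, which is implementation-defined in C but is what PPC
and every mainstream compiler does):

```c
/* a >> 31 is all-ones exactly when a is negative, so the AND yields
   a when a < 0 and 0 otherwise -- no compare, no branch. */
int test3_branchfree(int a) {
    return a & (a >> 31);
}
```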

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

have to go through an indirect stub if bar is external or linkonce. It would
be better to compile it as:

  fp = &bar;
  for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub). This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

PowerPC i1/setcc stuff (depends on subreg stuff):

Check out the PPC code we get for 'compare' in this testcase:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19672

oof. on top of not doing the logical crnand instead of (mfcr, mfcr,
invert, invert, or), we then have to compare it against zero instead of
using the value already in a CR!

that should be something like
        cmpw cr7, r8, r5
        cmpw cr0, r7, r3
        crnand cr0, cr0, cr7
        bne cr0, LBB_compare_4

instead of
        cmpw cr7, r8, r5
        cmpw cr0, r7, r3
        mfcr r7, 1
        mcrf cr7, cr0
        mfcr r8, 1
        rlwinm r7, r7, 30, 31, 31
        rlwinm r8, r8, 30, 31, 31
        xori r7, r7, 1
        xori r8, r8, 1
        addi r2, r2, 1
        or r7, r8, r7
        cmpwi cr0, r7, 0
        bne cr0, LBB_compare_4  ; loopexit

FreeBench/mason has a basic block that looks like this:

        %tmp.130 = seteq int %p.0__, 5          ; <bool> [#uses=1]
        %tmp.134 = seteq int %p.1__, 6          ; <bool> [#uses=1]
        %tmp.139 = seteq int %p.2__, 12         ; <bool> [#uses=1]
        %tmp.144 = seteq int %p.3__, 13         ; <bool> [#uses=1]
        %tmp.149 = seteq int %p.4__, 14         ; <bool> [#uses=1]
        %tmp.154 = seteq int %p.5__, 15         ; <bool> [#uses=1]
        %bothcond = and bool %tmp.134, %tmp.130         ; <bool> [#uses=1]
        %bothcond123 = and bool %bothcond, %tmp.139     ; <bool>
        %bothcond124 = and bool %bothcond123, %tmp.144  ; <bool>
        %bothcond125 = and bool %bothcond124, %tmp.149  ; <bool>
        %bothcond126 = and bool %bothcond125, %tmp.154  ; <bool>
        br bool %bothcond126, label %shortcirc_next.5, label %else.0

This is a particularly important case where handling CRs better will help.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10. That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5. The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad. Add something like a BIT_CONVERT to LLVM, then do an interprocedural
transformation that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Generate lwbrx and other byteswapping load/store instructions when reasonable.
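
What lwbrx buys: the open-coded byte-reversed load below (hypothetical helper
name) takes four shift/mask/or steps that the single instruction replaces.

```c
#include <stdint.h>

/* Byte-reversed 32-bit load; lwbrx does this in one instruction. */
uint32_t load_byteswapped(const uint32_t *p) {
    uint32_t x = *p;
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}
```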

===-------------------------------------------------------------------------===

Compile this:

int foo(int a) {
  int b = (a < 8);
  if (b) {
    return b * 3;     // ignore the fact that this is always 3.
  } else {
    return 2;
  }
}

into something not this:

_foo:
1)      cmpwi cr7, r3, 8
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
1)      cmpwi cr0, r3, 7
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
LBB1_1: ; then
        rlwinm r2, r2, 0, 31, 31
        mulli r3, r2, 3
        blr
LBB1_2: ; UnifiedReturnBlock
        li r3, 2
        blr

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists. In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

The legalizer should lower this:

bool %test(ulong %x) {
  %tmp = setlt ulong %x, 4294967296
  ret bool %tmp
}

into "if x.high == 0", not:

_test:
        addi r2, r3, -1
        cntlzw r2, r2
        cntlzw r3, r3
        srwi r2, r2, 5
        srwi r4, r3, 5
        li r3, 0
        cmpwi cr0, r2, 0
        bne cr0, LBB1_2 ;
LBB1_1:
        or r3, r4, r4
LBB1_2:
        blr

noticed in 2005-05-11-Popcount-ffs-fls.c.
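
In C terms, the desired lowering is just (sketch):

```c
/* An unsigned 64-bit compare x < 2^32 is exactly "the high word is
   zero", so a single compare of the high register suffices. */
int lt_2_32(unsigned long long x) {
    return (unsigned int)(x >> 32) == 0;
}
```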

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it. That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops. A simple example:

int foo(int a, int b) { return (a < b) << 4; }

compiles into:

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
        slwi r3, r2, 4
        blr

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

So that

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

Becomes

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {
  signed char t = 0;
  if (b) t = *a;
  if (c) *a = t;
}

===-------------------------------------------------------------------------===

This:
int test(unsigned *P) { return *P >> 24; }

Should compile to:

_test:
        lbz r3,0(r3)
        blr

not:

_test:
        lwz r2, 0(r3)
        srwi r3, r2, 24
        blr

===-------------------------------------------------------------------------===

On the G5, logical CR operations are more expensive in their three-address
form: ops that read/write the same register are half as expensive as those
that read from two registers that are different from their destination.

We should model this with two separate instructions. The isel should generate
the "two address" form of the instructions. When the register allocator
detects that it needs to insert a copy due to the two-addressness of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress. At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).
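
GCC's sequence is the classic sign-mask idiom; in C (assuming arithmetic
right shift of negatives, as on PPC):

```c
/* m = x >> 31 is 0 for non-negative x and -1 for negative x, so
   (x ^ m) - m conditionally negates x: the srawi/xor/subf sequence. */
int iabs(int x) {
    int m = x >> 31;
    return (x ^ m) - m;
}
```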

===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
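
A source-level sketch of what strength-reducing t / X and t % X in that loop
would look like (hypothetical rewrite; assumes X > 0):

```c
/* Replace the per-iteration div/rem with an incrementally maintained
   quotient q and remainder r: r counts 0..X-1 and q bumps on wrap. */
void fill_indices(int N, int X, int *q_out, int *r_out) {
    int q = 0, r = 0;
    for (int t = 0; t < N; ++t) {
        q_out[t] = q;              /* equals t / X */
        r_out[t] = r;              /* equals t % X */
        if (++r == X) { r = 0; ++q; }
    }
}
```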

===-------------------------------------------------------------------------===

float foo(float X) { return (int)(X); }

Currently produces:

_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        extsw r2, r2
        std r2, -16(r1)
        lfd f0, -16(r1)
        fcfid f0, f0
        frsp f1, f0
        blr

We could use a target dag combine to turn the lwz/extsw into an lwa when the
lwz has a single use. Since LWA is cracked anyway, this would be a codesize
win only.

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}

===-------------------------------------------------------------------------===

Complete the signed-i32-to-FP conversion using the 64-bit-registers
transformation; this is good for PI. See PPCISelLowering.cpp, this comment:

  // FIXME: disable this lowered code. This generates 64-bit register values,
  // and we don't model the fact that the top part is clobbered by calls. We
  // need to flag these together so that the value isn't live across a call.
  //setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);

===-------------------------------------------------------------------------===

Another missed rlwimi case:

void %foo(uint *%tmp) {
        %tmp = load uint* %tmp                  ; <uint> [#uses=3]
        %tmp1 = shr uint %tmp, ubyte 31         ; <uint> [#uses=1]
        %tmp1 = cast uint %tmp1 to ubyte        ; <ubyte> [#uses=1]
        %tmp4.mask = shr uint %tmp, ubyte 30    ; <uint> [#uses=1]
        %tmp4.mask = cast uint %tmp4.mask to ubyte      ; <ubyte> [#uses=1]
        %tmp = or ubyte %tmp4.mask, %tmp1       ; <ubyte> [#uses=1]
        %tmp10 = cast ubyte %tmp to uint        ; <uint> [#uses=1]
        %tmp11 = shl uint %tmp10, ubyte 31      ; <uint> [#uses=1]
        %tmp12 = and uint %tmp, 2147483647      ; <uint> [#uses=1]
        %tmp13 = or uint %tmp11, %tmp12         ; <uint> [#uses=1]
        store uint %tmp13, uint* %tmp
        ret void
}

We emit:

_foo:
        lwz r2, 0(r3)
        srwi r4, r2, 30
        srwi r5, r2, 31
        or r4, r4, r5
        rlwimi r2, r4, 31, 0, 0
        stw r2, 0(r3)
        blr

What this code is really doing is ORing bit 0 with bit 1. We could codegen this
as:

_foo:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr
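
In C terms the whole block reduces to this (hypothetical helper name; recall
that PPC bit 0 is the MSB and bit 1 the bit below it):

```c
#include <stdint.h>

/* OR bit 1 (the bit below the MSB) into bit 0 (the MSB): the shift
   moves bit 1 up one place and the mask keeps only the MSB, matching
   the suggested rlwinm/or pair. */
uint32_t or_bit1_into_bit0(uint32_t x) {
    return x | ((x << 1) & 0x80000000u);
}
```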
531 blr