TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform (see the sketch after this list)
* implement powerpc-64 for darwin
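
A hedged sketch of the do-loop -> bdnz transform: for a counted loop whose
trip count is loop-invariant, e.g.:

   void f(int *A, int N) {
     int i;
     for (i = 0; i < N; ++i)   /* sketch assumes N > 0 */
       A[i] = 0;
   }

the induction variable could live in the count register (label names made up):

        mtctr r4           ; trip count -> CTR
        li r2, 0
LBB_f_1:
        stw r2, 0(r3)
        addi r3, r3, 4
        bdnz LBB_f_1       ; decrement CTR, branch while non-zero
        blr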

===-------------------------------------------------------------------------===

Support 'update' load/store instructions.  These are cracked on the G5, but are
still a codesize win.
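
As an illustration (assumed codegen, not current output), a loop like:

   void zero(int *P, int N) { while (N-- > 0) *P++ = 0; }

could fold the pointer increment into the store with the update form:

        stwu r2, 4(r3)     ; store to P+4 and update r3 in one instruction

replacing a separate stw/addi pair in the loop body.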

===-------------------------------------------------------------------------===

Teach the .td file to pattern match PPC::BR_COND to appropriate bc variant, so
we don't have to always run the branch selector for small functions.

===-------------------------------------------------------------------------===

* Codegen this:

   void test2(int X) {
     if (X == 0x12345678) bar();
   }

   as:

       xoris r0,r3,0x1234
       cmplwi cr0,r0,0x5678
       beq cr0,L6

   not:

       lis r2, 4660
       ori r2, r2, 22136
       cmpw cr0, r3, r2
       bne .LBB_test2_2

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.
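
Something like this (a sketch only, assuming the four constants are lumped at
offsets 0, 8, 16 and 24 from a single .CPI_X object):

_X:
        lis r2, ha16(.CPI_X)
        la r2, lo16(.CPI_X)(r2)
        lfd f0, 0(r2)
        lfd f2, 8(r2)
        fmadd f0, f1, f0, f2
        lfd f1, 16(r2)
        lfd f2, 24(r2)
        fmadd f1, f0, f1, f2
        blr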

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).
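
For illustration (hypothetical globals):

   static int a, b, c;                    /* three separate PIC accesses */

becomes:

   static struct { int a, b, c; } glob;   /* one PIC base, three offsets */

so the address of 'glob' is computed once and each field is reached at a small
constant offset from it.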

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implement divide as multiply by reciprocal when it has
more than one use.  Itanium will want this too.
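
A sketch of one refinement step for the reciprocal ('est' stands for the
result of a hardware estimate such as fres; the function name is made up):

   double refine(double d, double est) {
     /* one Newton-Raphson step: x1 = x0 * (2 - d * x0) */
     return est * (2.0 - d * est);
   }

Each step roughly doubles the number of correct bits, so a few steps take the
hardware estimate to full precision; a * (1/b) then beats fdiv whenever the
reciprocal is reused.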

===-------------------------------------------------------------------------===

#define ARRAY_LENGTH 16

union bitfield {
  struct {
#ifndef __ppc__
    unsigned int field0 : 6;
    unsigned int field1 : 6;
    unsigned int field2 : 6;
    unsigned int field3 : 6;
    unsigned int field4 : 3;
    unsigned int field5 : 4;
    unsigned int field6 : 1;
#else
    unsigned int field6 : 1;
    unsigned int field5 : 4;
    unsigned int field4 : 3;
    unsigned int field3 : 6;
    unsigned int field2 : 6;
    unsigned int field1 : 6;
    unsigned int field0 : 6;
#endif
  } bitfields, bits;
  unsigned int u32All;
  signed int i32All;
  float f32All;
};


typedef struct program_t {
  union bitfield array[ARRAY_LENGTH];
  int size;
  int loaded;
} program;


void AdjustBitfields(program* prog, unsigned int fmt1)
{
  prog->array[0].bitfields.field0 = fmt1;
  prog->array[0].bitfields.field1 = fmt1 + 1;
}

We currently generate:

_AdjustBitfields:
        lwz r2, 0(r3)
        addi r5, r4, 1
        rlwinm r2, r2, 0, 0, 19
        rlwinm r5, r5, 6, 20, 25
        rlwimi r2, r4, 0, 26, 31
        or r2, r2, r5
        stw r2, 0(r3)
        blr

We should teach the code generator that an or of (rlwimi, rlwinm) with
disjoint masks can be turned into rlwimi (rlwimi).

The better codegen would be:

_AdjustBitfields:
        lwz r0,0(r3)
        rlwinm r4,r4,0,0xff
        rlwimi r0,r4,0,26,31
        addi r4,r4,1
        rlwimi r0,r4,6,20,25
        stw r0,0(r3)
        blr

===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240        ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy.  We make this currently:

_f1:
        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31
        or r3, r2, r2
        blr

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction.  The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two addr instr.

===-------------------------------------------------------------------------===

Compile offsets from allocas:

int *%test() {
        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1
        ret int* %Y
}

into a single add, not two:

_test:
        addi r2, r1, -8
        addi r3, r2, 4
        blr

--> important for C++.

===-------------------------------------------------------------------------===

int test3(int a, int b) { return (a < 0) ? a : 0; }

should be branch free code.  LLVM is turning it into < 1 because of the RHS.
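
A branch-free form would be something like (assumed output, not what we
produce today):

_test3:
        srawi r2, r3, 31   ; r2 = a >> 31: all ones if a < 0, else zero
        and r3, r3, r2     ; (a < 0) ? a : 0  ==  a & (a >> 31)
        blr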

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }
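
Since the Darwin 32-bit ABI passes this struct in GPRs (r3-r6), the constants
could be materialized directly; a sketch (0x3ff0/0x4000 are the high halfwords
of the IEEE bit patterns of 1.0 and 2.0):

_bar:
        lis r3, 0x3ff0     ; high word of 1.0
        li r4, 0
        lis r5, 0x4000     ; high word of 2.0
        li r6, 0
        b _xxx             ; tail call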

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

Have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

  fp = &bar;
  for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

PowerPC i1/setcc stuff (depends on subreg stuff):

Check out the PPC code we get for 'compare' in this testcase:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19672

oof.  on top of not doing the logical crnand instead of (mfcr, mfcr,
invert, invert, or), we then have to compare it against zero instead of
using the value already in a CR!

that should be something like
        cmpw cr7, r8, r5
        cmpw cr0, r7, r3
        crnand cr0, cr0, cr7
        bne cr0, LBB_compare_4

instead of
        cmpw cr7, r8, r5
        cmpw cr0, r7, r3
        mfcr r7, 1
        mcrf cr7, cr0
        mfcr r8, 1
        rlwinm r7, r7, 30, 31, 31
        rlwinm r8, r8, 30, 31, 31
        xori r7, r7, 1
        xori r8, r8, 1
        addi r2, r2, 1
        or r7, r8, r7
        cmpwi cr0, r7, 0
        bne cr0, LBB_compare_4  ; loopexit

FreeBench/mason has a basic block that looks like this:

        %tmp.130 = seteq int %p.0__, 5          ; <bool> [#uses=1]
        %tmp.134 = seteq int %p.1__, 6          ; <bool> [#uses=1]
        %tmp.139 = seteq int %p.2__, 12         ; <bool> [#uses=1]
        %tmp.144 = seteq int %p.3__, 13         ; <bool> [#uses=1]
        %tmp.149 = seteq int %p.4__, 14         ; <bool> [#uses=1]
        %tmp.154 = seteq int %p.5__, 15         ; <bool> [#uses=1]
        %bothcond = and bool %tmp.134, %tmp.130         ; <bool> [#uses=1]
        %bothcond123 = and bool %bothcond, %tmp.139     ; <bool>
        %bothcond124 = and bool %bothcond123, %tmp.144  ; <bool>
        %bothcond125 = and bool %bothcond124, %tmp.149  ; <bool>
        %bothcond126 = and bool %bothcond125, %tmp.154  ; <bool>
        br bool %bothcond126, label %shortcirc_next.5, label %else.0

This is a particularly important case where handling CRs better will help.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

the Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Generate lwbrx and other byteswapping load/store instructions when reasonable.
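
For example (illustrative only), a byteswapped load written as shifts:

   unsigned load_le32(unsigned char *p) {
     return p[0] | (p[1] << 8) | (p[2] << 16) | ((unsigned)p[3] << 24);
   }

should become a single lwbrx (load word byte-reverse indexed) instead of four
loads plus shifts and ors.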

===-------------------------------------------------------------------------===

Implement TargetConstantVec, and set up PPC to custom lower ConstantVec into
TargetConstantVec's if it's one of the many forms that are algorithmically
computable using the spiffy altivec instructions.

===-------------------------------------------------------------------------===

Compile this:

int foo(int a) {
  int b = (a < 8);
  if (b) {
    return b * 3;     // ignore the fact that this is always 3.
  } else {
    return 2;
  }
}

into something not this:

_foo:
1)      cmpwi cr7, r3, 8
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
1)      cmpwi cr0, r3, 7
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
LBB1_1: ; then
        rlwinm r2, r2, 0, 31, 31
        mulli r3, r2, 3
        blr
LBB1_2: ; UnifiedReturnBlock
        li r3, 2
        blr

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists.  In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

The legalizer should lower this:

bool %test(ulong %x) {
  %tmp = setlt ulong %x, 4294967296
  ret bool %tmp
}

into "if x.high == 0", not:

_test:
        addi r2, r3, -1
        cntlzw r2, r2
        cntlzw r3, r3
        srwi r2, r2, 5
        srwi r4, r3, 5
        li r3, 0
        cmpwi cr0, r2, 0
        bne cr0, LBB1_2 ;
LBB1_1:
        or r3, r4, r4
LBB1_2:
        blr

noticed in 2005-05-11-Popcount-ffs-fls.c.
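
With the high word of %x in r3, a branch-free form might be (a sketch):

_test:
        cntlzw r2, r3      ; 32 iff x.high == 0
        srwi r3, r2, 5     ; 32 >> 5 == 1; any smaller count gives 0
        blr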

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it.  That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops.  A simple example:

int foo(int a, int b) { return (a < b) << 4; }

compiles into:

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
        slwi r3, r2, 4
        blr
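
If the crbit access were exposed, the shift could be folded into the bit
extract, e.g. (a sketch; the rotate amount assumes the same CR layout as the
mfcr above):

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r3, r2, 1, 27, 27   ; pull the LT bit straight into bit 4
        blr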

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

So that

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

Becomes

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {
  signed char t = 0;
  if (b)  t = *a;
  if (c)  *a = t;
}

===-------------------------------------------------------------------------===

This:
int test(unsigned *P) { return *P >> 24; }

Should compile to:

_test:
        lbz r3,0(r3)
        blr

not:

_test:
        lwz r2, 0(r3)
        srwi r3, r2, 24
        blr

===-------------------------------------------------------------------------===

On the G5, logical CR operations are more expensive in their three
address form: ops that read/write the same register are half as expensive as
those that read from two registers that are different from their destination.

We should model this with two separate instructions.  The isel should generate
the "two address" form of the instructions.  When the register allocator
detects that it needs to insert a copy due to the two-addressness of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress.  At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).

===-------------------------------------------------------------------------===

Implement PPCInstrInfo::isLoadFromStackSlot/isStoreToStackSlot for vector
registers, to generate better spill code.

===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
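
A sketch of the strength-reduced form (hypothetical transformed source; same
behavior assuming X > 0):

int foo2(int N, int ***W, int **TK, int X) {
  int t, i, q = 0, r = 0;            /* q == t / X, r == t % X */

  for (t = 0; t < N; ++t) {
    for (i = 0; i < 4; ++i)
      W[q][i][r] = TK[i][t];
    if (++r == X) { r = 0; ++q; }    /* maintain q and r incrementally */
  }

  return 5;
}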

===-------------------------------------------------------------------------===

Altivec support.  The first should be a single lvx from the constant pool, the
second should be a xor/stvx:

void foo(void) {
  int x[8] __attribute__((aligned(128))) = { 1, 1, 1, 1, 1, 1, 1, 1 };
  bar (x);
}

#include <string.h>
void foo(void) {
  int x[8] __attribute__((aligned(128)));
  memset (x, 0, sizeof (x));
  bar (x);
}

===-------------------------------------------------------------------------===

Altivec: Codegen'ing MUL with vector FMADD should add -0.0, not 0.0:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=8763

We need to codegen -0.0 vector efficiently (no constant pool load).

When -ffast-math is on, we can use 0.0.
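
When it is off, one known two-instruction trick for the -0.0 splat (an
assumption that isel could use it, not something we emit today):

        vspltisw v0, -1    ; each word = 0xffffffff
        vslw v0, v0, v0    ; shift each word left by 31 -> 0x80000000 = -0.0

no constant pool load needed.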

===-------------------------------------------------------------------------===

float foo(float X) { return (int)(X); }

Currently produces:

_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        extsw r2, r2
        std r2, -16(r1)
        lfd f0, -16(r1)
        fcfid f0, f0
        frsp f1, f0
        blr

We could use a target dag combine to turn the lwz/extsw into an lwa when the
lwz has a single use.  Since LWA is cracked anyway, this would be a codesize
win only.
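
With that combine, the middle of the sequence would look something like
(a sketch):

        stfd f0, -8(r1)
        lwa r2, -4(r1)     ; sign-extending load replaces lwz + extsw
        std r2, -16(r1)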

===-------------------------------------------------------------------------===

Consider this:
  v4f32 Vector;
  v4f32 Vector2 = { Vector.X, Vector.X, Vector.X, Vector.X };

Since we know that "Vector" is 16-byte aligned and we know the element offset
of ".X", we should change the load into a lve*x instruction, instead of doing
a load/store/lve*x sequence.

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}

===-------------------------------------------------------------------------===

There is a wide range of vector constants we can generate with combinations of
altivec instructions.  For example, GCC does: t=vsplti*, r = t+t.
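
For example, a splat of 24 in each word is outside vspltisw's -16..15
immediate range, but (a sketch):

        vspltisw v0, 12
        vadduwm v0, v0, v0   ; 12 + 12 = 24 in each word

covers it with two cheap altivec instructions.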

===-------------------------------------------------------------------------===
603