TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform
* implement powerpc-64 for darwin
* use stfiwx in float->int

* Fold add and sub with constant into non-extern, non-weak addresses so this:
        lis r2, ha16(l2__ZTV4Cell)
        la r2, lo16(l2__ZTV4Cell)(r2)
        addi r2, r2, 8
becomes:
        lis r2, ha16(l2__ZTV4Cell+8)
        la r2, lo16(l2__ZTV4Cell+8)(r2)


* Teach LLVM how to codegen this:
unsigned short foo(float a) { return a; }
as:
_foo:
        fctiwz f0,f1
        stfd f0,-8(r1)
        lhz r3,-2(r1)
        blr
not:
_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        rlwinm r3, r2, 0, 16, 31
        blr

* Support 'update' load/store instructions.  These are cracked on the G5, but
  are still a codesize win.

* We should hint to the branch select pass that it doesn't need to print the
  second unconditional branch, so we don't end up with things like:
        b .LBBl42__2E_expand_function_8_674     ; loopentry.24
        b .LBBl42__2E_expand_function_8_42      ; NewDefault
        b .LBBl42__2E_expand_function_8_42      ; NewDefault

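For the 'unsigned short foo(float a)' item in the list above, here is a small
C model of why the single lhz is enough. It is only a sketch: the 8-byte array
stands in for the big-endian stack slot written by stfd, and all function
names are made up for illustration. The halfword at offset 6 of the slot is
exactly the low 16 bits of the word at offset 4, so lwz followed by rlwinm
computes nothing that lhz does not.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the slot written by `stfd f0,-8(r1)` after fctiwz: the 32-bit
 * integer result occupies bytes 4..7 in big-endian order. Bytes 0..3 are
 * undefined on real hardware; this model simply zeroes them. */
static void store_fctiwz_slot(uint8_t slot[8], int32_t value) {
    uint32_t u = (uint32_t)value;
    for (int i = 0; i < 4; i++) slot[i] = 0;
    slot[4] = (uint8_t)(u >> 24);
    slot[5] = (uint8_t)(u >> 16);
    slot[6] = (uint8_t)(u >> 8);
    slot[7] = (uint8_t)u;
}

/* lwz r2,-4(r1) ; rlwinm r3,r2,0,16,31 */
static uint32_t load_word_then_mask(const uint8_t slot[8]) {
    uint32_t w = ((uint32_t)slot[4] << 24) | ((uint32_t)slot[5] << 16) |
                 ((uint32_t)slot[6] << 8) | (uint32_t)slot[7];
    return w & 0xFFFFu;
}

/* lhz r3,-2(r1) */
static uint32_t load_halfword(const uint8_t slot[8]) {
    return ((uint32_t)slot[6] << 8) | (uint32_t)slot[7];
}

/* Usage: both sequences agree on the value returned in r3. */
static void demo_halfword_load(void) {
    uint8_t slot[8];
    store_fctiwz_slot(slot, 70000);
    assert(load_halfword(slot) == load_word_then_mask(slot));
}
```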
===-------------------------------------------------------------------------===

* Codegen this:

        void test2(int X) {
          if (X == 0x12345678) bar();
        }

    as:

        xoris r0,r3,0x1234
        cmpwi cr0,r0,0x5678
        beq cr0,L6

    not:

        lis r2, 4660
        ori r2, r2, 22136
        cmpw cr0, r3, r2
        bne .LBB_test2_2

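A sketch in C of why the two-instruction form is equivalent, assuming the
usual semantics of xoris (XOR with the immediate shifted left by 16). The
function names are hypothetical. XORing away the high half leaves a value that
equals the low half only when the whole word matched. Note that cmpwi takes a
signed 16-bit immediate, so the sequence as written fits constants whose low
half is below 0x8000; other constants would presumably need an unsigned
compare instead.

```c
#include <assert.h>
#include <stdint.h>

/* xoris r0,r3,0x1234 ; cmpwi cr0,r0,0x5678 */
static int equals_via_xoris(uint32_t x) {
    uint32_t t = x ^ (0x1234u << 16);   /* clears the high half on a match */
    return t == 0x5678u;                /* compare what is left to the low half */
}

/* lis r2,4660 ; ori r2,r2,22136 ; cmpw cr0,r3,r2 */
static int equals_via_materialized_constant(uint32_t x) {
    uint32_t c = (4660u << 16) | 22136u;    /* 0x12345678 */
    return x == c;
}

/* Usage: the two forms agree on matching and non-matching inputs. */
static void demo_xoris_compare(void) {
    assert(equals_via_xoris(0x12345678u) == 1);
    assert(equals_via_xoris(0x00005678u) == 0);
    assert(equals_via_xoris(0x12340000u) ==
           equals_via_materialized_constant(0x12340000u));
}
```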
===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.

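At the source level the idea looks roughly like the following sketch. The
array X_pool and the function names are hypothetical stand-ins for a single
per-function pool object; the point is only that every constant is reached as
a fixed offset from one base address, so the address needs to be formed once.

```c
#include <assert.h>

/* Hypothetical per-function pool: one object holding every FP constant of X. */
static const double X_pool[4] = { 1.23, 4.512, 2.34, 14.38 };

/* All constants are addressed as base + constant offset, the analogue of
 * materializing the pool address once and using `lfd fN, off(rBase)`. */
static double X_pooled(double Y) {
    const double *base = X_pool;    /* single address computation */
    return (Y * base[0] + base[1]) * base[2] + base[3];
}

/* The function as written in the text above, for comparison. */
static double X_reference(double Y) {
    return (Y * 1.23 + 4.512) * 2.34 + 14.38;
}

/* Usage: both versions compute the same value (compared with a tolerance). */
static void demo_constant_pool(void) {
    double diff = X_pooled(2.0) - X_reference(2.0);
    assert(diff < 1e-9 && diff > -1e-9);
}
```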
===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for refining the results of the estimate
instructions to the correct accuracy, and implement divide as multiply by
reciprocal when the divisor has more than one use.  Itanium will want this too.

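A minimal sketch of the refinement step in C. The `estimate` argument stands
in for whatever a hardware estimate instruction (such as fres) would return;
the function names and the step count are assumptions for illustration, not a
statement of what the backend should emit. Each step x' = x * (2 - d * x)
roughly squares the relative error of the reciprocal.

```c
#include <assert.h>

/* One Newton-Raphson step toward 1/d. */
static double refine_reciprocal(double d, double x) {
    return x * (2.0 - d * x);
}

/* Divide n by d by multiplying n with a refined reciprocal of d. When d is
 * used by several divides, the refined reciprocal can be shared. */
static double divide_via_reciprocal(double n, double d, double estimate,
                                    int steps) {
    double x = estimate;
    for (int i = 0; i < steps; i++)
        x = refine_reciprocal(d, x);
    return n * x;
}

/* Usage: a rough estimate of 1/4 (0.24) converges in a few steps. */
static void demo_newton_raphson(void) {
    double q = divide_via_reciprocal(10.0, 4.0, 0.24, 4);
    double err = q - 2.5;
    assert(err < 1e-12 && err > -1e-12);
}
```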
===-------------------------------------------------------------------------===

int foo(int a, int b) { return a == b ? 16 : 0; }
_foo:
        cmpw cr7, r3, r4
        mfcr r2
        rlwinm r2, r2, 31, 31, 31
        slwi r3, r2, 4
        blr

If we exposed the srl & mask ops after the MFCR that we are doing to select
the correct CR bit, then we could fold the slwi into the rlwinm before it.

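The fold can be checked with a small software model of rlwinm. This is a
sketch that assumes the standard definition (rotate left, then AND with a mask
of ones from bit MB through bit ME, bit 0 being the most significant) and only
models the non-wrapping mask case. Under that model, the rlwinm/slwi pair is
the same as a single rlwinm with rotate amount 3 and mask 27..27; that operand
choice is derived here, not taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of rlwinm rA,rS,SH,MB,ME for MB <= ME. */
static uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me) {
    assert(sh < 32 && mb <= me && me < 32);
    uint32_t rot = sh ? (rs << sh) | (rs >> (32 - sh)) : rs;
    uint32_t mask = (0xFFFFFFFFu >> mb) & (0xFFFFFFFFu << (31 - me));
    return rot & mask;
}

/* Current code: rlwinm r2,r2,31,31,31 ; slwi r3,r2,4 */
static uint32_t select_then_shift(uint32_t cr) {
    return rlwinm(cr, 31, 31, 31) << 4;
}

/* Folded form: one rotate-and-mask that drops the CR7 EQ bit at bit 27. */
static uint32_t select_folded(uint32_t cr) {
    return rlwinm(cr, 3, 27, 27);
}

/* Usage: with only the CR7 EQ bit set (value 0x2), both forms yield 16. */
static void demo_rlwinm_fold(void) {
    assert(select_then_shift(0x2u) == 16u);
    assert(select_folded(0x2u) == 16u);
}
```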
===-------------------------------------------------------------------------===

#define ARRAY_LENGTH 16

union bitfield {
        struct {
#ifndef __ppc__
                unsigned int field0 : 6;
                unsigned int field1 : 6;
                unsigned int field2 : 6;
                unsigned int field3 : 6;
                unsigned int field4 : 3;
                unsigned int field5 : 4;
                unsigned int field6 : 1;
#else
                unsigned int field6 : 1;
                unsigned int field5 : 4;
                unsigned int field4 : 3;
                unsigned int field3 : 6;
                unsigned int field2 : 6;
                unsigned int field1 : 6;
                unsigned int field0 : 6;
#endif
        } bitfields, bits;
        unsigned int u32All;
        signed int i32All;
        float f32All;
};


typedef struct program_t {
        union bitfield array[ARRAY_LENGTH];
        int size;
        int loaded;
} program;


void AdjustBitfields(program* prog, unsigned int fmt1)
{
        unsigned int shift = 0;
        unsigned int texCount = 0;
        unsigned int i;

        for (i = 0; i < 8; i++)
        {
                prog->array[i].bitfields.field0 = texCount;
                prog->array[i].bitfields.field1 = texCount + 1;
                prog->array[i].bitfields.field2 = texCount + 2;
                prog->array[i].bitfields.field3 = texCount + 3;

                texCount += (fmt1 >> shift) & 0x7;
                shift += 3;
        }
}

In the loop above, the bitfield adds get generated as
(add (shl bitfield, C1), (shl C2, C1)) where C2 is 1, 2 or 3.

Since the input to the (or and, and) is an (add) rather than a (shl), the shift
doesn't get folded into the rlwimi instruction.  We should ideally see through
things like this, rather than forcing llvm to generate the equivalent

(shl (add bitfield, C2), C1) with some kind of mask.

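The two DAG shapes named above compute the same 32-bit value, because a left
shift distributes over addition in modular arithmetic. A short C sketch of
that identity, with hypothetical function names and x, c2, c1 standing for
bitfield, C2 and C1:

```c
#include <assert.h>
#include <stdint.h>

/* Shape that gets generated: (add (shl x, C1), (shl C2, C1)). */
static uint32_t add_of_shifts(uint32_t x, uint32_t c2, unsigned c1) {
    assert(c1 < 32);
    return (x << c1) + (c2 << c1);
}

/* Equivalent shape with the shift outermost: (shl (add x, C2), C1). */
static uint32_t shift_of_add(uint32_t x, uint32_t c2, unsigned c1) {
    assert(c1 < 32);
    return (x + c2) << c1;
}

/* Usage: a 6-bit-aligned field offset (C1 = 6) with C2 = 1, 2 and 3. */
static void demo_shift_add(void) {
    for (uint32_t c2 = 1; c2 <= 3; c2++)
        assert(add_of_shifts(41u, c2, 6) == shift_of_add(41u, c2, 6));
}
```

Since the results are identical, a matcher that recognizes the first shape
could treat it as a shifted value when forming the rlwimi.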
===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240        ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy.  We currently generate:

_f1:
        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31
        or r3, r2, r2
        blr

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction.  The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two addr instr.

===-------------------------------------------------------------------------===

Compile offsets from allocas:

int *%test() {
        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1
        ret int* %Y
}

into a single add, not two:

_test:
        addi r2, r1, -8
        addi r3, r2, 4
        blr

--> important for C++.

===-------------------------------------------------------------------------===

int test3(int a, int b) { return (a < 0) ? a : 0; }

should be branch free code.  LLVM is turning it into < 1 because of the RHS.

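One branch-free formulation, sketched in C: build a mask that is all ones when
a is negative and zero otherwise, then AND it with a. The function name is
hypothetical, and this is one possible lowering rather than the one the code
generator is required to pick. On PowerPC the mask would presumably come from
an arithmetic shift right by 31.

```c
#include <assert.h>
#include <stdint.h>

/* (a < 0) ? a : 0 without a branch. The sign bit is extracted with an
 * unsigned shift so the C code avoids shifting a negative signed value. */
static int32_t test3_branch_free(int32_t a) {
    int32_t sign = (int32_t)((uint32_t)a >> 31);   /* 1 if a < 0, else 0 */
    int32_t mask = -sign;                          /* -1 (all ones) or 0 */
    return a & mask;
}

/* Usage: negative inputs pass through, everything else becomes 0. */
static void demo_branch_free(void) {
    assert(test3_branch_free(-5) == -5);
    assert(test3_branch_free(7) == 0);
}
```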
===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

For this:

int h(int i, int j, int k) {
 return (i==0||j==0||k == 0);
}

We currently emit this:

_h:
        cntlzw r2, r3
        cntlzw r3, r4
        cntlzw r4, r5
        srwi r2, r2, 5
        srwi r3, r3, 5
        srwi r4, r4, 5
        or r2, r3, r2
        or r3, r2, r4
        blr

The ctlz/shift instructions are created by the isel, so the dag combiner doesn't
have a chance to pull the shifts through the or's (eliminating two
instructions).  SETCC nodes should be custom lowered in this case, not expanded
by the isel.

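The rewrite relies on a right shift distributing over OR. A C sketch, using a
portable model of cntlzw that returns 32 for a zero input (function names
other than cntlzw are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of cntlzw: count leading zeros, 32 for an input of 0. */
static uint32_t cntlzw(uint32_t x) {
    uint32_t n = 0;
    while (n < 32 && !(x & (0x80000000u >> n)))
        n++;
    return n;
}

/* Current code: three shifts and two ORs. */
static uint32_t h_current(uint32_t i, uint32_t j, uint32_t k) {
    return (cntlzw(i) >> 5) | (cntlzw(j) >> 5) | (cntlzw(k) >> 5);
}

/* Shifts pulled through the ORs: two ORs and one shift. */
static uint32_t h_combined(uint32_t i, uint32_t j, uint32_t k) {
    return (cntlzw(i) | cntlzw(j) | cntlzw(k)) >> 5;
}

/* Usage: the result is 1 exactly when at least one input is zero. */
static void demo_clz_or(void) {
    assert(h_combined(5u, 0u, 7u) == 1u);
    assert(h_combined(5u, 6u, 7u) == 0u);
}
```

Only a count of 32 has bit 5 set, so ORing the counts before shifting keeps
the "any input is zero" information intact.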
===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

     fp = &bar;
     for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

PowerPC i1/setcc stuff (depends on subreg stuff):

Check out the PPC code we get for 'compare' in this testcase:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19672

Oof.  On top of not doing the logical crnand instead of (mfcr, mfcr,
invert, invert, or), we then have to compare it against zero instead of
using the value already in a CR!

That should be something like
        cmpw cr7, r8, r5
        cmpw cr0, r7, r3
        crnand cr0, cr0, cr7
        bne cr0, LBB_compare_4

instead of
        cmpw cr7, r8, r5
        cmpw cr0, r7, r3
        mfcr r7, 1
        mcrf cr7, cr0
        mfcr r8, 1
        rlwinm r7, r7, 30, 31, 31
        rlwinm r8, r8, 30, 31, 31
        xori r7, r7, 1
        xori r8, r8, 1
        addi r2, r2, 1
        or r7, r8, r7
        cmpwi cr0, r7, 0
        bne cr0, LBB_compare_4  ; loopexit

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

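A toy model of the rule as described above, to make the effect of reordering
concrete. This is a deliberate simplification and not the full Darwin ABI: it
only knows ints (4 bytes) and doubles (8 bytes), ignores the limit on FP
registers, and every name in it is hypothetical. For foo itself both ints
still land in GPRs either way (r4 and r5 are merely wasted); the loss shows up
once the doubles push an int past the first 32 bytes.

```c
#include <assert.h>

/* `kinds` is a string of 'i' (int) and 'd' (double), one per argument.
 * regs[n] receives the GPR number for an int, the FPR number for a double,
 * or -1 for an int that falls outside the first 32 bytes (passed in memory).
 * Returns how many ints were assigned to GPRs. */
static int assign_registers(const char *kinds, int regs[]) {
    int offset = 0, next_fpr = 1, ints_in_gprs = 0;
    for (int n = 0; kinds[n] != '\0'; n++) {
        if (kinds[n] == 'd') {
            regs[n] = next_fpr++;   /* f1, f2, ... */
            offset += 8;            /* still eats two GPR slots of space */
        } else {
            assert(kinds[n] == 'i');
            if (offset < 32) {
                regs[n] = 3 + offset / 4;   /* r3 .. r10 */
                ints_in_gprs++;
            } else {
                regs[n] = -1;
            }
            offset += 4;
        }
    }
    return ints_in_gprs;
}

/* Usage: foo(int, double, int) gets r3, f1, r6 as stated in the text. */
static void demo_argument_registers(void) {
    int regs[3];
    assign_registers("idi", regs);
    assert(regs[0] == 3 && regs[1] == 1 && regs[2] == 6);
}
```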
===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BIT_CONVERT to LLVM, then do an interprocedural
transformation that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

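Written out by hand in C, the transformation would look something like this
sketch. The globals and their names are invented for illustration; the point
is that after merging, one base address serves all three accesses.

```c
#include <assert.h>

/* Before: three separate scalar globals, each with its own address
 * computation (and its own GOT entry on targets that use one). */
static int counter_a = 1;
static int counter_b = 2;
static int counter_c = 3;

static int sum_separate(void) {
    return counter_a + counter_b + counter_c;
}

/* After: the same scalars merged into one struct, so the base address can
 * be formed once and each field reached by a constant offset. */
static struct { int a, b, c; } merged_globals = { 1, 2, 3 };

static int sum_merged(void) {
    return merged_globals.a + merged_globals.b + merged_globals.c;
}

/* Usage: behavior is unchanged by the merge. */
static void demo_merged_globals(void) {
    assert(sum_separate() == sum_merged());
}
```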
===-------------------------------------------------------------------------===

Generate lwbrx and other byteswapping load/store instructions when reasonable.
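As an illustration of the kind of source pattern involved, here is a C sketch
of a little-endian 32-bit load written two ways. On a big-endian target both
amount to a byte-swapping load, which is what a single lwbrx provides. The
function names are hypothetical, and whether a given pattern counts as
"reasonable" to match is left open.

```c
#include <assert.h>
#include <stdint.h>

/* Little-endian load assembled one byte at a time. */
static uint32_t load_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Big-endian (native PowerPC order) load, modeled portably. */
static uint32_t load_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* 32-bit byte swap. */
static uint32_t bswap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Usage: bswap(load) and the byte-wise little-endian load agree. */
static void demo_byteswapped_load(void) {
    const uint8_t bytes[4] = { 0x78, 0x56, 0x34, 0x12 };
    assert(load_le32(bytes) == bswap32(load_be32(bytes)));
}
```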
345