//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend.
//===---------------------------------------------------------------------===//

Reimplement 'select' in terms of 'SEL'.

* We would really like to support UXTAB16, but we need to prove that the
  add cannot overflow between the two 16-bit chunks (see the sketch after
  this list).

* Implement predication support.
* Implement pre/post increment support. (e.g. PR935)
* Coalesce stack slots!
* Implement smarter constant generation for binops with large immediates.

* Consider materializing FP constants like 0.0f and 1.0f using integer
  immediate instructions, then copying to the FPU. Would that be slower than
  loading into the FPU?

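For reference, here is a rough C model of what UXTAB16 computes (my reading of
the ARM ARM; treat the exact semantics as an assumption). The proof obligation
for the item above is that neither 16-bit add carries out of its lane, since
the real instruction performs the two adds independently:

#include <stdint.h>

/* uxtab16 rd, rn, rm:
   low halfword  = rn[15:0]  + zero_extend(rm[7:0])
   high halfword = rn[31:16] + zero_extend(rm[23:16])
   with no carry propagating between the two halfwords. */
uint32_t uxtab16_model(uint32_t rn, uint32_t rm) {
    uint32_t lo = (rn + (rm & 0xFFu)) & 0xFFFFu;
    uint32_t hi = ((rn >> 16) + ((rm >> 16) & 0xFFu)) & 0xFFFFu;
    return lo | (hi << 16);
}
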
//===---------------------------------------------------------------------===//

The constant island pass is in good shape.  Some cleanups might be desirable,
but there is unlikely to be much improvement in the generated code.

1.  There may be some advantage to trying to be smarter about the initial
placement, rather than putting everything at the end.

2.  The handling of 2-byte padding for Thumb is overly conservative.  There
would be a small gain to keeping accurate track of the padding (which would
require aligning functions containing constant pools to 4-byte boundaries).

3.  There might be some compile-time efficiency to be had by representing
consecutive islands as a single block rather than multiple blocks.

4.  Use a priority queue to sort constant pool users in inverse order of
    position, so we always process the one closest to the end of the function
    first. This may simplify CreateNewWater.

//===---------------------------------------------------------------------===//

We need to start generating predicated instructions.  The .td files have a way
to express this now (see the PPC conditional return instruction), but the
branch folding pass (or a new if-cvt pass) should start producing these, at
least in the trivial case.

Among the obvious wins, doing so can eliminate the need to custom expand
copysign (i.e. we won't need to custom expand it to get the conditional
negate).

This allows us to eliminate one instruction from:

define i32 @_Z6slow4bii(i32 %x, i32 %y) {
        %tmp = icmp sgt i32 %x, %y
        %retval = select i1 %tmp, i32 %x, i32 %y
        ret i32 %retval
}

__Z6slow4bii:
        cmp r0, r1
        movgt r1, r0
        mov r0, r1
        bx lr

//===---------------------------------------------------------------------===//

Implement long long "X-3" with instructions that fold the immediate in.  These
were disabled due to badness with the ARM carry flag on subtracts.

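A minimal test case for this (the two-instruction expansion named in the
comment is my assumption of what the folded form would look like):

/* With immediate folding, the i64 subtract below should become something
   like "subs r0, r0, #3; sbc r1, r1, #0" instead of first materializing
   the constant 3 into a register pair. */
long long sub3(long long x) {
    return x - 3;
}
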
//===---------------------------------------------------------------------===//

We currently compile abs:
int foo(int p) { return p < 0 ? -p : p; }

into:

_foo:
        rsb r1, r0, #0
        cmn r0, #1
        movgt r1, r0
        mov r0, r1
        bx lr

This is very, uh, literal.  This could be a 3 operation sequence:
  t = (p sra 31);
  res = (p xor t)-t

Which would be better.  This occurs in png decode.

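In C, the suggested sequence is the usual sign-mask trick (a sketch, assuming
the compiler uses an arithmetic shift for the signed right shift):

/* Branchless abs: t is all-ones when p is negative and zero otherwise,
   so (p ^ t) - t negates p exactly when it is negative.  Three ARM ops
   (asr, eor, sub) instead of rsb/cmn/movgt/mov. */
int abs_branchless(int p) {
    int t = p >> 31;          /* arithmetic shift assumed */
    return (p ^ t) - t;
}
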
//===---------------------------------------------------------------------===//

More load / store optimizations:
1) Look past instructions without side-effects (not load, store, branch, etc.)
   when forming the list of loads / stores to optimize.

2) Smarter register allocation?
We are probably missing some opportunities to use ldm / stm. Consider:

ldr r5, [r0]
ldr r4, [r0, #4]

This cannot be merged into an ldm. Perhaps we will need to do the transformation
before register allocation. Then teach the register allocator to allocate a
chunk of consecutive registers.

3) Better representation for block transfer? This is from Olden/power:

        fldd d0, [r4]
        fstd d0, [r4, #+32]
        fldd d0, [r4, #+8]
        fstd d0, [r4, #+40]
        fldd d0, [r4, #+16]
        fstd d0, [r4, #+48]
        fldd d0, [r4, #+24]
        fstd d0, [r4, #+56]

If we can spare the registers, it would be better to use fldm and fstm here.
Need major register allocator enhancement though.

4) Can we recognize the relative position of constantpool entries? i.e. Treat

        ldr r0, LCPI17_3
        ldr r1, LCPI17_4
        ldr r2, LCPI17_5

   as
        ldr r0, LCPI17
        ldr r1, LCPI17+4
        ldr r2, LCPI17+8

   Then the ldr's can be combined into a single ldm. See Olden/power.

Note that for ARM v4 gcc uses ldmia to load a pair of 32-bit values to
represent a double 64-bit FP constant:

        adr r0, L6
        ldmia r0, {r0-r1}

        .align 2
L6:
        .long   -858993459
        .long   1074318540

5) Can we make use of ldrd and strd? Instead of generating ldm / stm, use
ldrd/strd instead if there are only two destination registers that form an
odd/even pair. However, we probably would pay a penalty if the address is not
aligned on an 8-byte boundary. This requires more information on load / store
nodes (and MI's?) than we currently carry.

6) struct copies appear to be done field by field
instead of by words, at least sometimes:

struct foo { int x; short s; char c1; char c2; };
void cpy(struct foo*a, struct foo*b) { *a = *b; }

llvm code (-O2)
        ldrb r3, [r1, #+6]
        ldr r2, [r1]
        ldrb r12, [r1, #+7]
        ldrh r1, [r1, #+4]
        str r2, [r0]
        strh r1, [r0, #+4]
        strb r3, [r0, #+6]
        strb r12, [r0, #+7]
gcc code (-O2)
        ldmia r1, {r1-r2}
        stmia r0, {r1-r2}

In this benchmark poor handling of aggregate copies has shown up as
having a large effect on size, and possibly speed as well (we don't have
a good way to measure on ARM).

//===---------------------------------------------------------------------===//

* Consider this silly example:

double bar(double x) {
  double r = foo(3.1);
  return x+r;
}

_bar:
        sub sp, sp, #16
        str r4, [sp, #+12]
        str r5, [sp, #+8]
        str lr, [sp, #+4]
        mov r4, r0
        mov r5, r1
        ldr r0, LCPI2_0
        bl _foo
        fmsr f0, r0
        fcvtsd d0, f0
        fmdrr d1, r4, r5
        faddd d0, d0, d1
        fmrrd r0, r1, d0
        ldr lr, [sp, #+4]
        ldr r5, [sp, #+8]
        ldr r4, [sp, #+12]
        add sp, sp, #16
        bx lr

Ignore the prologue and epilogue stuff for a second. Note
        mov r4, r0
        mov r5, r1
the copies to callee-save registers and the fact that they are only used by the
fmdrr instruction. It would have been better had the fmdrr been scheduled
before the call, placing the result in a callee-save DPR register. The two
mov ops would not have been necessary.

//===---------------------------------------------------------------------===//

Calling convention related stuff:

* gcc's parameter passing implementation is terrible and we suffer as a result:

e.g.
struct s {
        double d1;
        int s1;
};

void foo(struct s S) {
        printf("%g, %d\n", S.d1, S.s1);
}

'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and
then reloads them to r1, r2, and r3 before issuing the call (r0 contains the
address of the format string):

        stmfd   sp!, {r7, lr}
        add     r7, sp, #0
        sub     sp, sp, #12
        stmia   sp, {r0, r1, r2}
        ldmia   sp, {r1-r2}
        ldr     r0, L5
        ldr     r3, [sp, #8]
L2:
        add     r0, pc, r0
        bl      L_printf$stub

Instead of an stmia, an ldmia, and an ldr, wouldn't it be better to do three
moves?

* Returning an aggregate type is even worse:

e.g.
struct s foo(void) {
  struct s S = {1.1, 2};
  return S;
}

        mov     ip, r0
        ldr     r0, L5
        sub     sp, sp, #12
L2:
        add     r0, pc, r0
        @ lr needed for prologue
        ldmia   r0, {r0, r1, r2}
        stmia   sp, {r0, r1, r2}
        stmia   ip, {r0, r1, r2}
        mov     r0, ip
        add     sp, sp, #12
        bx      lr

r0 (and later ip) is the hidden parameter from the caller, holding the address
in which to store the return value. The first ldmia loads the constants into
r0, r1, r2. The last stmia stores r0, r1, r2 into the address passed in.
However, there is one additional stmia that stores r0, r1, and r2 to some
stack location. That store is dead.

The llvm-gcc generated code looks like this:

csretcc void %foo(%struct.s* %agg.result) {
entry:
        %S = alloca %struct.s, align 4          ; <%struct.s*> [#uses=1]
        %memtmp = alloca %struct.s              ; <%struct.s*> [#uses=1]
        cast %struct.s* %S to sbyte*            ; <sbyte*>:0 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
        cast %struct.s* %agg.result to sbyte*   ; <sbyte*>:1 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
        cast %struct.s* %memtmp to sbyte*       ; <sbyte*>:2 [#uses=1]
        call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
        ret void
}

llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from the
constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated
into a number of loads and stores, or 2) custom lower memcpy (of small size) to
be ldmia / stmia. I think option 2 is better, but the current register
allocator cannot allocate a chunk of registers at a time.

A feasible temporary solution is to use specific physical registers at
lowering time for small (<= 4 words?) transfer sizes.

* ARM CSRet calling convention requires the hidden argument to be returned by
the callee.

//===---------------------------------------------------------------------===//

We can definitely do a better job on BB placements to eliminate some branches.
It's very common to see llvm generated assembly code that looks like this:

LBB3:
 ...
LBB4:
...
 beq LBB3
 b LBB2

If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can
then eliminate the beq and turn the unconditional branch to LBB2 into a bne.

See McCat/18-imp/ComputeBoundingBoxes for an example.

//===---------------------------------------------------------------------===//

Register scavenging is now implemented.  The example in the previous version
of this document produces optimal code at -O2.

//===---------------------------------------------------------------------===//

Pre-/post- indexed load / stores:

1) We should not make the pre/post- indexed load/store transform if the base
ptr is guaranteed to be live beyond the load/store. This can happen if the
base ptr is live out of the block in which we are performing the optimization,
e.g.

mov r1, r2
ldr r3, [r1], #4
...

vs.

ldr r3, [r2]
add r1, r2, #4
...

In most cases, this is just a wasted optimization. However, sometimes it can
negatively impact the performance because two-address code is more restrictive
when it comes to scheduling.

Unfortunately, liveout information is currently unavailable during DAG combine
time.

2) Consider splitting an indexed load / store into an add/sub plus a
   load/store to solve #1 (in TwoAddressInstructionPass.cpp).

3) Enhance LSR to generate more opportunities for indexed ops.

4) Once we add support for multiple result patterns, write indexed load
   patterns instead of C++ instruction selection code.

5) Use FLDM / FSTM to emulate indexed FP load / store.

//===---------------------------------------------------------------------===//

We should add i64 support to take advantage of the 64-bit load / stores.
We can add a pseudo i64 register class containing pseudo registers that are
register pairs. All other ops (e.g. add, sub) would be expanded as usual.

We need to add pseudo instructions (i.e. gethi / getlo) to extract i32 registers
from the i64 register. These are single moves which can be eliminated if the
destination register is a sub-register of the source. We should implement proper
subreg support in the register allocator to coalesce these away.

There are other minor issues such as multiple instructions for a spill / restore
/ move.

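A small test case that would benefit once i64 register pairs exist (a sketch;
the instruction choices named in the comment are my assumption):

/* With a register-pair i64 class, this copy could become a single ldrd/ldm
   followed by a single strd/stm, rather than two 32-bit loads plus two
   32-bit stores. */
void copy_i64(long long *dst, const long long *src) {
    *dst = *src;
}
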
//===---------------------------------------------------------------------===//

Implement support for some more tricky ways to materialize immediates.  For
example, to get 0xffff8000, we can use:

mov r9, #&3f8000
sub r9, r9, #&400000

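A quick sanity check of that sequence in C (the claim that both constants are
encodable as ARM rotated 8-bit immediates is my own check, not part of the
original note):

#include <stdint.h>
#include <assert.h>

int main(void) {
    uint32_t r9 = 0x3f8000u;    /* mov r9, #0x3f8000 */
    r9 -= 0x400000u;            /* sub r9, r9, #0x400000 */
    assert(r9 == 0xffff8000u);  /* wraps to the desired constant */
    return 0;
}
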
//===---------------------------------------------------------------------===//

We sometimes generate multiple add / sub instructions to update sp in prologue
and epilogue if the inc / dec value is too large to fit in a single immediate
operand. In some cases, perhaps it might be better to load the value from a
constantpool instead.

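For example (a hypothetical case; the exact threshold depends on what fits in
an ARM rotated-immediate operand):

/* A frame of 1,000,000 bytes cannot be encoded in one immediate operand,
   so, per the note above, the prologue and epilogue each currently need
   several sp adjustments. */
void big_frame(void) {
    volatile char buf[1000000];
    buf[0] = 0;
}
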
//===---------------------------------------------------------------------===//

GCC generates significantly better code for this function.

int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
    int i = 0;

    if (StackPtr != 0) {
        while (StackPtr != 0 && i < (((LineLen) < (32768)) ? (LineLen) : (32768)))
            Line[i++] = Stack[--StackPtr];
        if (LineLen > 32768)
        {
            while (StackPtr != 0 && i < LineLen)
            {
                i++;
                --StackPtr;
            }
        }
    }
    return StackPtr;
}

//===---------------------------------------------------------------------===//

This should compile to the mlas instruction:
int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }

//===---------------------------------------------------------------------===//

At some point, we should triage these to see if they still apply to us:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663

http://www.inf.u-szeged.hu/gcc-arm/
http://citeseer.ist.psu.edu/debus04linktime.html

//===---------------------------------------------------------------------===//

gcc generates smaller code for this function at -O2 or -Os:

void foo(signed char* p) {
  if (*p == 3)
    bar();
  else if (*p == 4)
    baz();
  else if (*p == 5)
    quux();
}

llvm decides it's a good idea to turn the repeated if...else into a
binary tree, as if it were a switch; the resulting code needs one fewer
compare-and-branch when *p<=2 or *p==5, the same number when *p==4
or *p>6, and one more when *p==3. So it should be a speed win
(on balance). However, the revised code is larger, with 4 conditional
branches instead of 3.

More seriously, there is a byte->word extend before
each comparison, where there should be only one, and the condition codes
are not remembered when the same two values are compared twice.

//===---------------------------------------------------------------------===//

More register scavenging work:

1. Use the register scavenger to track frame indices materialized into registers
   (those that do not fit in addressing modes) to allow reuse in the same BB.
2. Finish scavenging for Thumb.
3. We know some spills and restores are unnecessary. The issue is that once
   live intervals are merged, they are never split. So every def is spilled
   and every use requires a restore if the register allocator decides the
   resulting live interval is not assigned a physical register. It may be
   possible (with the help of the scavenger) to turn some spill / restore
   pairs into register copies.

//===---------------------------------------------------------------------===//

Teach LSR about ARM addressing modes.
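
A loop like the one below is the sort of case in mind (a sketch; whether LSR
would actually form the post-incremented addressing here is an assumption, not
a claim about current behavior):

/* If LSR understands ARM post-increment addressing, the pointer walk can
   become "ldr rX, [rP], #4" instead of a separate induction-variable add
   on every iteration. */
int sum_ints(const int *p, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += p[i];
    return s;
}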