//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend.
//===---------------------------------------------------------------------===//

Reimplement 'select' in terms of 'SEL'.

* We would really like to support UXTAB16, but we need to prove that the
  add doesn't need to overflow between the two 16-bit chunks (see the sketch
  after this list).

* Implement predication support.
* Implement pre/post increment support. (e.g. PR935)
* Coalesce stack slots!
* Implement smarter constant generation for binops with large immediates.

* Consider materializing FP constants like 0.0f and 1.0f using integer
  immediate instructions, then copying to the FPU. Slower than a load into
  the FPU?
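
As a sketch of why that proof matters, here is a plain C model of what
UXTAB16 computes (a hypothetical helper, ignoring the optional rotation):

/* C model (sketch) of UXTAB16: two independent 16-bit adds whose second
   operands are the zero-extended low bytes of each half of 'b'. */
unsigned uxtab16_model(unsigned a, unsigned b) {
  unsigned lo = ((a & 0xffffu) + (b & 0xffu)) & 0xffffu;
  unsigned hi = ((a >> 16) + ((b >> 16) & 0xffu)) & 0xffffu;
  return (hi << 16) | lo;
}

To match a plain 32-bit add against this, we must prove that the low-halfword
sum cannot carry into bit 16; otherwise the single 32-bit add and the two
independent 16-bit adds give different results.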

//===---------------------------------------------------------------------===//

The constant island pass is in good shape. Some cleanups might be desirable,
but there is unlikely to be much improvement in the generated code.

1. There may be some advantage to trying to be smarter about the initial
placement, rather than putting everything at the end.

2. The handling of 2-byte padding for Thumb is overly conservative. There
would be a small gain to keeping accurate track of the padding (which would
require aligning functions containing constant pools to 4-byte boundaries).

3. There might be some compile-time efficiency to be had by representing
consecutive islands as a single block rather than multiple blocks.

//===---------------------------------------------------------------------===//

We need to start generating predicated instructions. The .td files have a way
to express this now (see the PPC conditional return instruction), but the
branch folding pass (or a new if-cvt pass) should start producing these, at
least in the trivial case.

Among the obvious wins, doing so can eliminate the need to custom expand
copysign (i.e. we won't need to custom expand it to get the conditional
negate).

This allows us to eliminate one instruction from:

define i32 @_Z6slow4bii(i32 %x, i32 %y) {
        %tmp = icmp sgt i32 %x, %y
        %retval = select i1 %tmp, i32 %x, i32 %y
        ret i32 %retval
}

__Z6slow4bii:
        cmp r0, r1
        movgt r1, r0
        mov r0, r1
        bx lr

//===---------------------------------------------------------------------===//

Implement long long "X-3" with instructions that fold the immediate in. These
were disabled due to badness with the ARM carry flag on subtracts.
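
For example (a sketch, assuming the usual ARM calling convention with the low
word of the i64 in r0 and the high word in r1), this:

long long sub3(long long x) { return x - 3; }

should fold the immediate directly into the subtract-with-borrow pair:

        subs r0, r0, #3
        sbc  r1, r1, #0
        bx   lr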

//===---------------------------------------------------------------------===//

We currently compile abs:
int foo(int p) { return p < 0 ? -p : p; }

into:

_foo:
        rsb r1, r0, #0
        cmn r0, #1
        movgt r1, r0
        mov r0, r1
        bx lr

This is very, uh, literal. This could be a 3 operation sequence:
  t = (p sra 31);
  res = (p xor t) - t

Which would be better. This occurs in png decode.
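
The same sequence in C, for reference (a sketch; it relies on >> of a negative
int being an arithmetic shift, which is what ARM targets provide even though C
leaves it implementation-defined):

int iabs(int p) {
  int t = p >> 31;       /* 0 if p >= 0, -1 if p < 0 */
  return (p ^ t) - t;    /* identity for t == 0, two's-complement negate for t == -1 */
}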

//===---------------------------------------------------------------------===//

More load / store optimizations:
1) Look past instructions without side-effects (not load, store, branch, etc.)
   when forming the list of loads / stores to optimize.

2) Smarter register allocation?
We are probably missing some opportunities to use ldm / stm. Consider:

ldr r5, [r0]
ldr r4, [r0, #4]

This cannot be merged into an ldm. Perhaps we will need to do the
transformation before register allocation. Then teach the register allocator
to allocate a chunk of consecutive registers.

3) Better representation for block transfer? This is from Olden/power:

        fldd d0, [r4]
        fstd d0, [r4, #+32]
        fldd d0, [r4, #+8]
        fstd d0, [r4, #+40]
        fldd d0, [r4, #+16]
        fstd d0, [r4, #+48]
        fldd d0, [r4, #+24]
        fstd d0, [r4, #+56]

If we can spare the registers, it would be better to use fldm and fstm here.
Need major register allocator enhancement though.

4) Can we recognize the relative position of constantpool entries? i.e. Treat

        ldr r0, LCPI17_3
        ldr r1, LCPI17_4
        ldr r2, LCPI17_5

   as
        ldr r0, LCPI17
        ldr r1, LCPI17+4
        ldr r2, LCPI17+8

   Then the ldr's can be combined into a single ldm. See Olden/power.

Note that for ARM v4, gcc uses ldmia to load a pair of 32-bit values to
represent a double 64-bit FP constant:

        adr r0, L6
        ldmia r0, {r0-r1}

        .align 2
L6:
        .long -858993459
        .long 1074318540

5) Can we make use of ldrd and strd? Instead of generating ldm / stm, use
ldrd/strd if there are only two destination registers that form an even/odd
pair. However, we would probably pay a penalty if the address is not aligned
on an 8-byte boundary. This requires more information on load / store nodes
(and MIs?) than we currently carry.

6) Struct copies appear to be done field by field
instead of by words, at least sometimes:

struct foo { int x; short s; char c1; char c2; };
void cpy(struct foo *a, struct foo *b) { *a = *b; }

llvm code (-O2)
        ldrb r3, [r1, #+6]
        ldr r2, [r1]
        ldrb r12, [r1, #+7]
        ldrh r1, [r1, #+4]
        str r2, [r0]
        strh r1, [r0, #+4]
        strb r3, [r0, #+6]
        strb r12, [r0, #+7]
gcc code (-O2)
        ldmia r1, {r1-r2}
        stmia r0, {r1-r2}

In this benchmark, poor handling of aggregate copies has shown up as
having a large effect on size, and possibly speed as well (we don't have
a good way to measure on ARM).

//===---------------------------------------------------------------------===//

* Consider this silly example:

double bar(double x) {
  double r = foo(3.1);
  return x+r;
}

_bar:
        sub sp, sp, #16
        str r4, [sp, #+12]
        str r5, [sp, #+8]
        str lr, [sp, #+4]
        mov r4, r0
        mov r5, r1
        ldr r0, LCPI2_0
        bl _foo
        fmsr f0, r0
        fcvtsd d0, f0
        fmdrr d1, r4, r5
        faddd d0, d0, d1
        fmrrd r0, r1, d0
        ldr lr, [sp, #+4]
        ldr r5, [sp, #+8]
        ldr r4, [sp, #+12]
        add sp, sp, #16
        bx lr

Ignore the prologue and epilogue stuff for a second. Note
        mov r4, r0
        mov r5, r1
the copies to callee-save registers and the fact that they are only used by
the fmdrr instruction. It would have been better had the fmdrr been scheduled
before the call, placing the result in a callee-save DPR register. The two
mov ops would not have been necessary.

//===---------------------------------------------------------------------===//

Calling convention related stuff:

* gcc's parameter passing implementation is terrible and we suffer as a result:

e.g.
struct s {
  double d1;
  int s1;
};

void foo(struct s S) {
  printf("%g, %d\n", S.d1, S.s1);
}

'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and
then reloads them to r1, r2, and r3 before issuing the call (r0 contains the
address of the format string):

        stmfd sp!, {r7, lr}
        add r7, sp, #0
        sub sp, sp, #12
        stmia sp, {r0, r1, r2}
        ldmia sp, {r1-r2}
        ldr r0, L5
        ldr r3, [sp, #8]
L2:
        add r0, pc, r0
        bl L_printf$stub

Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves?

* Returning an aggregate type is even worse:

e.g.
struct s foo(void) {
  struct s S = {1.1, 2};
  return S;
}

        mov ip, r0
        ldr r0, L5
        sub sp, sp, #12
L2:
        add r0, pc, r0
        @ lr needed for prologue
        ldmia r0, {r0, r1, r2}
        stmia sp, {r0, r1, r2}
        stmia ip, {r0, r1, r2}
        mov r0, ip
        add sp, sp, #12
        bx lr

r0 (and later ip) is the hidden parameter from the caller giving the address
in which to store the return value. The first ldmia loads the constants into
r0, r1, r2. The last stmia stores r0, r1, r2 into the address passed in.
However, there is one additional stmia that stores r0, r1, and r2 to some
stack location. That store is dead.

The llvm-gcc generated code looks like this:

csretcc void %foo(%struct.s* %agg.result) {
entry:
        %S = alloca %struct.s, align 4          ; <%struct.s*> [#uses=1]
        %memtmp = alloca %struct.s              ; <%struct.s*> [#uses=1]
        cast %struct.s* %S to sbyte*            ; <sbyte*>:0 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
        cast %struct.s* %agg.result to sbyte*   ; <sbyte*>:1 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
        cast %struct.s* %memtmp to sbyte*       ; <sbyte*>:2 [#uses=1]
        call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
        ret void
}

llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from the
constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated
into a number of loads and stores, or 2) custom lower memcpy (of small size)
to be ldmia / stmia. I think option 2 is better, but the current register
allocator cannot allocate a chunk of registers at a time.

A feasible temporary solution is to use specific physical registers at
lowering time for small (<= 4 words?) transfer sizes.

* The ARM CSRet calling convention requires the hidden argument to be returned
by the callee.

//===---------------------------------------------------------------------===//

We can definitely do a better job on BB placement to eliminate some branches.
It's very common to see llvm-generated assembly code that looks like this:

LBB3:
        ...
LBB4:
        ...
        beq LBB3
        b LBB2

If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can
then eliminate the beq and turn the unconditional branch to LBB2 into a bne.

See McCat/18-imp/ComputeBoundingBoxes for an example.

//===---------------------------------------------------------------------===//

Register scavenging is now implemented. The example in the previous version
of this document produces optimal code at -O2.

//===---------------------------------------------------------------------===//

Pre-/post-indexed loads / stores:

1) We should not make the pre-/post-indexed load/store transform if the base
ptr is guaranteed to be live beyond the load/store. This can happen if the
base ptr is live out of the block in which we are performing the optimization,
e.g.:

mov r1, r2
ldr r3, [r1], #4
...

vs.

ldr r3, [r2]
add r1, r2, #4
...

In most cases, this is just a wasted optimization. However, sometimes it can
negatively impact performance because two-address code is more restrictive
when it comes to scheduling.

Unfortunately, liveout information is currently unavailable during DAG combine
time.

2) Consider splitting an indexed load / store into a pair of add/sub +
   load/store to solve #1 (in TwoAddressInstructionPass.cpp).

3) Enhance LSR to generate more opportunities for indexed ops.

4) Once we have added support for multiple result patterns, write indexed load
   patterns instead of C++ instruction selection code.

5) Use FLDM / FSTM to emulate indexed FP load / store.
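
A sketch of what is meant by item 5 (pre-UAL VFP mnemonics assumed; the
writeback form bumps the base register by the transfer size, so a one-register
FLDMIAS behaves like a post-indexed FP load):

        flds s0, [r0]
        add  r0, r0, #4

could instead be the single instruction:

        fldmias r0!, {s0}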

//===---------------------------------------------------------------------===//

We should add i64 support to take advantage of the 64-bit loads / stores.
We can add a pseudo i64 register class containing pseudo registers that are
register pairs. All other ops (e.g. add, sub) would be expanded as usual.

We need to add pseudo instructions (e.g. gethi / getlo) to extract i32
registers from the i64 register. These are single moves which can be
eliminated if the destination register is a sub-register of the source. We
should implement proper subreg support in the register allocator to coalesce
these away.
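
As an illustration (a sketch only; the ldrd syntax, register assignments,
even/odd pair constraint, and 8-byte alignment are assumptions):

long long deref(long long *p) { return *p; }

could be selected as a single paired load plus the two extraction moves,

        ldrd r2, [r0]        @ loads r2 (low word) and r3 (high word)
        mov  r0, r2          @ getlo
        mov  r1, r3          @ gethi
        bx   lr

where the two movs are exactly the copies that subreg coalescing should remove
once the pair can be allocated directly to r0/r1.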
356
357There are other minor issues such as multiple instructions for a spill / restore
358/ move.
359
360//===---------------------------------------------------------------------===//
361
Implement support for some more tricky ways to materialize immediates. For
example, to get 0xffff8000, we can use:

mov r9, #&3f8000
sub r9, r9, #&400000

//===---------------------------------------------------------------------===//

We sometimes generate multiple add / sub instructions to update sp in prologue
and epilogue if the inc / dec value is too large to fit in a single immediate
operand. In some cases, perhaps it might be better to load the value from a
constantpool instead.
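
A sketch of the trade-off (the label is hypothetical; 4104 = 0x1008 does not
fit in a single rotated-immediate field, while 4096 and 8 each do):

        sub sp, sp, #4096
        sub sp, sp, #8

versus a constant-pool load into a scratch register:

        ldr r12, .LCPI_frame    @ .LCPI_frame: .long 4104 (hypothetical entry)
        sub sp, sp, r12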

//===---------------------------------------------------------------------===//

GCC generates significantly better code for this function:

int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
  int i = 0;

  if (StackPtr != 0) {
    while (StackPtr != 0 && i < (((LineLen) < (32768)) ? (LineLen) : (32768)))
      Line[i++] = Stack[--StackPtr];
    if (LineLen > 32768) {
      while (StackPtr != 0 && i < LineLen) {
        i++;
        --StackPtr;
      }
    }
  }
  return StackPtr;
}

//===---------------------------------------------------------------------===//

This should compile to the mlas instruction:
int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }
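
Something along these lines (a sketch; register choices are illustrative, and
mlas sets the N and Z flags from the multiply-accumulate result, so the
comparison against zero comes for free):

        mlas  r3, r0, r1, r2    @ r3 = r0 * r1 + r2, setting flags
        mov   r0, #13
        movmi r0, #7
        bx    lr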

//===---------------------------------------------------------------------===//

At some point, we should triage these to see if they still apply to us:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663

http://www.inf.u-szeged.hu/gcc-arm/
http://citeseer.ist.psu.edu/debus04linktime.html

//===---------------------------------------------------------------------===//

gcc generates smaller code for this function at -O2 or -Os:

void foo(signed char* p) {
  if (*p == 3)
    bar();
  else if (*p == 4)
    baz();
  else if (*p == 5)
    quux();
}

llvm decides it's a good idea to turn the repeated if...else into a
binary tree, as if it were a switch; the resulting code requires one fewer
compare-and-branch when *p<=2 or *p==5, the same number when *p==4
or *p>6, and one more when *p==3. So it should be a speed win
(on balance). However, the revised code is larger, with 4 conditional
branches instead of 3.

More seriously, there is a byte->word extend before
each comparison, where there should be only one, and the condition codes
are not remembered when the same two values are compared twice.

//===---------------------------------------------------------------------===//

More register scavenging work:

1. Use the register scavenger to track frame indices materialized into
   registers (those that do not fit in addressing modes) to allow reuse in
   the same BB.
2. Finish scavenging for Thumb.
3. We know some spills and restores are unnecessary. The issue is that once
   live intervals are merged, they are never split. So every def is spilled
   and every use requires a restore if the register allocator decides the
   resulting live interval is not assigned a physical register. It may be
   possible (with the help of the scavenger) to turn some spill / restore
   pairs into register copies.

//===---------------------------------------------------------------------===//

Teach LSR about ARM addressing modes.
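
For example (a sketch), in a loop like

void zero(int *a, int n) {
  for (int i = 0; i != n; ++i)
    a[i] = 0;
}

ARM can fold the address computation into the memory operation itself, either
as a scaled index (str r2, [r0, r3, lsl #2]) or as a post-increment
(str r2, [r0], #4), so LSR should prefer induction variables that map onto
these modes rather than ones that need a separate add to form each address.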