//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend.
//===---------------------------------------------------------------------===//

Reimplement 'select' in terms of 'SEL'.

* We would really like to support UXTAB16, but we need to prove that the
  add doesn't need to overflow between the two 16-bit chunks (see the C
  sketch after this list).

* Implement pre/post increment support.  (e.g. PR935)
* Implement smarter constant generation for binops with large immediates.

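For the UXTAB16 item above, a minimal C sketch of the kind of pattern we would
want to match (the function and argument names are made up).  UXTAB16 performs
two independent halfword adds, each adding a zero-extended byte from the second
operand, so the transform is only safe when the low halfword add provably
cannot carry into the high halfword:

/* Illustration only: a 32-bit add whose second operand holds a zero-extended
   byte in each halfword.  UXTAB16 would do this as two independent 16-bit
   adds, so it is only equivalent if the low halfword add cannot overflow
   into the high halfword. */
unsigned uxtab16_candidate(unsigned acc, unsigned packed) {
  return acc + (packed & 0x00ff00ffu);
}
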
//===---------------------------------------------------------------------===//

Crazy idea:  Consider code that uses lots of 8-bit or 16-bit values.  By the
time regalloc happens, these values are now in a 32-bit register, usually with
the top bits known to be sign or zero extended.  If spilled, we should be able
to spill these to an 8-bit or 16-bit stack slot, zero or sign extending as part
of the reload.

Doing this reduces the size of the stack frame (important for thumb etc), and
also increases the likelihood that we will be able to reload multiple values
from the stack with a single load.

//===---------------------------------------------------------------------===//

The constant island pass is in good shape.  Some cleanups might be desirable,
but there is unlikely to be much improvement in the generated code.

1.  There may be some advantage to trying to be smarter about the initial
    placement, rather than putting everything at the end.

2.  There might be some compile-time efficiency to be had by representing
    consecutive islands as a single block rather than multiple blocks.

3.  Use a priority queue to sort constant pool users in inverse order of
    position, so we always process the one closest to the end of the function
    first.  This may simplify CreateNewWater.

//===---------------------------------------------------------------------===//

Eliminate copysign custom expansion.  We are still generating crappy code with
default expansion + if-conversion.
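A trivial reproducer (the function name is made up; this assumes the libm call
is what reaches the copysign lowering):

#include <math.h>

/* Exercises the copysign lowering path, which currently goes through the
   custom expansion mentioned above. */
double copysign_test(double mag, double sgn) {
  return copysign(mag, sgn);
}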

//===---------------------------------------------------------------------===//

Eliminate one instruction from:

define i32 @_Z6slow4bii(i32 %x, i32 %y) {
        %tmp = icmp sgt i32 %x, %y
        %retval = select i1 %tmp, i32 %x, i32 %y
        ret i32 %retval
}

__Z6slow4bii:
        cmp r0, r1
        movgt r1, r0
        mov r0, r1
        bx lr
=>

__Z6slow4bii:
        cmp r0, r1
        movle r0, r1
        bx lr

//===---------------------------------------------------------------------===//

Implement long long "X-3" with instructions that fold the immediate in.  These
were disabled due to badness with the ARM carry flag on subtracts.
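
A minimal reproducer (function name made up); ideally the constant would fold
straight into the subtract sequence:

/* Would like this to become a subs/sbc pair with the immediate folded in,
   rather than materializing the constant separately. */
long long sub3(long long x) {
  return x - 3;
}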

//===---------------------------------------------------------------------===//

More load / store optimizations:
1) Better representation for block transfer?  This is from Olden/power:

        fldd d0, [r4]
        fstd d0, [r4, #+32]
        fldd d0, [r4, #+8]
        fstd d0, [r4, #+40]
        fldd d0, [r4, #+16]
        fstd d0, [r4, #+48]
        fldd d0, [r4, #+24]
        fstd d0, [r4, #+56]

If we can spare the registers, it would be better to use fldm and fstm here.
Need major register allocator enhancement though.

2) Can we recognize the relative position of constantpool entries?  i.e. treat

        ldr r0, LCPI17_3
        ldr r1, LCPI17_4
        ldr r2, LCPI17_5

   as

        ldr r0, LCPI17
        ldr r1, LCPI17+4
        ldr r2, LCPI17+8

   Then the ldr's can be combined into a single ldm.  See Olden/power.

Note that for ARM v4, gcc uses ldmia to load a pair of 32-bit values to
represent a double 64-bit FP constant:

        adr r0, L6
        ldmia r0, {r0-r1}

        .align 2
L6:
        .long -858993459
        .long 1074318540

3) struct copies appear to be done field by field instead of by words, at
   least sometimes:

struct foo { int x; short s; char c1; char c2; };
void cpy(struct foo*a, struct foo*b) { *a = *b; }

llvm code (-O2):
        ldrb r3, [r1, #+6]
        ldr r2, [r1]
        ldrb r12, [r1, #+7]
        ldrh r1, [r1, #+4]
        str r2, [r0]
        strh r1, [r0, #+4]
        strb r3, [r0, #+6]
        strb r12, [r0, #+7]

gcc code (-O2):
        ldmia r1, {r1-r2}
        stmia r0, {r1-r2}

In this benchmark, poor handling of aggregate copies has shown up as having a
large effect on size, and possibly speed as well (we don't have a good way to
measure on ARM).

//===---------------------------------------------------------------------===//

* Consider this silly example:

double bar(double x) {
  double r = foo(3.1);
  return x+r;
}

_bar:
        stmfd sp!, {r4, r5, r7, lr}
        add r7, sp, #8
        mov r4, r0
        mov r5, r1
        fldd d0, LCPI1_0
        fmrrd r0, r1, d0
        bl _foo
        fmdrr d0, r4, r5
        fmsr s2, r0
        fsitod d1, s2
        faddd d0, d1, d0
        fmrrd r0, r1, d0
        ldmfd sp!, {r4, r5, r7, pc}

Ignore the prologue and epilogue stuff for a second.  Note
        mov r4, r0
        mov r5, r1
the copies to callee-save registers and the fact that they are only being used
by the fmdrr instruction.  It would have been better had the fmdrr been
scheduled before the call and placed the result in a callee-save DPR register.
The two mov ops would not have been necessary.

//===---------------------------------------------------------------------===//

Calling convention related stuff:

* gcc's parameter passing implementation is terrible and we suffer as a result:

e.g.
struct s {
  double d1;
  int s1;
};

void foo(struct s S) {
  printf("%g, %d\n", S.d1, S.s1);
}

'S' is passed via registers r0, r1, r2.  But gcc stores them to the stack, and
then reloads them to r1, r2, and r3 before issuing the call (r0 contains the
address of the format string):

        stmfd sp!, {r7, lr}
        add r7, sp, #0
        sub sp, sp, #12
        stmia sp, {r0, r1, r2}
        ldmia sp, {r1-r2}
        ldr r0, L5
        ldr r3, [sp, #8]
L2:
        add r0, pc, r0
        bl L_printf$stub

Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves?

* Returning an aggregate type is even worse:

e.g.
struct s foo(void) {
  struct s S = {1.1, 2};
  return S;
}

        mov ip, r0
        ldr r0, L5
        sub sp, sp, #12
L2:
        add r0, pc, r0
        @ lr needed for prologue
        ldmia r0, {r0, r1, r2}
        stmia sp, {r0, r1, r2}
        stmia ip, {r0, r1, r2}
        mov r0, ip
        add sp, sp, #12
        bx lr

r0 (and later ip) is the hidden parameter from the caller giving the address to
store the value in.  The first ldmia loads the constants into r0, r1, r2.  The
last stmia stores r0, r1, r2 into the address passed in.  However, there is one
additional stmia that stores r0, r1, and r2 to some stack location.  That store
is dead.

The llvm-gcc generated code looks like this:

csretcc void %foo(%struct.s* %agg.result) {
entry:
        %S = alloca %struct.s, align 4          ; <%struct.s*> [#uses=1]
        %memtmp = alloca %struct.s              ; <%struct.s*> [#uses=1]
        cast %struct.s* %S to sbyte*            ; <sbyte*>:0 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
        cast %struct.s* %agg.result to sbyte*   ; <sbyte*>:1 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
        cast %struct.s* %memtmp to sbyte*       ; <sbyte*>:2 [#uses=1]
        call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
        ret void
}

llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from the
constantpool).  Perhaps we should 1) fix llvm-gcc so the memcpy is translated
into a number of loads and stores, or 2) custom lower memcpy (of small size) to
be ldmia / stmia.  I think option 2 is better, but the current register
allocator cannot allocate a chunk of registers at a time.

A feasible temporary solution is to use specific physical registers at lowering
time for small (<= 4 words?) transfer sizes.

* The ARM CSRet calling convention requires the hidden argument to be returned
  by the callee.

//===---------------------------------------------------------------------===//

We can definitely do a better job on BB placements to eliminate some branches.
It's very common to see llvm generated assembly code that looks like this:

LBB3:
 ...
LBB4:
...
 beq LBB3
 b LBB2

If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4.  We can
then eliminate the beq and turn the unconditional branch to LBB2 into a bne.

See McCat/18-imp/ComputeBoundingBoxes for an example.

//===---------------------------------------------------------------------===//

Pre-/post- indexed load / stores:

1) We should not make the pre/post- indexed load/store transform if the base
   ptr is guaranteed to be live beyond the load/store.  This can happen if the
   base ptr is live out of the block we are performing the optimization in,
   e.g.

mov r1, r2
ldr r3, [r1], #4
...

vs.

ldr r3, [r2]
add r1, r2, #4
...

In most cases, this is just a wasted optimization.  However, sometimes it can
negatively impact performance because two-address code is more restrictive
when it comes to scheduling.

Unfortunately, liveout information is currently unavailable during DAG combine
time.

2) Consider splitting an indexed load / store into a pair of add/sub +
   load/store to solve #1 (in TwoAddressInstructionPass.cpp).

3) Enhance LSR to generate more opportunities for indexed ops (see the loop
   sketch after this list).

4) Once we add support for multiple result patterns, write indexed load
   patterns instead of C++ instruction selection code.

5) Use VLDM / VSTM to emulate indexed FP load / store.
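
For 3), a sketch (with made-up names) of a typical loop where a post-indexed
load pays off; LSR would ideally shape the address computation so the pointer
increment folds into the load:

/* The load inside the loop is a natural fit for a post-indexed ldr,
   e.g. "ldr rX, [rP], #4", folding the pointer increment into the load. */
int sum_words(const int *p, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += *p++;
  return s;
}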

//===---------------------------------------------------------------------===//

Implement support for some more tricky ways to materialize immediates.  For
example, to get 0xffff8000, we can use:

mov r9, #&3f8000
sub r9, r9, #&400000

//===---------------------------------------------------------------------===//

We sometimes generate multiple add / sub instructions to update sp in prologue
and epilogue if the inc / dec value is too large to fit in a single immediate
operand.  In some cases, perhaps it might be better to load the value from a
constantpool instead.
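
An illustrative reproducer (names made up; the frame size is chosen so that
the sp adjustment is unlikely to be encodable as a single rotated 8-bit
immediate, though the exact threshold depends on what else is in the frame):

extern void use_buffer(char *p);

/* The prologue/epilogue currently need two add/sub instructions each to
   adjust sp by a value like this. */
void big_frame(void) {
  char buf[4100];
  use_buffer(buf);
}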

//===---------------------------------------------------------------------===//

GCC generates significantly better code for this function:

int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
    int i = 0;

    if (StackPtr != 0) {
        while (StackPtr != 0 && i < (((LineLen) < (32768)) ? (LineLen) : (32768)))
            Line[i++] = Stack[--StackPtr];
        if (LineLen > 32768)
        {
            while (StackPtr != 0 && i < LineLen)
            {
                i++;
                --StackPtr;
            }
        }
    }
    return StackPtr;
}

//===---------------------------------------------------------------------===//

This should compile to the mlas instruction:
int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }

//===---------------------------------------------------------------------===//

At some point, we should triage these to see if they still apply to us:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663

http://www.inf.u-szeged.hu/gcc-arm/
http://citeseer.ist.psu.edu/debus04linktime.html

//===---------------------------------------------------------------------===//

gcc generates smaller code for this function at -O2 or -Os:

void foo(signed char* p) {
  if (*p == 3)
    bar();
  else if (*p == 4)
    baz();
  else if (*p == 5)
    quux();
}

llvm decides it's a good idea to turn the repeated if...else into a binary
tree, as if it were a switch; the resulting code requires one fewer
compare-and-branch when *p<=2 or *p==5, the same number when *p==4 or *p>6, and
one more when *p==3.  So it should be a speed win (on balance).  However, the
revised code is larger, with 4 conditional branches instead of 3.

More seriously, there is a byte->word extend before each comparison, where
there should be only one, and the condition codes are not remembered when the
same two values are compared twice.

//===---------------------------------------------------------------------===//

More LSR enhancements possible:

1. Teach LSR about pre- and post- indexed ops to allow the iv increment to be
   merged into a load / store.
2. Allow iv reuse even when a type conversion is required.  For example, i8
   and i32 load / store addressing modes are identical (see the loop sketch
   below).
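
For 2., a sketch (names are illustrative) of the kind of loop this matters for;
the i8 store and the i32 load ideally share one induction variable, with the
difference in scale handled by the addressing mode:

/* Two strided accesses with different element types; the induction variables
   differ only by a type conversion / scale, so ideally LSR reuses one iv for
   both. */
void trunc_copy(unsigned char *dst, const int *src, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = (unsigned char)src[i];
}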

//===---------------------------------------------------------------------===//

This:

int foo(int a, int b, int c, int d) {
  long long acc = (long long)a * (long long)b;
  acc += (long long)c * (long long)d;
  return (int)(acc >> 32);
}

Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies
two signed 32-bit values to produce a 64-bit value, and accumulates this with
a 64-bit value.

We currently get this with both v4 and v6:

_foo:
        smull r1, r0, r1, r0
        smull r3, r2, r3, r2
        adds r3, r3, r1
        adc r0, r2, r0
        bx lr

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

_Z8full_addjj:
        adds r2, r1, r2
        movcc r1, #0
        movcs r1, #1
        str r2, [r0, #0]
        strb r1, [r0, #4]
        mov pc, lr

_Z11no_overflowjj:
        cmn r0, r1
        movcs r0, #0
        movcc r0, #1
        mov pc, lr

not:

__Z8full_addjj:
        add r3, r2, r1
        str r3, [r0]
        mov r2, #1
        mov r12, #0
        cmp r3, r1
        movlo r12, r2
        str r12, [r0, #+4]
        bx lr

__Z11no_overflowjj:
        add r3, r1, r0
        mov r2, #1
        mov r1, #0
        cmp r3, r0
        movhs r1, r2
        mov r0, r1
        bx lr

//===---------------------------------------------------------------------===//

Some of the NEON intrinsics may be appropriate for more general use, either
as target-independent intrinsics or perhaps elsewhere in the ARM backend.
Some of them may also be lowered to target-independent SDNodes, and perhaps
some new SDNodes could be added.

For example, maximum, minimum, and absolute value operations are well-defined
and standard operations, both for vector and scalar types.

The current NEON-specific intrinsics for count leading zeros and count one
bits could perhaps be replaced by the target-independent ctlz and ctpop
intrinsics.  It may also make sense to add a target-independent "ctls"
intrinsic for "count leading sign bits".  Likewise, the backend could use
the target-independent SDNodes for these operations.
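
As a scalar illustration of the ctlz/ctpop direction (assuming the usual
GCC-style builtins, which the frontends lower to the target-independent
intrinsics):

/* Count leading zeros / population count; these normally become the
   target-independent llvm.ctlz / llvm.ctpop intrinsics.  Note that
   __builtin_clz is undefined for a zero argument. */
int clz32(unsigned x) { return __builtin_clz(x); }
int pop32(unsigned x) { return __builtin_popcount(x); }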

ARMv6 has scalar saturating and halving adds and subtracts.  The same
intrinsics could possibly be used for both NEON's vector implementations of
those operations and the ARMv6 scalar versions.

//===---------------------------------------------------------------------===//

ARM::MOVCCr is commutable (by flipping the condition).  But we need to
implement ARMInstrInfo::commuteInstruction() to support it.

//===---------------------------------------------------------------------===//

Split out LDR (literal) from the normal ARM LDR instruction.  Also consider
splitting LDR into imm12 and so_reg forms.  This allows us to clean up some
code; e.g. ARMLoadStoreOptimizer does not need to look at LDR (literal) and
LDR (so_reg), while ARMConstantIslandPass only needs to worry about LDR
(literal).

//===---------------------------------------------------------------------===//

The constant island pass should make use of full-range SoImm values for
LEApcrel.  Be careful though, as the last attempt caused infinite looping on
lencod.

//===---------------------------------------------------------------------===//

Predication issue.  This function:

extern unsigned array[ 128 ];
int foo( int x ) {
  int y;
  y = array[ x & 127 ];
  if ( x & 128 )
    y = 123456789 & ( y >> 2 );
  else
    y = 123456789 & y;
  return y;
}

compiles to:

_foo:
        and r1, r0, #127
        ldr r2, LCPI1_0
        ldr r2, [r2]
        ldr r1, [r2, +r1, lsl #2]
        mov r2, r1, lsr #2
        tst r0, #128
        moveq r2, r1
        ldr r0, LCPI1_1
        and r0, r2, r0
        bx lr

It would be better to do something like this, to fold the shift into the
conditional move:

        and r1, r0, #127
        ldr r2, LCPI1_0
        ldr r2, [r2]
        ldr r1, [r2, +r1, lsl #2]
        tst r0, #128
        movne r1, r1, lsr #2
        ldr r0, LCPI1_1
        and r0, r1, r0
        bx lr

It saves an instruction and a register.

//===---------------------------------------------------------------------===//

It might be profitable to cse MOVi16 if there are lots of 32-bit immediates
with the same bottom half.
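
A made-up example of the situation: all three constants share the low
halfword, so the movw (MOVi16) halves are identical and could in principle be
CSE'd into one register, leaving only the movt halves to differ (whether this
is a net win depends on how the resulting copies coalesce):

/* Three constants with the same bottom half (0x0001); only the top halves
   differ. */
void store_consts(unsigned *p) {
  p[0] = 0x00010001u;
  p[1] = 0x00020001u;
  p[2] = 0x00030001u;
}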

//===---------------------------------------------------------------------===//

Robert Muth started working on an alternate jump table implementation that
does not put the tables in-line in the text.  This is more like the llvm
default jump table implementation.  This might be useful sometime.  Several
revisions of patches are on the mailing list, beginning at:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2009-June/022763.html

//===---------------------------------------------------------------------===//

Make use of the "rbit" instruction.
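
There is no obvious portable C idiom for this, but full bit reversal is one
candidate; the naive version below is shown purely as an illustration of what
a single rbit could replace:

/* Naive 32-bit bit reversal; a single rbit instruction computes the same
   result. */
unsigned reverse_bits(unsigned x) {
  unsigned r = 0;
  for (int i = 0; i < 32; ++i) {
    r = (r << 1) | (x & 1u);
    x >>= 1;
  }
  return r;
}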

//===---------------------------------------------------------------------===//

Take a look at test/CodeGen/Thumb2/machine-licm.ll.  ARM should be taught how
to licm and cse the unnecessary load from cp#1.

//===---------------------------------------------------------------------===//

The CMN instruction sets the flags like an ADD instruction, while CMP sets
them like a subtract.  Therefore, to be able to use CMN for comparisons other
than the Z bit, we'll need additional logic to reverse the conditionals
associated with the comparison.  Perhaps a pseudo-instruction for the
comparison, with a post-codegen pass to clean up and handle the condition
codes?  See PR5694 for a testcase.
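
A couple of hedged C examples of where CMN applies (cmn rA, rB sets the flags
for rA + rB, i.e. a compare of rA against -rB).  The equality case only needs
the Z bit, so no condition adjustment is required; the ordered case is where
the condition-reversal logic described above comes in:

/* Equality against a negated value: cmn plus moveq/movne works as-is, since
   only the Z bit is consulted. */
int eq_neg(int a, int b) { return a == -b; }

/* Ordered comparison against a negated value: only usable with CMN if the
   condition code is adjusted for add-style flags. */
int lt_neg(int a, int b) { return a < -b; }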