//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend.
//===---------------------------------------------------------------------===//

Reimplement 'select' in terms of 'SEL'.

* We would really like to support UXTAB16, but we need to prove that the
  add doesn't need to overflow between the two 16-bit chunks (see the C sketch
  after this list).

* Implement pre/post increment support.  (e.g. PR935)
* Implement smarter constant generation for binops with large immediates.
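
A rough C model of what UXTAB16 computes, and the kind of plain 32-bit add it
could replace (the function names are made up for illustration):

unsigned uxtab16_model(unsigned rn, unsigned rm) {
  /* bytes 0 and 2 of rm are zero-extended and added to the two halfwords of
     rn; there is no carry from the low halfword into the high one */
  unsigned lo = ((rn & 0xffffu) + (rm & 0xffu)) & 0xffffu;
  unsigned hi = (((rn >> 16) & 0xffffu) + ((rm >> 16) & 0xffu)) & 0xffffu;
  return lo | (hi << 16);
}

unsigned candidate(unsigned a, unsigned b) {
  /* only equivalent to uxtab16_model(a, b) if the low halfword sum is proven
     not to carry into bit 16 */
  return a + (b & 0x00ff00ffu);
}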

//===---------------------------------------------------------------------===//

Crazy idea: Consider code that uses lots of 8-bit or 16-bit values.  By the
time regalloc happens, these values are now in a 32-bit register, usually with
the top bits known to be sign or zero extended.  If spilled, we should be able
to spill these to an 8-bit or 16-bit stack slot, zero or sign extending as part
of the reload.

Doing this reduces the size of the stack frame (important for thumb etc), and
also increases the likelihood that we will be able to reload multiple values
from the stack with a single load.

//===---------------------------------------------------------------------===//

The constant island pass is in good shape.  Some cleanups might be desirable,
but there is unlikely to be much improvement in the generated code.

1. There may be some advantage to trying to be smarter about the initial
placement, rather than putting everything at the end.

2. There might be some compile-time efficiency to be had by representing
consecutive islands as a single block rather than multiple blocks.

3. Use a priority queue to sort constant pool users in inverse order of
   position so we always process the one closest to the end of the function
   first.  This may simplify CreateNewWater.

//===---------------------------------------------------------------------===//

Eliminate copysign custom expansion.  We are still generating crappy code with
default expansion + if-conversion.
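
For reference, a minimal source pattern that hits this path (setsign is a
made-up name; copysign is the standard libm function):

#include <math.h>
double setsign(double mag, double sgn) {
  /* currently custom expanded; default expansion + if-conversion is still
     producing poor code for this */
  return copysign(mag, sgn);
}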

//===---------------------------------------------------------------------===//

Eliminate one instruction from:

define i32 @_Z6slow4bii(i32 %x, i32 %y) {
        %tmp = icmp sgt i32 %x, %y
        %retval = select i1 %tmp, i32 %x, i32 %y
        ret i32 %retval
}

__Z6slow4bii:
        cmp     r0, r1
        movgt   r1, r0
        mov     r0, r1
        bx      lr
=>

__Z6slow4bii:
        cmp     r0, r1
        movle   r0, r1
        bx      lr

//===---------------------------------------------------------------------===//

Implement long long "X-3" with instructions that fold the immediate in.  These
were disabled due to badness with the ARM carry flag on subtracts.
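
A minimal example of the pattern in question (the function name is made up);
the natural lowering would be a subs/sbc pair with the immediate folded in,
rather than materializing the constant in registers first:

long long sub3(long long x) {
  return x - 3;
}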

//===---------------------------------------------------------------------===//

We currently compile abs:
int foo(int p) { return p < 0 ? -p : p; }

into:

_foo:
        rsb r1, r0, #0
        cmn r0, #1
        movgt r1, r0
        mov r0, r1
        bx lr

This is very, uh, literal.  This could be a 3 operation sequence:
  t = (p sra 31);
  res = (p xor t)-t

Which would be better.  This occurs in png decode.
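
The same three-operation idea written as C (a sketch; it assumes an arithmetic
right shift of a signed int, which is what ARM targets use):

int iabs(int p) {
  int t = p >> 31;       /* 0 if p >= 0, -1 if p < 0 */
  return (p ^ t) - t;    /* conditionally complements and adds one */
}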

//===---------------------------------------------------------------------===//

More load / store optimizations:
1) Better representation for block transfer? This is from Olden/power:

        fldd d0, [r4]
        fstd d0, [r4, #+32]
        fldd d0, [r4, #+8]
        fstd d0, [r4, #+40]
        fldd d0, [r4, #+16]
        fstd d0, [r4, #+48]
        fldd d0, [r4, #+24]
        fstd d0, [r4, #+56]

If we can spare the registers, it would be better to use fldm and fstm here.
Need major register allocator enhancement though.

2) Can we recognize the relative position of constantpool entries? i.e. treat

        ldr r0, LCPI17_3
        ldr r1, LCPI17_4
        ldr r2, LCPI17_5

   as
        ldr r0, LCPI17
        ldr r1, LCPI17+4
        ldr r2, LCPI17+8

   Then the ldr's can be combined into a single ldm.  See Olden/power.

Note that for ARM v4, gcc uses ldmia to load a pair of 32-bit values to
represent a double 64-bit FP constant:

        adr r0, L6
        ldmia r0, {r0-r1}

        .align 2
L6:
        .long   -858993459
        .long   1074318540

3) struct copies appear to be done field by field
instead of by words, at least sometimes:

struct foo { int x; short s; char c1; char c2; };
void cpy(struct foo*a, struct foo*b) { *a = *b; }

llvm code (-O2)
        ldrb r3, [r1, #+6]
        ldr r2, [r1]
        ldrb r12, [r1, #+7]
        ldrh r1, [r1, #+4]
        str r2, [r0]
        strh r1, [r0, #+4]
        strb r3, [r0, #+6]
        strb r12, [r0, #+7]
gcc code (-O2)
        ldmia r1, {r1-r2}
        stmia r0, {r1-r2}

In this benchmark poor handling of aggregate copies has shown up as
having a large effect on size, and possibly speed as well (we don't have
a good way to measure on ARM).

//===---------------------------------------------------------------------===//

* Consider this silly example:

double bar(double x) {
  double r = foo(3.1);
  return x+r;
}

_bar:
        stmfd sp!, {r4, r5, r7, lr}
        add r7, sp, #8
        mov r4, r0
        mov r5, r1
        fldd d0, LCPI1_0
        fmrrd r0, r1, d0
        bl _foo
        fmdrr d0, r4, r5
        fmsr s2, r0
        fsitod d1, s2
        faddd d0, d1, d0
        fmrrd r0, r1, d0
        ldmfd sp!, {r4, r5, r7, pc}

Ignore the prologue and epilogue stuff for a second.  Note
        mov r4, r0
        mov r5, r1
the copies to callee-save registers and the fact they are only being used by
the fmdrr instruction.  It would have been better had the fmdrr been scheduled
before the call and placed the result in a callee-save DPR register.  The two
mov ops would not have been necessary.

//===---------------------------------------------------------------------===//

Calling convention related stuff:

* gcc's parameter passing implementation is terrible and we suffer as a result:

e.g.
struct s {
  double d1;
  int s1;
};

void foo(struct s S) {
  printf("%g, %d\n", S.d1, S.s1);
}

'S' is passed via registers r0, r1, r2.  But gcc stores them to the stack, and
then reloads them to r1, r2, and r3 before issuing the call (r0 contains the
address of the format string):

        stmfd sp!, {r7, lr}
        add r7, sp, #0
        sub sp, sp, #12
        stmia sp, {r0, r1, r2}
        ldmia sp, {r1-r2}
        ldr r0, L5
        ldr r3, [sp, #8]
L2:
        add r0, pc, r0
        bl L_printf$stub

Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves?

* Returning an aggregate type is even worse:

e.g.
struct s foo(void) {
  struct s S = {1.1, 2};
  return S;
}

        mov ip, r0
        ldr r0, L5
        sub sp, sp, #12
L2:
        add r0, pc, r0
        @ lr needed for prologue
        ldmia r0, {r0, r1, r2}
        stmia sp, {r0, r1, r2}
        stmia ip, {r0, r1, r2}
        mov r0, ip
        add sp, sp, #12
        bx lr

r0 (and later ip) is the hidden parameter from the caller to store the value
in.  The first ldmia loads the constants into r0, r1, r2.  The last stmia
stores r0, r1, r2 into the address passed in.  However, there is one additional
stmia that stores r0, r1, and r2 to some stack location.  That store is dead.

The llvm-gcc generated code looks like this:

csretcc void %foo(%struct.s* %agg.result) {
entry:
	%S = alloca %struct.s, align 4		; <%struct.s*> [#uses=1]
	%memtmp = alloca %struct.s		; <%struct.s*> [#uses=1]
	cast %struct.s* %S to sbyte*		; <sbyte*>:0 [#uses=2]
	call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
	cast %struct.s* %agg.result to sbyte*		; <sbyte*>:1 [#uses=2]
	call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
	cast %struct.s* %memtmp to sbyte*		; <sbyte*>:2 [#uses=1]
	call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
	ret void
}

llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from the
constantpool).  Perhaps we should 1) fix llvm-gcc so the memcpy is translated
into a number of loads and stores, or 2) custom lower memcpy (of small size)
to be ldmia / stmia.  I think option 2 is better, but the current register
allocator cannot allocate a chunk of registers at a time.

A feasible temporary solution is to use specific physical registers at lowering
time for small (<= 4 words?) transfer sizes.

* The ARM CSRet calling convention requires the hidden argument to be returned
by the callee.

//===---------------------------------------------------------------------===//

We can definitely do a better job on BB placement to eliminate some branches.
It's very common to see llvm generated assembly code that looks like this:

LBB3:
 ...
LBB4:
...
        beq LBB3
        b LBB2

If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4.  We can
then eliminate the beq and turn the unconditional branch to LBB2 into a bne.

See McCat/18-imp/ComputeBoundingBoxes for an example.

//===---------------------------------------------------------------------===//

Pre-/post- indexed load / stores:

1) We should not make the pre/post- indexed load/store transform if the base
ptr is guaranteed to be live beyond the load/store.  This can happen if the
base ptr is live out of the block we are performing the optimization on.  e.g.

mov r1, r2
ldr r3, [r1], #4
...

vs.

ldr r3, [r2]
add r1, r2, #4
...

In most cases, this is just a wasted optimization.  However, sometimes it can
negatively impact performance because two-address code is more restrictive
when it comes to scheduling.

Unfortunately, liveout information is currently unavailable during DAG combine
time.

2) Consider splitting an indexed load / store into a pair of add/sub +
   load/store to solve #1 (in TwoAddressInstructionPass.cpp).

3) Enhance LSR to generate more opportunities for indexed ops (see the loop
   sketch after this list).

4) Once we have added support for multiple result patterns, write indexed load
   patterns instead of C++ instruction selection code.

5) Use VLDM / VSTM to emulate indexed FP load / store.
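
A hypothetical loop for 3) above, where the pointer bump should fold into a
post-incremented load (ldr rX, [rP], #4) instead of a separate add:

int sum_words(const int *p, int n) {
  int s = 0;
  while (n-- > 0)
    s += *p++;
  return s;
}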

//===---------------------------------------------------------------------===//

Implement support for some more tricky ways to materialize immediates.  For
example, to get 0xffff8000, we can use:

mov r9, #&3f8000
sub r9, r9, #&400000

//===---------------------------------------------------------------------===//

We sometimes generate multiple add / sub instructions to update sp in the
prologue and epilogue if the inc / dec value is too large to fit in a single
immediate operand.  In some cases, perhaps it might be better to load the value
from a constantpool instead.

//===---------------------------------------------------------------------===//

GCC generates significantly better code for this function.

int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
    int i = 0;

    if (StackPtr != 0) {
        while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768)))
            Line[i++] = Stack[--StackPtr];
        if (LineLen > 32768)
        {
            while (StackPtr != 0 && i < LineLen)
            {
                i++;
                --StackPtr;
            }
        }
    }
    return StackPtr;
}

//===---------------------------------------------------------------------===//

This should compile to the mlas instruction:
int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }

//===---------------------------------------------------------------------===//

At some point, we should triage these to see if they still apply to us:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663

http://www.inf.u-szeged.hu/gcc-arm/
http://citeseer.ist.psu.edu/debus04linktime.html

//===---------------------------------------------------------------------===//

gcc generates smaller code for this function at -O2 or -Os:

void foo(signed char* p) {
  if (*p == 3)
    bar();
  else if (*p == 4)
    baz();
  else if (*p == 5)
    quux();
}

llvm decides it's a good idea to turn the repeated if...else into a
binary tree, as if it were a switch; the resulting code requires one fewer
compare-and-branch when *p<=2 or *p==5, the same number if *p==4
or *p>6, and one more if *p==3.  So it should be a speed win
(on balance).  However, the revised code is larger, with 4 conditional
branches instead of 3.

More seriously, there is a byte->word extend before
each comparison, where there should be only one, and the condition codes
are not remembered when the same two values are compared twice.

//===---------------------------------------------------------------------===//

More LSR enhancements possible:

1. Teach LSR about pre- and post- indexed ops to allow the iv increment to be
   merged into a load / store.
2. Allow iv reuse even when a type conversion is required.  For example, i8
   and i32 load / store addressing modes are identical.
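
A hedged illustration of 2.: a single induction variable could drive both the
byte and the word access below, even though the element types differ (the
function name is made up):

int dot8_32(const signed char *a, const int *b, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += a[i] * b[i];   /* one counter could feed both the ldrsb and the ldr */
  return s;
}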

//===---------------------------------------------------------------------===//

This:

int foo(int a, int b, int c, int d) {
  long long acc = (long long)a * (long long)b;
  acc += (long long)c * (long long)d;
  return (int)(acc >> 32);
}

Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies
two signed 32-bit values to produce a 64-bit value, and accumulates this with
a 64-bit value.

We currently get this with both v4 and v6:

_foo:
        smull r1, r0, r1, r0
        smull r3, r2, r3, r2
        adds r3, r3, r1
        adc r0, r2, r0
        bx lr

//===---------------------------------------------------------------------===//
Chris Lattnerbf8ae842007-09-10 21:43:18 +0000453
454This:
455 #include <algorithm>
456 std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
457 { return std::make_pair(a + b, a + b < a); }
458 bool no_overflow(unsigned a, unsigned b)
459 { return !full_add(a, b).second; }
460
461Should compile to:
462
463_Z8full_addjj:
464 adds r2, r1, r2
465 movcc r1, #0
466 movcs r1, #1
467 str r2, [r0, #0]
468 strb r1, [r0, #4]
469 mov pc, lr
470
471_Z11no_overflowjj:
472 cmn r0, r1
473 movcs r0, #0
474 movcc r0, #1
475 mov pc, lr
476
477not:
478
479__Z8full_addjj:
480 add r3, r2, r1
481 str r3, [r0]
482 mov r2, #1
483 mov r12, #0
484 cmp r3, r1
485 movlo r12, r2
486 str r12, [r0, #+4]
487 bx lr
488__Z11no_overflowjj:
489 add r3, r1, r0
490 mov r2, #1
491 mov r1, #0
492 cmp r3, r0
493 movhs r1, r2
494 mov r0, r1
495 bx lr
496
497//===---------------------------------------------------------------------===//

Some of the NEON intrinsics may be appropriate for more general use, either
as target-independent intrinsics or perhaps elsewhere in the ARM backend.
Some of them may also be lowered to target-independent SDNodes, and perhaps
some new SDNodes could be added.

For example, maximum, minimum, and absolute value operations are well-defined
and standard operations, both for vector and scalar types.

The current NEON-specific intrinsics for count leading zeros and count one
bits could perhaps be replaced by the target-independent ctlz and ctpop
intrinsics.  It may also make sense to add a target-independent "ctls"
intrinsic for "count leading sign bits".  Likewise, the backend could use
the target-independent SDNodes for these operations.
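
If such a ctls intrinsic were added, one way to think about it is the sketch
below (ctls32 is a made-up name; it uses GCC's __builtin_clz and assumes an
arithmetic right shift of a signed int):

unsigned ctls32(int x) {
  /* count the bits below the sign bit that are equal to the sign bit;
     x ^ (x >> 1) clears every bit above the first one that differs */
  unsigned y = (unsigned)(x ^ (x >> 1));
  return y ? __builtin_clz(y) - 1 : 31;   /* x == 0 or x == -1: all 31 match */
}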

ARMv6 has scalar saturating and halving adds and subtracts.  The same
intrinsics could possibly be used for both NEON's vector implementations of
those operations and the ARMv6 scalar versions.
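
For reference, a scalar signed saturating add written out in C (sat_add is a
made-up name; any saturation-status side effects of the real instructions are
not modelled here):

#include <limits.h>
int sat_add(int a, int b) {
  long long s = (long long)a + b;
  if (s > INT_MAX) return INT_MAX;   /* saturate instead of wrapping */
  if (s < INT_MIN) return INT_MIN;
  return (int)s;
}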

//===---------------------------------------------------------------------===//

ARM::MOVCCr is commutable (by flipping the condition).  But we need to
implement ARMInstrInfo::commuteInstruction() to support it.

//===---------------------------------------------------------------------===//

Split out LDR (literal) from the normal ARM LDR instruction.  Also consider
splitting LDR into imm12 and so_reg forms.  This allows us to clean up some
code.  e.g. ARMLoadStoreOptimizer does not need to look at LDR (literal) and
LDR (so_reg), while ARMConstantIslandPass only needs to worry about
LDR (literal).

//===---------------------------------------------------------------------===//

The constant island pass should make use of the full range of SoImm values for
LEApcrel.  Be careful though, as the last attempt caused infinite looping on
lencod.

//===---------------------------------------------------------------------===//

Predication issue.  This function:

extern unsigned array[ 128 ];
int foo( int x ) {
  int y;
  y = array[ x & 127 ];
  if ( x & 128 )
     y = 123456789 & ( y >> 2 );
  else
     y = 123456789 & y;
  return y;
}

compiles to:

_foo:
        and r1, r0, #127
        ldr r2, LCPI1_0
        ldr r2, [r2]
        ldr r1, [r2, +r1, lsl #2]
        mov r2, r1, lsr #2
        tst r0, #128
        moveq r2, r1
        ldr r0, LCPI1_1
        and r0, r2, r0
        bx lr

It would be better to do something like this, to fold the shift into the
conditional move:

        and r1, r0, #127
        ldr r2, LCPI1_0
        ldr r2, [r2]
        ldr r1, [r2, +r1, lsl #2]
        tst r0, #128
        movne r1, r1, lsr #2
        ldr r0, LCPI1_1
        and r0, r1, r0
        bx lr

It saves an instruction and a register.

//===---------------------------------------------------------------------===//

It might be profitable to cse MOVi16 if there are lots of 32-bit immediates
with the same bottom half.
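
A hypothetical case where this applies: two constants that share the bottom
16 bits, so the movw could be CSE'd and only a movt is needed for the second
value (store_pair is a made-up name):

void store_pair(unsigned *p) {
  p[0] = 0x12345678u;   /* movw #0x5678 ; movt #0x1234 */
  p[1] = 0xabcd5678u;   /* low half is already in place; one movt #0xabcd suffices */
}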

//===---------------------------------------------------------------------===//

Robert Muth started working on an alternate jump table implementation that
does not put the tables in-line in the text.  This is more like the llvm
default jump table implementation.  This might be useful sometime.  Several
revisions of patches are on the mailing list, beginning at:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2009-June/022763.html

//===---------------------------------------------------------------------===//

Make use of the "rbit" instruction.

//===---------------------------------------------------------------------===//

Take a look at test/CodeGen/Thumb2/machine-licm.ll.  ARM should be taught how
to licm and cse the unnecessary load from cp#1.

//===---------------------------------------------------------------------===//

The CMN instruction sets the flags like an ADD instruction, while CMP sets
them like a subtract.  Therefore, to be able to use CMN for comparisons other
than the Z bit, we'll need additional logic to reverse the conditionals
associated with the comparison.  Perhaps a pseudo-instruction for the
comparison, with a post-codegen pass to clean up and handle the condition
codes?  See PR5694 for a testcase.
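
A hedged example of where this matters (function names made up): comparing
against a negated value is the natural CMN pattern.

int eq_neg(int a, int b) {
  return a == -b;   /* cmn a, b is fine: only the Z bit is consulted */
}

int lt_neg(int a, int b) {
  return a < -b;    /* cmn sets flags like an add, so the signed condition must
                       be adjusted before cmn can replace cmp a, -b */
}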