//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend.
//===---------------------------------------------------------------------===//

Reimplement 'select' in terms of 'SEL'.

* We would really like to support UXTAB16, but we need to prove that the
  add doesn't overflow between the two 16-bit chunks (see the sketch after
  this list).

* Implement pre/post increment support.  (e.g. PR935)
* Coalesce stack slots!
* Implement smarter constant generation for binops with large immediates.

* Consider materializing FP constants like 0.0f and 1.0f using integer
  immediate instructions then copying to the FPU. Slower than a load into
  the FPU?
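
A minimal C model of the UXTAB16 pattern mentioned above (names are
illustrative): the two halfword adds are independent, so a plain 32-bit add
matches UXTAB16 only when the low add provably cannot carry into bit 16.

#include <stdint.h>

uint32_t uxtab16_model(uint32_t a, uint32_t b) {
    /* Add the zero-extended bytes b[7:0] and b[23:16] into the low and high
       halfwords of a; no carry propagates between the two halves. */
    uint32_t lo = (a + (b & 0x00ffu)) & 0xffffu;
    uint32_t hi = ((a >> 16) + ((b >> 16) & 0x00ffu)) & 0xffffu;
    return (hi << 16) | lo;
}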

//===---------------------------------------------------------------------===//

Crazy idea:  Consider code that uses lots of 8-bit or 16-bit values. By the
time regalloc happens, these values are now in a 32-bit register, usually with
the top bits known to be sign or zero extended. If spilled, we should be able
to spill these to an 8-bit or 16-bit stack slot, zero or sign extending as
part of the reload.

Doing this reduces the size of the stack frame (important for thumb etc), and
also increases the likelihood that we will be able to reload multiple values
from the stack with a single load.

//===---------------------------------------------------------------------===//

The constant island pass is in good shape.  Some cleanups might be desirable,
but there is unlikely to be much improvement in the generated code.

1.  There may be some advantage to trying to be smarter about the initial
placement, rather than putting everything at the end.

2.  There might be some compile-time efficiency to be had by representing
consecutive islands as a single block rather than multiple blocks.

3.  Use a priority queue to sort constant pool users in inverse order of
    position, so we always process the one closest to the end of the function
    first. This may simplify CreateNewWater.

//===---------------------------------------------------------------------===//

Eliminate copysign custom expansion. We are still generating crappy code with
default expansion + if-conversion.

//===---------------------------------------------------------------------===//

Eliminate one instruction from:

define i32 @_Z6slow4bii(i32 %x, i32 %y) {
	%tmp = icmp sgt i32 %x, %y
	%retval = select i1 %tmp, i32 %x, i32 %y
	ret i32 %retval
}

__Z6slow4bii:
	cmp r0, r1
	movgt r1, r0
	mov r0, r1
	bx lr
=>

__Z6slow4bii:
	cmp r0, r1
	movle r0, r1
	bx lr

//===---------------------------------------------------------------------===//

Implement long long "X-3" with instructions that fold the immediate in. These
were disabled due to badness with the ARM carry flag on subtracts.
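
For reference, the source pattern is simply the following (the folded
sequence in the comment is the hoped-for output; note that ARM sets the carry
flag to 1 on a subtract that does not borrow, the inverse of some other ISAs):

long long sub3(long long x) {
    /* Ideal lowering folds the immediate into the subtract pair, roughly:
         subs rlo, rlo, #3   @ sets carry = NOT borrow
         sbc  rhi, rhi, #0   @ consumes the carry for the high word */
    return x - 3;
}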

//===---------------------------------------------------------------------===//

We currently compile abs:
int foo(int p) { return p < 0 ? -p : p; }

into:

_foo:
	rsb r1, r0, #0
	cmn r0, #1
	movgt r1, r0
	mov r0, r1
	bx lr

This is very, uh, literal.  This could be a 3 operation sequence:
  t = (p sra 31);
  res = (p xor t)-t

Which would be better.  This occurs in png decode.
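
In C, the three-operation sequence is (assuming the usual arithmetic right
shift on signed int; p == INT_MIN is undefined either way):

int iabs(int p) {
    int t = p >> 31;     /* all ones if p is negative, all zeros otherwise */
    return (p ^ t) - t;  /* conditional negate, no branch */
}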

//===---------------------------------------------------------------------===//

More load / store optimizations:
1) Better representation for block transfer? This is from Olden/power:

	fldd d0, [r4]
	fstd d0, [r4, #+32]
	fldd d0, [r4, #+8]
	fstd d0, [r4, #+40]
	fldd d0, [r4, #+16]
	fstd d0, [r4, #+48]
	fldd d0, [r4, #+24]
	fstd d0, [r4, #+56]

If we can spare the registers, it would be better to use fldm and fstm here.
Need major register allocator enhancement though.

2) Can we recognize the relative position of constantpool entries? i.e. Treat

	ldr r0, LCPI17_3
	ldr r1, LCPI17_4
	ldr r2, LCPI17_5

   as
	ldr r0, LCPI17
	ldr r1, LCPI17+4
	ldr r2, LCPI17+8

   Then the ldr's can be combined into a single ldm. See Olden/power.

Note that for ARM v4, gcc uses ldmia to load a pair of 32-bit values to
represent a 64-bit double FP constant:

	adr r0, L6
	ldmia r0, {r0-r1}

	.align 2
L6:
	.long -858993459
	.long 1074318540

3) Struct copies appear to be done field by field
instead of by words, at least sometimes:

struct foo { int x; short s; char c1; char c2; };
void cpy(struct foo*a, struct foo*b) { *a = *b; }

llvm code (-O2)
	ldrb r3, [r1, #+6]
	ldr r2, [r1]
	ldrb r12, [r1, #+7]
	ldrh r1, [r1, #+4]
	str r2, [r0]
	strh r1, [r0, #+4]
	strb r3, [r0, #+6]
	strb r12, [r0, #+7]
gcc code (-O2)
	ldmia r1, {r1-r2}
	stmia r0, {r1-r2}

In this benchmark, poor handling of aggregate copies has shown up as
having a large effect on size, and possibly speed as well (we don't have
a good way to measure on ARM).

//===---------------------------------------------------------------------===//

* Consider this silly example:

double bar(double x) {
  double r = foo(3.1);
  return x+r;
}

_bar:
	stmfd sp!, {r4, r5, r7, lr}
	add r7, sp, #8
	mov r4, r0
	mov r5, r1
	fldd d0, LCPI1_0
	fmrrd r0, r1, d0
	bl _foo
	fmdrr d0, r4, r5
	fmsr s2, r0
	fsitod d1, s2
	faddd d0, d1, d0
	fmrrd r0, r1, d0
	ldmfd sp!, {r4, r5, r7, pc}

Ignore the prologue and epilogue stuff for a second. Note
	mov r4, r0
	mov r5, r1
the copies to callee-save registers and the fact that they are only used by
the fmdrr instruction. It would have been better had the fmdrr been scheduled
before the call, placing the result in a callee-save DPR register. The two
mov ops would not have been necessary.

//===---------------------------------------------------------------------===//

Calling convention related stuff:

* gcc's parameter passing implementation is terrible and we suffer as a result:

e.g.
struct s {
  double d1;
  int s1;
};

void foo(struct s S) {
  printf("%g, %d\n", S.d1, S.s1);
}

'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and
then reloads them into r1, r2, and r3 before issuing the call (r0 contains the
address of the format string):

	stmfd sp!, {r7, lr}
	add r7, sp, #0
	sub sp, sp, #12
	stmia sp, {r0, r1, r2}
	ldmia sp, {r1-r2}
	ldr r0, L5
	ldr r3, [sp, #8]
L2:
	add r0, pc, r0
	bl L_printf$stub

Instead of a stmia, an ldmia, and an ldr, wouldn't it be better to do three
moves?

* Returning an aggregate type is even worse:

e.g.
struct s foo(void) {
  struct s S = {1.1, 2};
  return S;
}

	mov ip, r0
	ldr r0, L5
	sub sp, sp, #12
L2:
	add r0, pc, r0
	@ lr needed for prologue
	ldmia r0, {r0, r1, r2}
	stmia sp, {r0, r1, r2}
	stmia ip, {r0, r1, r2}
	mov r0, ip
	add sp, sp, #12
	bx lr

r0 (and later ip) is the hidden parameter from the caller giving the address
in which to store the return value. The first ldmia loads the constants into
r0, r1, r2. The last stmia stores r0, r1, r2 into the address passed in.
However, there is one additional stmia that stores r0, r1, and r2 to some
stack location. The store is dead.

The llvm-gcc generated code looks like this:

csretcc void %foo(%struct.s* %agg.result) {
entry:
	%S = alloca %struct.s, align 4		; <%struct.s*> [#uses=1]
	%memtmp = alloca %struct.s		; <%struct.s*> [#uses=1]
	cast %struct.s* %S to sbyte*		; <sbyte*>:0 [#uses=2]
	call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
	cast %struct.s* %agg.result to sbyte*		; <sbyte*>:1 [#uses=2]
	call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
	cast %struct.s* %memtmp to sbyte*		; <sbyte*>:2 [#uses=1]
	call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
	ret void
}

llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from the
constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated
into a number of loads and stores, or 2) custom lower memcpy (of small size)
to be ldmia / stmia. I think option 2 is better, but the current register
allocator cannot allocate a chunk of registers at a time.

A feasible temporary solution is to use specific physical registers at
lowering time for small (<= 4 words?) transfer sizes.

* ARM CSRet calling convention requires the hidden argument to be returned by
the callee.

//===---------------------------------------------------------------------===//

We can definitely do a better job on BB placements to eliminate some branches.
It's very common to see llvm generated assembly code that looks like this:

LBB3:
 ...
LBB4:
...
 beq LBB3
 b LBB2

If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can
then eliminate the beq and turn the unconditional branch to LBB2 into a bne.

See McCat/18-imp/ComputeBoundingBoxes for an example.

//===---------------------------------------------------------------------===//

Pre-/post- indexed load / stores:

1) We should not make the pre/post- indexed load/store transform if the base
ptr is guaranteed to be live beyond the load/store. This can happen if the
base ptr is live out of the block in which we are performing the
optimization. e.g.

mov r1, r2
ldr r3, [r1], #4
...

vs.

ldr r3, [r2]
add r1, r2, #4
...

In most cases, this is just a wasted optimization. However, sometimes it can
negatively impact performance because two-address code is more restrictive
when it comes to scheduling.

Unfortunately, live-out information is currently unavailable during DAG
combine time.

2) Consider splitting an indexed load / store into a pair of add/sub +
   load/store to solve #1 (in TwoAddressInstructionPass.cpp).

3) Enhance LSR to generate more opportunities for indexed ops.

4) Once we add support for multiple-result patterns, write indexed load
   patterns instead of C++ instruction selection code.

5) Use FLDM / FSTM to emulate indexed FP load / store.

//===---------------------------------------------------------------------===//

Implement support for some more tricky ways to materialize immediates.  For
example, to get 0xffff8000, we can use:

mov r9, #&3f8000
sub r9, r9, #&400000
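
A quick C sanity check of that arithmetic (both constants are encodable as
ARM rotated immediates; the subtraction wraps around to the desired value):

#include <assert.h>
#include <stdint.h>

int main(void) {
    uint32_t r9 = 0x3f8000u;  /* mov r9, #&3f8000 */
    r9 -= 0x400000u;          /* sub r9, r9, #&400000 (wraps modulo 2^32) */
    assert(r9 == 0xffff8000u);
    return 0;
}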

//===---------------------------------------------------------------------===//

We sometimes generate multiple add / sub instructions to update sp in prologue
and epilogue if the inc / dec value is too large to fit in a single immediate
operand. In some cases, it might be better to load the value from a
constantpool instead.

//===---------------------------------------------------------------------===//

GCC generates significantly better code for this function.

int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
  int i = 0;

  if (StackPtr != 0) {
    while (StackPtr != 0 && i < (((LineLen) < (32768)) ? (LineLen) : (32768)))
      Line[i++] = Stack[--StackPtr];
    if (LineLen > 32768) {
      while (StackPtr != 0 && i < LineLen) {
        i++;
        --StackPtr;
      }
    }
  }
  return StackPtr;
}

//===---------------------------------------------------------------------===//

This should compile to the mlas instruction:
int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }

//===---------------------------------------------------------------------===//

At some point, we should triage these to see if they still apply to us:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663

http://www.inf.u-szeged.hu/gcc-arm/
http://citeseer.ist.psu.edu/debus04linktime.html

//===---------------------------------------------------------------------===//

gcc generates smaller code for this function at -O2 or -Os:

void foo(signed char* p) {
  if (*p == 3)
    bar();
  else if (*p == 4)
    baz();
  else if (*p == 5)
    quux();
}

llvm decides it's a good idea to turn the repeated if...else into a
binary tree, as if it were a switch; the resulting code requires one fewer
compare-and-branch when *p<=2 or *p==5, the same number when *p==4
or *p>6, and one more when *p==3. So it should be a speed win
(on balance). However, the revised code is larger, with 4 conditional
branches instead of 3.

More seriously, there is a byte->word extend before
each comparison, where there should be only one, and the condition codes
are not remembered when the same two values are compared twice.

//===---------------------------------------------------------------------===//

More register scavenging work:

1. Use the register scavenger to track frame indices materialized into
   registers (those that do not fit in addressing modes) to allow reuse in
   the same BB.
2. Finish scavenging for Thumb.

//===---------------------------------------------------------------------===//

More LSR enhancements possible:

1. Teach LSR about pre- and post- indexed ops to allow the iv increment to be
   merged into a load / store.
2. Allow iv reuse even when a type conversion is required. For example, i8
   and i32 load / store addressing modes are identical.

//===---------------------------------------------------------------------===//

This:

int foo(int a, int b, int c, int d) {
  long long acc = (long long)a * (long long)b;
  acc += (long long)c * (long long)d;
  return (int)(acc >> 32);
}

Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies
two signed 32-bit values to produce a 64-bit value, and accumulates this with
a 64-bit value.

We currently get this with both v4 and v6:

_foo:
	smull r1, r0, r1, r0
	smull r3, r2, r3, r2
	adds r3, r3, r1
	adc r0, r2, r0
	bx lr

//===---------------------------------------------------------------------===//

This:
	#include <algorithm>
	std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
	{ return std::make_pair(a + b, a + b < a); }
	bool no_overflow(unsigned a, unsigned b)
	{ return !full_add(a, b).second; }

Should compile to:

_Z8full_addjj:
	adds r2, r1, r2
	movcc r1, #0
	movcs r1, #1
	str r2, [r0, #0]
	strb r1, [r0, #4]
	mov pc, lr

_Z11no_overflowjj:
	cmn r0, r1
	movcs r0, #0
	movcc r0, #1
	mov pc, lr

not:

__Z8full_addjj:
	add r3, r2, r1
	str r3, [r0]
	mov r2, #1
	mov r12, #0
	cmp r3, r1
	movlo r12, r2
	str r12, [r0, #+4]
	bx lr
__Z11no_overflowjj:
	add r3, r1, r0
	mov r2, #1
	mov r1, #0
	cmp r3, r0
	movhs r1, r2
	mov r0, r1
	bx lr

//===---------------------------------------------------------------------===//

Some of the NEON intrinsics may be appropriate for more general use, either
as target-independent intrinsics or perhaps elsewhere in the ARM backend.
Some of them may also be lowered to target-independent SDNodes, and perhaps
some new SDNodes could be added.

For example, maximum, minimum, and absolute value operations are well-defined
and standard operations, both for vector and scalar types.

The current NEON-specific intrinsics for count leading zeros and count one
bits could perhaps be replaced by the target-independent ctlz and ctpop
intrinsics. It may also make sense to add a target-independent "ctls"
intrinsic for "count leading sign bits". Likewise, the backend could use
the target-independent SDNodes for these operations.
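
A reference C model of the proposed "ctls" semantics (a sketch, assuming the
convention of ARM's CLS: count the bits following the sign bit that are equal
to it, excluding the sign bit itself):

#include <stdint.h>

int ctls32(int32_t x) {
    uint32_t y = (uint32_t)(x ^ (x >> 31)); /* bits differing from the sign
                                               bit; assumes arithmetic >> */
    int n = 0;                              /* count leading zeros of y */
    while (n < 32 && !(y & (0x80000000u >> n)))
        n++;
    return n - 1;                           /* y == 0 yields 32 - 1 == 31 */
}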

ARMv6 has scalar saturating and halving adds and subtracts. The same
intrinsics could possibly be used for both NEON's vector implementations of
those operations and the ARMv6 scalar versions.

//===---------------------------------------------------------------------===//

ARM::MOVCCr is commutable (by flipping the condition). But we need to implement
ARMInstrInfo::commuteInstruction() to support it.

//===---------------------------------------------------------------------===//

Split out LDR (literal) from the normal ARM LDR instruction. Also consider
splitting LDR into imm12 and so_reg forms. This allows us to clean up some
code. e.g. ARMLoadStoreOptimizer does not need to look at LDR (literal) and
LDR (so_reg), while ARMConstantIslandPass only needs to worry about
LDR (literal).

//===---------------------------------------------------------------------===//

We need to fix constant isel for ARMv6t2 to use MOVT.
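
A sketch of what MOVW/MOVT buys us (the constant is an arbitrary example):
any 32-bit immediate in two instructions, with no constant-pool load.

#include <stdint.h>

uint32_t make_const(void) {
    uint32_t r0 = 0xbeefu;                  /* movw r0, #0xbeef -- sets the
                                               low half, zeroes the high */
    r0 = (r0 & 0xffffu) | (0xdeadu << 16);  /* movt r0, #0xdead -- sets the
                                               high half, keeps the low */
    return r0;                              /* 0xdeadbeef */
}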

//===---------------------------------------------------------------------===//

Constant island pass should make use of full range SoImm values for LEApcrel.
Be careful though as the last attempt caused infinite looping on lencod.

//===---------------------------------------------------------------------===//

Predication issue. This function:

extern unsigned array[ 128 ];
int foo( int x ) {
  int y;
  y = array[ x & 127 ];
  if ( x & 128 )
    y = 123456789 & ( y >> 2 );
  else
    y = 123456789 & y;
  return y;
}

compiles to:

_foo:
	and r1, r0, #127
	ldr r2, LCPI1_0
	ldr r2, [r2]
	ldr r1, [r2, +r1, lsl #2]
	mov r2, r1, lsr #2
	tst r0, #128
	moveq r2, r1
	ldr r0, LCPI1_1
	and r0, r2, r0
	bx lr

It would be better to do something like this, to fold the shift into the
conditional move:

	and r1, r0, #127
	ldr r2, LCPI1_0
	ldr r2, [r2]
	ldr r1, [r2, +r1, lsl #2]
	tst r0, #128
	movne r1, r1, lsr #2
	ldr r0, LCPI1_1
	and r0, r1, r0
	bx lr

it saves an instruction and a register.

//===---------------------------------------------------------------------===//

add/sub/and/or + i32 imm can be simplified by folding part of the immediate
into the operation.
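
A C sketch of the idea (constants chosen for illustration): an immediate that
does not fit ARM's rotated-imm8 encoding can be split so that each piece does.

#include <stdint.h>

uint32_t add_big_imm(uint32_t x) {
    /* x + 0x10001 as "add x, #0x10000" then "add x, #1": two adds with
       encodable immediates, instead of materializing 0x10001 in a register
       first and then adding it. */
    x += 0x10000u;
    x += 0x1u;
    return x;
}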

//===---------------------------------------------------------------------===//

It might be profitable to cse MOVi16 if there are lots of 32-bit immediates
with the same bottom half.

//===---------------------------------------------------------------------===//

Robert Muth started working on an alternate jump table implementation that
does not put the tables in-line in the text. This is more like the llvm
default jump table implementation. This might be useful sometime. Several
revisions of patches are on the mailing list, beginning at:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2009-June/022763.html

//===---------------------------------------------------------------------===//

Make use of the "rbit" instruction.
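
For reference, rbit reverses the bit order of a 32-bit value. A portable C
model of the operation it should replace (a sketch):

#include <stdint.h>

uint32_t rbit32(uint32_t x) {
    /* Swap adjacent bits, then 2-bit pairs, then nibbles, then bytes. */
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
    x = ((x & 0x0f0f0f0fu) << 4) | ((x >> 4) & 0x0f0f0f0fu);
    x = (x << 24) | ((x & 0xff00u) << 8) |
        ((x >> 8) & 0xff00u) | (x >> 24);
    return x;
}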