//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform
* implement __builtin_trap (ISD::TRAP) as 'tw 31, 0, 0' aka 'trap'
* lmw/stmw pass a la arm load store optimizer for prolog/epilog

===-------------------------------------------------------------------------===

Support 'update' load/store instructions.  These are cracked on the G5, but are
still a codesize win.

With preinc enabled, this:

long *%test4(long *%X, long *%dest) {
        %Y = getelementptr long* %X, int 4
        %A = load long* %Y
        store long %A, long* %dest
        ret long* %Y
}

compiles to:

_test4:
        mr r2, r3
        lwzu r5, 32(r2)
        lwz r3, 36(r3)
        stw r5, 0(r4)
        stw r3, 4(r4)
        mr r3, r2
        blr

with -sched=list-burr, I get:

_test4:
        lwz r2, 36(r3)
        lwzu r5, 32(r3)
        stw r2, 4(r4)
        stw r5, 0(r4)
        blr

===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        li r6, 0
        b LBB1_84       ;bb432.i
LBB1_83:                ;bb420.i
        lbzx r8, r5, r7
        addi r6, r7, 1
        stbx r8, r4, r7
LBB1_84:                ;bb432.i
        mr r7, r6
        cmplwi cr0, r7, 143
        bne cr0, LBB1_83        ;bb420.i

The CBE manages to produce:

        li r0, 143
        mtctr r0
loop:
        lbzx r2, r2, r11
        stbx r0, r2, r9
        addi r2, r2, 1
        bdz later
        b loop

This could be much better (bdnz instead of bdz) but it still beats us.  If we
produced this with bdnz, the loop would be a single dispatch group.
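
A rough sketch of the bdnz form (register assignments are illustrative, not
necessarily what the allocator would pick):

        li r0, 143
        mtctr r0
loop:
        lbzx r8, r5, r7
        stbx r8, r4, r7
        addi r7, r7, 1
        bdnz loop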

===-------------------------------------------------------------------------===

Compile:

void foo(int *P) {
  if (P) *P = 0;
}

into:

_foo:
        cmpwi cr0,r3,0
        beqlr cr0
        li r0,0
        stw r0,0(r3)
        blr

This is effectively a simple form of predication.

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.
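
Roughly, assuming the four pool entries were laid out contiguously starting at
.CPI_X_0 (a sketch):

_X:
        lis r2, ha16(.CPI_X_0)
        la r2, lo16(.CPI_X_0)(r2)
        lfd f0, 0(r2)
        lfd f2, 8(r2)
        fmadd f0, f1, f0, f2
        lfd f1, 16(r2)
        lfd f2, 24(r2)
        fmadd f1, f0, f1, f2
        blr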
117
Chris Lattner39b248b2006-02-02 23:50:22 +0000118Note that this (and the static variable version) is discussed here for GCC:
119http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
120
Chris Lattneraabd0352007-08-23 15:16:03 +0000121Here's another example (the sgn function):
122double testf(double a) {
123 return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
124}
125
126it produces a BB like this:
127LBB1_1: ; cond_true
128 lis r2, ha16(LCPI1_0)
129 lfs f0, lo16(LCPI1_0)(r2)
130 lis r2, ha16(LCPI1_1)
131 lis r3, ha16(LCPI1_2)
132 lfs f2, lo16(LCPI1_2)(r3)
133 lfs f3, lo16(LCPI1_1)(r2)
134 fsub f0, f0, f1
135 fsel f1, f0, f2, f3
136 blr
137
Chris Lattnera3c44542005-08-24 18:15:24 +0000138===-------------------------------------------------------------------------===
Nate Begeman92cce902005-09-06 15:30:48 +0000139
Chris Lattner33c1dab2006-02-03 06:22:11 +0000140PIC Code Gen IPO optimization:
141
142Squish small scalar globals together into a single global struct, allowing the
143address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
144of the GOT on targets with one).
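
A sketch of the idea in C (names are illustrative):

/* Before: each global needs its own PIC/GOT access. */
static int x, y;

/* After: one struct whose base address can be CSE'd; the members are then
   just constant offsets from it. */
static struct { int x, y; } g;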

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implement divide as multiply by reciprocal when it has
more than one use.  Itanium will want this too.
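
For the reciprocal, each Newton-Raphson step refines an estimate x ~= 1/a via
x' = x * (2 - a*x).  A sketch in C, where fres_estimate stands in for the
result of the PPC fres instruction (the name is illustrative):

float recip_refined(float a, float fres_estimate) {
  float x = fres_estimate;
  x = x * (2.0f - a * x);    /* one refinement step; repeat as needed */
  return x;
}

A divide b/a whose reciprocal has multiple uses then becomes one refined
reciprocal followed by multiplies.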

===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240        ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy.  We make this currently:

_f1:
        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31
        or r3, r2, r2
        blr

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction.  The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the 2-addr instr.

===-------------------------------------------------------------------------===

Compile offsets from allocas:

int *%test() {
        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1
        ret int* %Y
}

into a single add, not two:

_test:
        addi r2, r1, -8
        addi r3, r2, 4
        blr

--> important for C++.

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

Have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

  fp = &bar;
  for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin-specific and would have to be done in the code
generator.  Probably not a win on x86.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64-bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very, very
bad.  Add something like a BIT_CONVERT to LLVM, then do an interprocedural
transformation that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Compile this:

int foo(int a) {
  int b = (a < 8);
  if (b) {
    return b * 3;     // ignore the fact that this is always 3.
  } else {
    return 2;
  }
}

into something not this:

_foo:
1)      cmpwi cr7, r3, 8
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
1)      cmpwi cr0, r3, 7
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
LBB1_1: ; then
        rlwinm r2, r2, 0, 31, 31
        mulli r3, r2, 3
        blr
LBB1_2: ; UnifiedReturnBlock
        li r3, 2
        blr

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists.  In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it.  That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops.  A simple example:

int foo(int a, int b) { return (a < b) << 4; }

compiles into:

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
        slwi r3, r2, 4
        blr
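
With the crbit exposed, the extract and the shift could be folded into one
rotate-and-mask.  A sketch, assuming cr7's LT bit is CR bit 28, so rotating
left by 29+4 = 33 = 1 lands it at bit 27:

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r3, r2, 1, 27, 27
        blr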

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so that
this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

which currently compiles to:

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

instead becomes:

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {
  signed char t = 0;
  if (b) t = *a;
  if (c) *a = t;
}

===-------------------------------------------------------------------------===

This:
int test(unsigned *P) { return *P >> 24; }

Should compile to:

_test:
        lbz r3,0(r3)
        blr

not:

_test:
        lwz r2, 0(r3)
        srwi r3, r2, 24
        blr

===-------------------------------------------------------------------------===

On the G5, logical CR operations are more expensive in their three-address
form: ops that read/write the same register are half as expensive as those
that read from two registers that are different from their destination.

We should model this with two separate instructions.  The isel should generate
the "two-address" form of the instructions.  When the register allocator
detects that it needs to insert a copy due to the two-addressness of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress.  At this point
we can convert to the "three-address" instruction, to save code space.

This only matters when we start generating CR logical ops.

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).

===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to GCC.

We could also strength-reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
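
A sketch of the strength-reduced form (assuming X > 0), maintaining the
quotient and remainder incrementally instead of dividing every iteration:

int foo(int N, int ***W, int **TK, int X) {
  int t, i, q = 0, r = 0;    /* invariant: q == t / X, r == t % X */

  for (t = 0; t < N; ++t) {
    for (i = 0; i < 4; ++i)
      W[q][i][r] = TK[i][t];
    if (++r == X) { r = 0; ++q; }
  }

  return 5;
}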

===-------------------------------------------------------------------------===

float foo(float X) { return (int)(X); }

Currently produces:

_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        extsw r2, r2
        std r2, -16(r1)
        lfd f0, -16(r1)
        fcfid f0, f0
        frsp f1, f0
        blr

We could use a target dag combine to turn the lwz/extsw into an lwa when the
lwz has a single use.  Since LWA is cracked anyway, this would be a codesize
win only.
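
That is, something like (a sketch):

_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwa r2, -4(r1)
        std r2, -16(r1)
        lfd f0, -16(r1)
        fcfid f0, f0
        frsp f1, f0
        blr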

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}

===-------------------------------------------------------------------------===

Complete the signed i32 to FP conversion code that uses 64-bit registers; this
transformation is good for PI.  See PPCISelLowering.cpp, this comment:

  // FIXME: disable this lowered code.  This generates 64-bit register values,
  // and we don't model the fact that the top part is clobbered by calls.  We
  // need to flag these together so that the value isn't live across a call.
  //setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);

Also, if the registers are spilled to the stack, we have to ensure that all
64 bits of them are saved/restored, otherwise we will miscompile the code.  It
sounds like we need to get the 64-bit register classes going.

===-------------------------------------------------------------------------===

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
entry:
        %tmp = bitcast %struct.B* %b to i32*          ; <uint*> [#uses=1]
        %tmp = load i32* %tmp                         ; <uint> [#uses=1]
        %tmp3 = bitcast %struct.B* %b to i32*         ; <uint*> [#uses=1]
        %tmp4 = load i32* %tmp3                       ; <uint> [#uses=1]
        %tmp8 = bitcast %struct.B* %b to i32*         ; <uint*> [#uses=2]
        %tmp9 = load i32* %tmp8                       ; <uint> [#uses=1]
        %tmp4.mask17 = shl i32 %tmp4, i8 1            ; <uint> [#uses=1]
        %tmp1415 = and i32 %tmp4.mask17, 2147483648   ; <uint> [#uses=1]
        %tmp.masked = and i32 %tmp, 2147483648        ; <uint> [#uses=1]
        %tmp11 = or i32 %tmp1415, %tmp.masked         ; <uint> [#uses=1]
        %tmp12 = and i32 %tmp9, 2147483647            ; <uint> [#uses=1]
        %tmp13 = or i32 %tmp12, %tmp11                ; <uint> [#uses=1]
        store i32 %tmp13, i32* %tmp8
        ret void
}

We emit:

_foo:
        lwz r2, 0(r3)
        slwi r4, r2, 1
        or r4, r4, r2
        rlwimi r2, r4, 0, 0, 0
        stw r2, 0(r3)
        blr

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

_foo:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We compile:

unsigned test6(unsigned x) {
  return ((x & 0x00FF0000) >> 16) | ((x & 0x000000FF) << 16);
}

into:

_test6:
        lis r2, 255
        rlwinm r3, r3, 16, 0, 31
        ori r2, r2, 255
        and r3, r3, r2
        blr

GCC gets it down to:

_test6:
        rlwinm r0,r3,16,8,15
        rlwinm r3,r3,16,24,31
        or r3,r3,r0
        blr

===-------------------------------------------------------------------------===

Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR
register.  This ends up producing code like this:

_foo:
.LBB_foo_0:     ; entry
        mflr r11
***     stw r11, 8(r1)
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
***     lwz r11, 8(r1)
        mtlr r11
        blr

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instrs): spilling it to a GPR is quite enough.
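
For example (a sketch, assuming r11 is otherwise free across the body so the
LR value can simply stay in it):

_foo:
.LBB_foo_0:     ; entry
        mflr r11
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
        mtlr r11
        blr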

Implementing this will require some codegen improvements.  Nate writes:

"So basically what we need to support the "no stack frame save and restore" is
a generalization of the LR optimization to "callee-save regs".

Currently, we have LR marked as a callee-save reg.  The register allocator sees
that it's callee save, and spills it directly to the stack.

Ideally, something like this would happen:

LR would be in a separate register class from the GPRs.  The class of LR would
be marked "unspillable".  When the register allocator came across an
unspillable reg, it would ask "what is the best class to copy this into that I
*can* spill?"  If it gets a class back, which it will in this case (the gprs),
it grabs a free register of that class.  If it is then later necessary to spill
that reg, so be it."

===-------------------------------------------------------------------------===

We compile this:
int test(_Bool X) {
  return X ? 524288 : 0;
}

to:
_test:
        cmplwi cr0, r3, 0
        lis r2, 8
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

instead of:
_test:
        addic r2,r3,-1
        subfe r0,r2,r3
        slwi r3,r0,19
        blr

This sort of thing occurs a lot due to globalopt.

===-------------------------------------------------------------------------===

We currently compile 32-bit bswap:

declare i32 @llvm.bswap.i32(i32 %A)
define i32 @test(i32 %A) {
        %B = call i32 @llvm.bswap.i32(i32 %A)
        ret i32 %B
}

to:

_test:
        rlwinm r2, r3, 24, 16, 23
        slwi r4, r3, 24
        rlwimi r2, r3, 8, 24, 31
        rlwimi r4, r3, 8, 8, 15
        rlwimi r4, r2, 0, 16, 31
        mr r3, r4
        blr

it would be more efficient to produce:

_foo:   mr r0,r3
        rlwinm r3,r3,8,0xffffffff
        rlwimi r3,r0,24,0,7
        rlwimi r3,r0,24,16,23
        blr

===-------------------------------------------------------------------------===

test/CodeGen/PowerPC/2007-03-24-cntlzd.ll compiles to:

__ZNK4llvm5APInt17countLeadingZerosEv:
        ld r2, 0(r3)
        cntlzd r2, r2
        or r2, r2, r2     <<-- silly.
        addi r3, r2, -64
        blr

The dead or is a 'truncate' from 64 to 32 bits.
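
With the truncate folded away, this would simply be:

__ZNK4llvm5APInt17countLeadingZerosEv:
        ld r2, 0(r3)
        cntlzd r2, r2
        addi r3, r2, -64
        blr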

===-------------------------------------------------------------------------===

We generate horrible PPC code for this:

#define N  2000000
double   a[N],c[N];
void simpleloop() {
   int j;
   for (j=0; j<N; j++)
     c[j] = a[j];
}

LBB1_1: ;bb
        lfdx f0, r3, r4
        addi r5, r5, 1                 ;; Extra IV for the exit value compare.
        stfdx f0, r2, r4
        addi r4, r4, 8

        xoris r6, r5, 30               ;; This is due to a large immediate.
        cmplwi cr0, r6, 33920
        bne cr0, LBB1_1
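
A CTR-based loop would avoid both the extra IV and the large compare constant
(a sketch; note that 2000000 = 30<<16 | 33920, and register choices are
illustrative):

        lis r5, 30
        ori r5, r5, 33920
        mtctr r5
LBB1_1: ;bb
        lfdx f0, r3, r4
        stfdx f0, r2, r4
        addi r4, r4, 8
        bdnz LBB1_1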

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

__Z11no_overflowjj:
        add r4,r3,r4
        subfc r3,r3,r4
        li r3,0
        adde r3,r3,r3
        blr

(or better) not:

__Z11no_overflowjj:
        add r2, r4, r3
        cmplw cr7, r2, r3
        mfcr r2
        rlwinm r2, r2, 29, 31, 31
        xori r3, r2, 1
        blr

//===---------------------------------------------------------------------===//

We compile some FP comparisons into an mfcr with two rlwinms and an or.  For
example:
#include <math.h>
int test(double x, double y) { return islessequal(x, y); }
int test2(double x, double y) { return islessgreater(x, y); }
int test3(double x, double y) { return !islessequal(x, y); }

Compiles into (all three are similar, but the bits differ):

_test:
        fcmpu cr7, f1, f2
        mfcr r2
        rlwinm r3, r2, 29, 31, 31
        rlwinm r2, r2, 31, 31, 31
        or r3, r2, r3
        blr

GCC compiles this into:

_test:
        fcmpu cr7,f1,f2
        cror 30,28,30
        mfcr r3
        rlwinm r3,r3,31,1
        blr

which is more efficient and can use mfocrf.  See PR642 for some more context.

//===---------------------------------------------------------------------===//