//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform

===-------------------------------------------------------------------------===

Support 'update' load/store instructions. These are cracked on the G5, but are
still a codesize win.

With preinc enabled, this:

long *%test4(long *%X, long *%dest) {
        %Y = getelementptr long* %X, int 4
        %A = load long* %Y
        store long %A, long* %dest
        ret long* %Y
}

compiles to:

_test4:
        mr r2, r3
        lwzu r5, 32(r2)
        lwz r3, 36(r3)
        stw r5, 0(r4)
        stw r3, 4(r4)
        mr r3, r2
        blr

with -sched=list-burr, I get:

_test4:
        lwz r2, 36(r3)
        lwzu r5, 32(r3)
        stw r2, 4(r4)
        stw r5, 0(r4)
        blr

===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        li r6, 0
        b LBB1_84       ;bb432.i
LBB1_83:                ;bb420.i
        lbzx r8, r5, r7
        addi r6, r7, 1
        stbx r8, r4, r7
LBB1_84:                ;bb432.i
        mr r7, r6
        cmplwi cr0, r7, 143
        bne cr0, LBB1_83        ;bb420.i

The CBE manages to produce:

        li r0, 143
        mtctr r0
loop:
        lbzx r2, r2, r11
        stbx r0, r2, r9
        addi r2, r2, 1
        bdz later
        b loop

This could be much better (bdnz instead of bdz) but it still beats us. If we
produced this with bdnz, the loop would be a single dispatch group.
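
For illustration, a bdnz version of the loop might look like this (a sketch
with register assignments borrowed from the code above, not verified output):

        li r0, 143              ; trip count
        mtctr r0                ; ctr = 143
        li r7, 0                ; byte index
loop:
        lbzx r8, r5, r7
        stbx r8, r4, r7
        addi r7, r7, 1
        bdnz loop               ; decrement ctr, branch while nonzero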

===-------------------------------------------------------------------------===

Compile:

void foo(int *P) {
  if (P)  *P = 0;
}

into:

_foo:
        cmpwi cr0,r3,0
        beqlr cr0
        li r0,0
        stw r0,0(r3)
        blr

This is effectively a simple form of predication.

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start. For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's. This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
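
For example, with one base for the pool (a sketch; .CPI_X stands for the start
of the pool, and the offsets assume the four constants are laid out
contiguously):

_X:
        lis r2, ha16(.CPI_X)
        la r2, lo16(.CPI_X)(r2)         ; r2 = base of the constant pool
        lfd f0, 0(r2)
        lfd f2, 8(r2)
        fmadd f0, f1, f0, f2
        lfd f1, 16(r2)
        lfd f2, 24(r2)
        fmadd f1, f0, f1, f2
        blr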

Here's another example (the sgn function):
double testf(double a) {
  return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

it produces a BB like this:
LBB1_1: ; cond_true
        lis r2, ha16(LCPI1_0)
        lfs f0, lo16(LCPI1_0)(r2)
        lis r2, ha16(LCPI1_1)
        lis r3, ha16(LCPI1_2)
        lfs f2, lo16(LCPI1_2)(r3)
        lfs f3, lo16(LCPI1_1)(r2)
        fsub f0, f0, f1
        fsel f1, f0, f2, f3
        blr

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
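
As a sketch of the idea (names are illustrative), the transformation would
rewrite:

static int x, y, z;
int sum(void) { return x + y + z; }

into:

static struct { int x, y, z; } _globals;
int sum(void) { return _globals.x + _globals.y + _globals.z; }

so that the address of _globals is materialized once and CSE'd, with each
field reached via a small constant offset.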

===-------------------------------------------------------------------------===

Implement Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implement divide as multiply by reciprocal when it has
more than one use. Itanium will want this too.
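
For reference, one Newton-Raphson refinement step for a reciprocal estimate
(e.g. as produced by fres) looks roughly like this sketch:

/* Refine an estimate x0 of 1/a; each step roughly doubles the number of
   accurate bits, so b/a can become b * xN after a couple of steps. */
double refine_recip(double a, double x0) {
  return x0 * (2.0 - a * x0);   /* x1 = x0 * (2 - a*x0) */
}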

===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240        ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy. We make this currently:

_f1:
        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31
        or r3, r2, r2
        blr

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction. The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two addr instr.

===-------------------------------------------------------------------------===

Compile offsets from allocas:

int *%test() {
        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1
        ret int* %Y
}

into a single add, not two:

_test:
        addi r2, r1, -8
        addi r3, r2, 4
        blr

--> important for C++.
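
For reference, the single-add version (assuming the same stack slot at -8(r1)
as above) would presumably be:

_test:
        addi r3, r1, -4
        blr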

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

have to go through an indirect stub if bar is external or linkonce. It would
be better to compile it as:

  fp = &bar;
  for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub). This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10. That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5. The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad. Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Compile this:

int foo(int a) {
  int b = (a < 8);
  if (b) {
    return b * 3;     // ignore the fact that this is always 3.
  } else {
    return 2;
  }
}

into something not this:

_foo:
1)      cmpwi cr7, r3, 8
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
1)      cmpwi cr0, r3, 7
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
LBB1_1: ; then
        rlwinm r2, r2, 0, 31, 31
        mulli r3, r2, 3
        blr
LBB1_2: ; UnifiedReturnBlock
        li r3, 2
        blr

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists. In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it. That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops. A simple example:

int foo(int a, int b) { return (a < b) << 4; }

compiles into:

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
        slwi r3, r2, 4
        blr
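
With the crbit access exposed, the extract and the shift could presumably fold
into a single rotate, e.g. (a sketch, not verified output):

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r3, r2, 1, 27, 27        ; extract the LT bit and shift left by 4
        blr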

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses, so that
this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

which we currently compile to:

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

becomes:

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {
  signed char t = 0;
  if (b)  t = *a;
  if (c)  *a = t;
}

===-------------------------------------------------------------------------===

This:
int test(unsigned *P) { return *P >> 24; }

Should compile to:

_test:
        lbz r3,0(r3)
        blr

not:

_test:
        lwz r2, 0(r3)
        srwi r3, r2, 24
        blr

===-------------------------------------------------------------------------===

On the G5, logical CR operations are more expensive in their three-address
form: ops that read/write the same register are half as expensive as those
that read from two registers that are different from their destination.

We should model this with two separate instructions. The isel should generate
the "two address" form of the instructions. When the register allocator
detects that it needs to insert a copy due to the two-addressness of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress. At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.
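
For concreteness, a sketch of the two forms on CR bits:

        crand 0, 0, 4           ; "two address" form: dest bit is also a source
        crand 0, 4, 8           ; "three address" form: dest differs from both sources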

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).

===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
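
A source-level sketch of that strength reduction (assuming X > 0), maintaining
t / X and t % X incrementally instead of recomputing them every iteration:

int foo(int N, int ***W, int **TK, int X) {
  int t, i, q = 0, r = 0;       /* q == t / X, r == t % X */

  for (t = 0; t < N; ++t) {
    for (i = 0; i < 4; ++i)
      W[q][i][r] = TK[i][t];
    if (++r == X) {             /* carry from remainder into quotient */
      r = 0;
      ++q;
    }
  }
  return 5;
}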

===-------------------------------------------------------------------------===

float foo(float X) { return (int)(X); }

Currently produces:

_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        extsw r2, r2
        std r2, -16(r1)
        lfd f0, -16(r1)
        fcfid f0, f0
        frsp f1, f0
        blr

We could use a target dag combine to turn the lwz/extsw into an lwa when the
lwz has a single use. Since LWA is cracked anyway, this would be a codesize
win only.
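
That is, the combined form would presumably be (a sketch):

_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwa r2, -4(r1)          ; lwz + extsw folded into one load
        std r2, -16(r1)
        lfd f0, -16(r1)
        fcfid f0, f0
        frsp f1, f0
        blr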

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}
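
A branch-free formulation of the same computation, roughly what we would like
the generated code to correspond to (a sketch):

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  code |= (unsigned)(dx < -dw) << 0;
  code |= (unsigned)(dx >  dw) << 1;
  code |= (unsigned)(dy < -dw) << 2;
  code |= (unsigned)(dy >  dw) << 3;
  code |= (unsigned)(dz < -dw) << 4;
  code |= (unsigned)(dz >  dw) << 5;
  *ret = code;
}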

===-------------------------------------------------------------------------===

Complete the "signed i32 to FP conversion using 64-bit registers"
transformation, good for PI. See PPCISelLowering.cpp, this comment:

  // FIXME: disable this lowered code.  This generates 64-bit register values,
  // and we don't model the fact that the top part is clobbered by calls.  We
  // need to flag these together so that the value isn't live across a call.
  //setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);

Also, if the registers are spilled to the stack, we have to ensure that all
64 bits of them are saved/restored, otherwise we will miscompile the code. It
sounds like we need to get the 64-bit register classes going.

===-------------------------------------------------------------------------===

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
entry:
        %tmp = bitcast %struct.B* %b to i32*            ; <uint*> [#uses=1]
        %tmp = load i32* %tmp                           ; <uint> [#uses=1]
        %tmp3 = bitcast %struct.B* %b to i32*           ; <uint*> [#uses=1]
        %tmp4 = load i32* %tmp3                         ; <uint> [#uses=1]
        %tmp8 = bitcast %struct.B* %b to i32*           ; <uint*> [#uses=2]
        %tmp9 = load i32* %tmp8                         ; <uint> [#uses=1]
        %tmp4.mask17 = shl i32 %tmp4, i8 1              ; <uint> [#uses=1]
        %tmp1415 = and i32 %tmp4.mask17, 2147483648     ; <uint> [#uses=1]
        %tmp.masked = and i32 %tmp, 2147483648          ; <uint> [#uses=1]
        %tmp11 = or i32 %tmp1415, %tmp.masked           ; <uint> [#uses=1]
        %tmp12 = and i32 %tmp9, 2147483647              ; <uint> [#uses=1]
        %tmp13 = or i32 %tmp12, %tmp11                  ; <uint> [#uses=1]
        store i32 %tmp13, i32* %tmp8
        ret void
}

We emit:

_foo:
        lwz r2, 0(r3)
        slwi r4, r2, 1
        or r4, r4, r2
        rlwimi r2, r4, 0, 0, 0
        stw r2, 0(r3)
        blr

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

_foo:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We compile:

unsigned test6(unsigned x) {
  return ((x & 0x00FF0000) >> 16) | ((x & 0x000000FF) << 16);
}

into:

_test6:
        lis r2, 255
        rlwinm r3, r3, 16, 0, 31
        ori r2, r2, 255
        and r3, r3, r2
        blr

GCC gets it down to:

_test6:
        rlwinm r0,r3,16,8,15
        rlwinm r3,r3,16,24,31
        or r3,r3,r0
        blr

===-------------------------------------------------------------------------===

Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR register.
This ends up producing code like this:

_foo:
.LBB_foo_0:     ; entry
        mflr r11
***     stw r11, 8(r1)
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
***     lwz r11, 8(r1)
        mtlr r11
        blr

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instrs): spilling it to a GPR is quite enough.

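For instance, keeping LR in a locally free GPR for the whole function would
look something like this sketch (just the two marked instructions deleted,
with r11 staying live instead):

_foo:
.LBB_foo_0:     ; entry
        mflr r11                ; LR lives in r11 until the final mtlr
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
        mtlr r11
        blr
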
Implementing this will require some codegen improvements. Nate writes:

"So basically what we need to support the "no stack frame save and restore" is a
generalization of the LR optimization to "callee-save regs".

Currently, we have LR marked as a callee-save reg. The register allocator sees
that it's callee save, and spills it directly to the stack.

Ideally, something like this would happen:

LR would be in a separate register class from the GPRs. The class of LR would be
marked "unspillable". When the register allocator came across an unspillable
reg, it would ask "what is the best class to copy this into that I *can* spill?"
If it gets a class back, which it will in this case (the gprs), it grabs a free
register of that class. If it is then later necessary to spill that reg, so be
it."

===-------------------------------------------------------------------------===

We compile this:
int test(_Bool X) {
  return X ? 524288 : 0;
}

to:
_test:
        cmplwi cr0, r3, 0
        lis r2, 8
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

instead of:
_test:
        addic r2,r3,-1
        subfe r0,r2,r3
        slwi r3,r0,19
        blr

This sort of thing occurs a lot due to globalopt.

===-------------------------------------------------------------------------===

We currently compile 32-bit bswap:

declare i32 @llvm.bswap.i32(i32 %A)
define i32 @test(i32 %A) {
        %B = call i32 @llvm.bswap.i32(i32 %A)
        ret i32 %B
}

to:

_test:
        rlwinm r2, r3, 24, 16, 23
        slwi r4, r3, 24
        rlwimi r2, r3, 8, 24, 31
        rlwimi r4, r3, 8, 8, 15
        rlwimi r4, r2, 0, 16, 31
        mr r3, r4
        blr

it would be more efficient to produce:

_foo:   mr r0,r3
        rlwinm r3,r3,8,0xffffffff
        rlwimi r3,r0,24,0,7
        rlwimi r3,r0,24,16,23
        blr

===-------------------------------------------------------------------------===

test/CodeGen/PowerPC/2007-03-24-cntlzd.ll compiles to:

__ZNK4llvm5APInt17countLeadingZerosEv:
        ld r2, 0(r3)
        cntlzd r2, r2
        or r2, r2, r2           <<-- silly.
        addi r3, r2, -64
        blr

The dead or is a 'truncate' from 64 to 32 bits.

===-------------------------------------------------------------------------===

We generate horrible ppc code for this:

#define N  2000000
double a[N], c[N];
void simpleloop() {
  int j;
  for (j = 0; j < N; j++)
    c[j] = a[j];
}

LBB1_1: ;bb
        lfdx f0, r3, r4
        addi r5, r5, 1          ;; Extra IV for the exit value compare.
        stfdx f0, r2, r4
        addi r4, r4, 8

        xoris r6, r5, 30        ;; This is due to a large immediate.
        cmplwi cr0, r6, 33920
        bne cr0, LBB1_1
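
A count-register version would avoid both the extra IV and the large-immediate
compare; roughly (a sketch, not verified output):

        lis r5, 30              ; (30 << 16) + 33920 == 2000000
        ori r5, r5, 33920
        mtctr r5
LBB1_1:
        lfdx f0, r3, r4
        stfdx f0, r2, r4
        addi r4, r4, 8
        bdnz LBB1_1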

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

__Z11no_overflowjj:
        add r4,r3,r4
        subfc r3,r3,r4
        li r3,0
        adde r3,r3,r3
        blr

(or better) not:

__Z11no_overflowjj:
        add r2, r4, r3
        cmplw cr7, r2, r3
        mfcr r2
        rlwinm r2, r2, 29, 31, 31
        xori r3, r2, 1
        blr

//===---------------------------------------------------------------------===//

We compile some FP comparisons into an mfcr with two rlwinms and an or. For
example:
#include <math.h>
int test(double x, double y) { return islessequal(x, y); }
int test2(double x, double y) { return islessgreater(x, y); }
int test3(double x, double y) { return !islessequal(x, y); }

Compiles into (all three are similar, but the bits differ):

_test:
        fcmpu cr7, f1, f2
        mfcr r2
        rlwinm r3, r2, 29, 31, 31
        rlwinm r2, r2, 31, 31, 31
        or r3, r2, r3
        blr

GCC compiles this into:

_test:
        fcmpu cr7,f1,f2
        cror 30,28,30
        mfcr r3
        rlwinm r3,r3,31,1
        blr

which is more efficient and can use mfocrf. See PR642 for some more context.

//===---------------------------------------------------------------------===//