//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform (see the sketch below)
* __builtin_return_address not supported on PPC
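
As a concrete illustration of the do-loop -> bdnz item (a hypothetical example,
not from the original notes): a loop whose trip count is known on entry should
run off the CTR register (mtctr, then bdnz) instead of keeping an explicit
induction variable and comparing it every iteration.

void saxpy(int n, float a, float *x, float *y) {
  /* The trip count n is loop-invariant, so codegen could emit mtctr/bdnz
     (guarded by an n > 0 check) with no per-iteration compare. */
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}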

===-------------------------------------------------------------------------===

Support 'update' load/store instructions.  These are cracked on the G5, but are
still a codesize win.

With preinc enabled, this:

long *%test4(long *%X, long *%dest) {
        %Y = getelementptr long* %X, int 4
        %A = load long* %Y
        store long %A, long* %dest
        ret long* %Y
}

compiles to:

_test4:
        mr r2, r3
        lwzu r5, 32(r2)
        lwz r3, 36(r3)
        stw r5, 0(r4)
        stw r3, 4(r4)
        mr r3, r2
        blr

with -sched=list-burr, I get:

_test4:
        lwz r2, 36(r3)
        lwzu r5, 32(r3)
        stw r2, 4(r4)
        stw r5, 0(r4)
        blr

===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        li r6, 0
        b LBB1_84       ;bb432.i
LBB1_83:                ;bb420.i
        lbzx r8, r5, r7
        addi r6, r7, 1
        stbx r8, r4, r7
LBB1_84:                ;bb432.i
        mr r7, r6
        cmplwi cr0, r7, 143
        bne cr0, LBB1_83        ;bb420.i

The CBE manages to produce:

        li r0, 143
        mtctr r0
loop:
        lbzx r2, r2, r11
        stbx r0, r2, r9
        addi r2, r2, 1
        bdz later
        b loop

This could be much better (bdnz instead of bdz) but it still beats us.  If we
produced this with bdnz, the loop would be a single dispatch group.

===-------------------------------------------------------------------------===

Compile:

void foo(int *P) {
  if (P) *P = 0;
}

into:

_foo:
        cmpwi cr0,r3,0
        beqlr cr0
        li r0,0
        stw r0,0(r3)
        blr

This is effectively a simple form of predication.

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

Here's another example (the sgn function):
double testf(double a) {
  return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

it produces a BB like this:
LBB1_1: ; cond_true
        lis r2, ha16(LCPI1_0)
        lfs f0, lo16(LCPI1_0)(r2)
        lis r2, ha16(LCPI1_1)
        lis r3, ha16(LCPI1_2)
        lfs f2, lo16(LCPI1_2)(r3)
        lfs f3, lo16(LCPI1_1)(r2)
        fsub f0, f0, f1
        fsel f1, f0, f2, f3
        blr

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).
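
A minimal sketch of the effect (a hand-written, hypothetical example, not the
output of any existing pass):

/* Before: each of a, b, c needs its own PIC/GOT address computation. */
static int a, b, c;
int sum_before(void) { return a + b + c; }

/* After squishing: one base address is materialized (and can be CSE'd);
   each field is a constant offset from it. */
static struct { int a, b, c; } g;
int sum_after(void) { return g.a + g.b + g.c; }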

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

Implement Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implementing divide as multiply by reciprocal when it has
more than one use.  Itanium will want this too.
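
A sketch of the refinement step itself (standard Newton-Raphson for the
reciprocal; the starting estimate r0 would come from a hardware estimate
instruction such as fres, and the step count here is an assumption):

/* One Newton-Raphson step for r ~= 1/d is r' = r * (2 - d * r); each step
   roughly doubles the number of correct bits, so a low-precision hardware
   estimate needs only a couple of steps for single precision. */
float refined_reciprocal(float d, float r0) {
  float r1 = r0 * (2.0f - d * r0);
  float r2 = r1 * (2.0f - d * r1);
  return r2;  /* a/d then becomes a * r2 when the divide has multiple uses */
}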

===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240        ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy.  We make this currently:

_f1:
        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31
        or r3, r2, r2
        blr

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction.  The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two-addr instr.

===-------------------------------------------------------------------------===

Compile offsets from allocas:

int *%test() {
        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1
        ret int* %Y
}

into a single add, not two:

_test:
        addi r2, r1, -8
        addi r3, r2, 4
        blr

--> important for C++.

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

  fp = &bar;
  for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int), you get r3, f1, r6, since the 64-bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.
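
A hypothetical before/after of the shuffle, with the register assignments from
the paragraph above as comments:

/* Before: X -> r3, Y -> f1 (shadowing r4-r5), Z -> r6. */
void foo(int X, double Y, int Z);

/* After internalizing and reordering: X -> r3, Z -> r4, Y -> f1 (now
   shadowing r5-r6); the GPRs fill up before any FP args shadow them. */
static void foo_shuffled(int X, int Z, double Y);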

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Compile this:

int foo(int a) {
  int b = (a < 8);
  if (b) {
    return b * 3;     // ignore the fact that this is always 3.
  } else {
    return 2;
  }
}

into something not this:

_foo:
1)      cmpwi cr7, r3, 8
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
1)      cmpwi cr0, r3, 7
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
LBB1_1: ; then
        rlwinm r2, r2, 0, 31, 31
        mulli r3, r2, 3
        blr
LBB1_2: ; UnifiedReturnBlock
        li r3, 2
        blr

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists.  In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it.  That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops.  A simple example:

int foo(int a, int b) { return (a < b) << 4; }

compiles into:

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
        slwi r3, r2, 4
        blr

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so that
this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

which currently compiles to:

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

becomes:

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {
  signed char t = 0;
  if (b) t = *a;
  if (c) *a = t;
}

===-------------------------------------------------------------------------===

This:
int test(unsigned *P) { return *P >> 24; }

Should compile to:

_test:
        lbz r3,0(r3)
        blr

not:

_test:
        lwz r2, 0(r3)
        srwi r3, r2, 24
        blr

(The single lbz works because the target is big-endian: the high byte of the
word is at offset 0.)

===-------------------------------------------------------------------------===

On the G5, logical CR operations are more expensive in their three
address form: ops that read/write the same register are half as expensive as
those that read from two registers that are different from their destination.

We should model this with two separate instructions.  The isel should generate
the "two address" form of the instructions.  When the register allocator
detects that it needs to insert a copy due to the two-addressness of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress.  At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer.
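
For reference (an explanatory aside, not part of the original note), GCC's
sequence is the classic branch-free abs idiom, which in C is:

int abs_diff(int a, int b) {
  int d = a - b;
  int t = d >> 31;      /* arithmetic shift: 0 if d >= 0, -1 if d < 0 */
  return (d ^ t) - t;   /* conditionally negates d without a branch */
}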

This theoretically may help improve twolf slightly (used in dimbox.c:142?).

===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
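
A hand-written sketch of that strength reduction (assuming X > 0, which the
original loop already needs for t / X and t % X to be meaningful):

int foo_reduced(int N, int ***W, int **TK, int X) {
  int q = 0, r = 0;                  /* q == t / X, r == t % X */
  for (int t = 0; t < N; ++t) {
    for (int i = 0; i < 4; ++i)
      W[q][i][r] = TK[i][t];
    if (++r == X) { r = 0; ++q; }    /* replaces the div and the rem */
  }
  return 5;
}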

===-------------------------------------------------------------------------===

float foo(float X) { return (int)(X); }

Currently produces:

_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        extsw r2, r2
        std r2, -16(r1)
        lfd f0, -16(r1)
        fcfid f0, f0
        frsp f1, f0
        blr

We could use a target dag combine to turn the lwz/extsw into an lwa when the
lwz has a single use.  Since LWA is cracked anyway, this would be a codesize
win only.

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}

===-------------------------------------------------------------------------===

Complete the signed i32 to FP conversion code using 64-bit registers
transformation, good for PI.  See PPCISelLowering.cpp, this comment:

     // FIXME: disable this lowered code.  This generates 64-bit register values,
     // and we don't model the fact that the top part is clobbered by calls.  We
     // need to flag these together so that the value isn't live across a call.
     //setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);

Also, if the registers are spilled to the stack, we have to ensure that all
64 bits of them are saved/restored, otherwise we will miscompile the code.  It
sounds like we need to get the 64-bit register classes going.

===-------------------------------------------------------------------------===

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
entry:
        %tmp = bitcast %struct.B* %b to i32*            ; <uint*> [#uses=1]
        %tmp = load i32* %tmp                           ; <uint> [#uses=1]
        %tmp3 = bitcast %struct.B* %b to i32*           ; <uint*> [#uses=1]
        %tmp4 = load i32* %tmp3                         ; <uint> [#uses=1]
        %tmp8 = bitcast %struct.B* %b to i32*           ; <uint*> [#uses=2]
        %tmp9 = load i32* %tmp8                         ; <uint> [#uses=1]
        %tmp4.mask17 = shl i32 %tmp4, i8 1              ; <uint> [#uses=1]
        %tmp1415 = and i32 %tmp4.mask17, 2147483648     ; <uint> [#uses=1]
        %tmp.masked = and i32 %tmp, 2147483648          ; <uint> [#uses=1]
        %tmp11 = or i32 %tmp1415, %tmp.masked           ; <uint> [#uses=1]
        %tmp12 = and i32 %tmp9, 2147483647              ; <uint> [#uses=1]
        %tmp13 = or i32 %tmp12, %tmp11                  ; <uint> [#uses=1]
        store i32 %tmp13, i32* %tmp8
        ret void
}

We emit:

_bar:
        lwz r2, 0(r3)
        slwi r4, r2, 1
        or r4, r4, r2
        rlwimi r2, r4, 0, 0, 0
        stw r2, 0(r3)
        blr

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

_bar:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We compile:

unsigned test6(unsigned x) {
  return ((x & 0x00FF0000) >> 16) | ((x & 0x000000FF) << 16);
}

into:

_test6:
        lis r2, 255
        rlwinm r3, r3, 16, 0, 31
        ori r2, r2, 255
        and r3, r3, r2
        blr

GCC gets it down to:

_test6:
        rlwinm r0,r3,16,8,15
        rlwinm r3,r3,16,24,31
        or r3,r3,r0
        blr

===-------------------------------------------------------------------------===

Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR register.
This ends up producing code like this:

_foo:
.LBB_foo_0:     ; entry
        mflr r11
***     stw r11, 8(r1)
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
***     lwz r11, 8(r1)
        mtlr r11
        blr

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instrs): spilling it to a GPR is quite enough.

Implementing this will require some codegen improvements.  Nate writes:

"So basically what we need to support the "no stack frame save and restore" is a
generalization of the LR optimization to "callee-save regs".

Currently, we have LR marked as a callee-save reg.  The register allocator sees
that it's callee save, and spills it directly to the stack.

Ideally, something like this would happen:

LR would be in a separate register class from the GPRs.  The class of LR would
be marked "unspillable".  When the register allocator came across an unspillable
reg, it would ask "what is the best class to copy this into that I *can* spill?"
If it gets a class back, which it will in this case (the gprs), it grabs a free
register of that class.  If it is then later necessary to spill that reg, so be
it."

===-------------------------------------------------------------------------===

We compile this:
int test(_Bool X) {
  return X ? 524288 : 0;
}

to:
_test:
        cmplwi cr0, r3, 0
        lis r2, 8
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

instead of:
_test:
        addic r2,r3,-1
        subfe r0,r2,r3
        slwi r3,r0,19
        blr

This sort of thing occurs a lot due to globalopt.
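
(An aside, not from the original note: since _Bool is already 0 or 1, the
branch-free form is just a shift at the C level, which is what the second
sequence computes.)

int test_branchless(_Bool X) {
  return (int)X << 19;   /* 524288 == 1 << 19 */
}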

===-------------------------------------------------------------------------===

We currently compile 32-bit bswap:

declare i32 @llvm.bswap.i32(i32 %A)
define i32 @test(i32 %A) {
        %B = call i32 @llvm.bswap.i32(i32 %A)
        ret i32 %B
}

to:

_test:
        rlwinm r2, r3, 24, 16, 23
        slwi r4, r3, 24
        rlwimi r2, r3, 8, 24, 31
        rlwimi r4, r3, 8, 8, 15
        rlwimi r4, r2, 0, 16, 31
        mr r3, r4
        blr

it would be more efficient to produce:

_test:  mr r0,r3
        rlwinm r3,r3,8,0xffffffff
        rlwimi r3,r0,24,0,7
        rlwimi r3,r0,24,16,23
        blr

===-------------------------------------------------------------------------===

test/CodeGen/PowerPC/2007-03-24-cntlzd.ll compiles to:

__ZNK4llvm5APInt17countLeadingZerosEv:
        ld r2, 0(r3)
        cntlzd r2, r2
        or r2, r2, r2     <<-- silly.
        addi r3, r2, -64
        blr

The dead or is a 'truncate' from 64- to 32-bits.

===-------------------------------------------------------------------------===

We generate horrible ppc code for this:

#define N  2000000
double   a[N],c[N];
void simpleloop() {
  int j;
  for (j=0; j<N; j++)
    c[j] = a[j];
}

LBB1_1: ;bb
        lfdx f0, r3, r4
        addi r5, r5, 1        ;; Extra IV for the exit value compare.
        stfdx f0, r2, r4
        addi r4, r4, 8

        xoris r6, r5, 30      ;; This is due to a large immediate.
        cmplwi cr0, r6, 33920
        bne cr0, LBB1_1

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

__Z11no_overflowjj:
        add r4,r3,r4
        subfc r3,r3,r4
        li r3,0
        adde r3,r3,r3
        blr

(or better) not:

__Z11no_overflowjj:
        add r2, r4, r3
        cmplw cr7, r2, r3
        mfcr r2
        rlwinm r2, r2, 29, 31, 31
        xori r3, r2, 1
        blr
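
(An explanatory aside, not part of the original note: at the source level the
check reduces to a single unsigned compare, which the desired sequence computes
exactly via the carry bit.)

int no_overflow_simple(unsigned a, unsigned b) {
  return a + b >= a;   /* a + b wraps around (carry out) iff a + b < a */
}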

//===---------------------------------------------------------------------===//