//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform
* lmw/stmw pass a la arm load store optimizer for prolog/epilog

===-------------------------------------------------------------------------===

On PPC64, this:

long f2 (long x) { return 0xfffffff000000000UL; }
long f3 (long x) { return 0x1ffffffffUL; }

could compile into:

_f2:
        li r3,-1
        rldicr r3,r3,0,27
        blr
_f3:
        li r3,-1
        rldicl r3,r3,0,31
        blr

we produce:

_f2:
        lis r2, 4095
        ori r2, r2, 65535
        sldi r3, r2, 36
        blr
_f3:
        li r2, 1
        sldi r2, r2, 32
        oris r2, r2, 65535
        ori r3, r2, 65535
        blr
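
The shorter sequences work because li r3,-1 yields all ones and a single
rotate-and-clear then masks off the unwanted bits. A minimal C sketch of that
arithmetic follows, assuming the usual rldicr/rldicl semantics with IBM bit
numbering (bit 0 is the most significant bit). The helper names are
hypothetical and only model the instructions.

```c
#include <stdint.h>

/* Rotate a 64-bit value left by n bits (n taken modulo 64). */
static uint64_t rotl64(uint64_t x, unsigned n) {
  n &= 63;
  return n ? (x << n) | (x >> (64 - n)) : x;
}

/* rldicr: rotate left, then keep IBM bits 0..me (the high me+1 bits). */
static uint64_t rldicr(uint64_t rs, unsigned sh, unsigned me) {
  return rotl64(rs, sh) & (~UINT64_C(0) << (63 - me));
}

/* rldicl: rotate left, then keep IBM bits mb..63 (the low 64-mb bits). */
static uint64_t rldicl(uint64_t rs, unsigned sh, unsigned mb) {
  return rotl64(rs, sh) & (~UINT64_C(0) >> mb);
}

/* li r3,-1 ; rldicr r3,r3,0,27 */
uint64_t f2_value(void) { return rldicr(~UINT64_C(0), 0, 27); }

/* li r3,-1 ; rldicl r3,r3,0,31 */
uint64_t f3_value(void) { return rldicl(~UINT64_C(0), 0, 31); }
```

Both helpers return exactly the constants that f2 and f3 return, in two
instructions each instead of three or four.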


===-------------------------------------------------------------------------===

Support 'update' load/store instructions.  These are cracked on the G5, but are
still a codesize win.

With preinc enabled, this:

long *%test4(long *%X, long *%dest) {
        %Y = getelementptr long* %X, int 4
        %A = load long* %Y
        store long %A, long* %dest
        ret long* %Y
}

compiles to:

_test4:
        mr r2, r3
        lwzu r5, 32(r2)
        lwz r3, 36(r3)
        stw r5, 0(r4)
        stw r3, 4(r4)
        mr r3, r2
        blr

with -sched=list-burr, I get:

_test4:
        lwz r2, 36(r3)
        lwzu r5, 32(r3)
        stw r2, 4(r4)
        stw r5, 0(r4)
        blr

===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        li r6, 0
        b LBB1_84 ;bb432.i
LBB1_83: ;bb420.i
        lbzx r8, r5, r7
        addi r6, r7, 1
        stbx r8, r4, r7
LBB1_84: ;bb432.i
        mr r7, r6
        cmplwi cr0, r7, 143
        bne cr0, LBB1_83 ;bb420.i

The CBE manages to produce:

        li r0, 143
        mtctr r0
loop:
        lbzx r2, r2, r11
        stbx r0, r2, r9
        addi r2, r2, 1
        bdz later
        b loop

This could be much better (bdnz instead of bdz) but it still beats us.  If we
produced this with bdnz, the loop would be a single dispatch group.
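
For reference, the source-level shape that maps onto mtctr/bdnz is a
down-counting do-loop whose only exit test is "counter reached zero". The
sketch below is an illustration under that assumption, not the viterbi code;
copy_countdown is a hypothetical name.

```c
#include <stddef.h>

/* Copy n bytes using a single down-counting trip counter, so that no
   separate induction-variable compare is needed in the loop. */
void copy_countdown(unsigned char *dst, const unsigned char *src, size_t n) {
  if (n == 0)
    return;              /* the do-loop body runs at least once, so guard n == 0 */
  size_t ctr = n;        /* mtctr */
  do {
    *dst++ = *src++;     /* load and store one byte, bumping both pointers */
  } while (--ctr != 0);  /* bdnz: decrement the counter, branch if non-zero */
}
```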

===-------------------------------------------------------------------------===

Compile:

void foo(int *P) {
  if (P) *P = 0;
}

into:

_foo:
        cmpwi cr0,r3,0
        beqlr cr0
        li r0,0
        stw r0,0(r3)
        blr

This is effectively a simple form of predication.

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.
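
At the source level the idea looks roughly like the sketch below: the four
constants live in one object, so a single base address is formed and every
constant is a fixed offset from it. X_pool and X_pooled are hypothetical
names, and a C compiler is of course free to lay this out differently.

```c
/* All four constants live in one object, so one base address suffices. */
static const double X_pool[4] = { 1.23, 4.512, 2.34, 14.38 };

double X_pooled(double Y) {
  const double *cp = X_pool;  /* materialize the pool base once */
  /* each constant is then a fixed offset (0, 8, 16, 24 bytes) from cp */
  return (Y * cp[0] + cp[1]) * cp[2] + cp[3];
}
```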

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

Here's another example (the sgn function):
double testf(double a) {
  return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

it produces a BB like this:
LBB1_1: ; cond_true
        lis r2, ha16(LCPI1_0)
        lfs f0, lo16(LCPI1_0)(r2)
        lis r2, ha16(LCPI1_1)
        lis r3, ha16(LCPI1_2)
        lfs f2, lo16(LCPI1_2)(r3)
        lfs f3, lo16(LCPI1_1)(r2)
        fsub f0, f0, f1
        fsel f1, f0, f2, f3
        blr

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implement divide as multiply by reciprocal when it has
more than one use.  Itanium would want this too.
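
A minimal sketch of the refinement step, assuming the standard Newton-Raphson
iteration for a reciprocal, x' = x * (2 - d * x). The caller-supplied estimate
stands in for the result of a hardware estimate instruction. The function
names and the fixed step count are assumptions for illustration.

```c
/* Refine an estimate x of 1/d with x' = x * (2 - d * x).
   Each step roughly squares the relative error of the estimate. */
double refine_recip(double d, double estimate, int steps) {
  double x = estimate;
  for (int i = 0; i < steps; ++i)
    x = x * (2.0 - d * x);
  return x;
}

/* Divide several values by the same d using one refined reciprocal. */
void div_by_common(double *vals, int n, double d, double estimate) {
  double r = refine_recip(d, estimate, 4);
  for (int i = 0; i < n; ++i)
    vals[i] *= r;
}
```

Multiplying by a reciprocal is not guaranteed to round the same way as a true
divide, so a transform like this is only appropriate where that difference is
acceptable.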

===-------------------------------------------------------------------------===

Compile offsets from allocas:

int *%test() {
        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1
        ret int* %Y
}

into a single add, not two:

_test:
        addi r2, r1, -8
        addi r3, r2, 4
        blr

--> important for C++.

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub removal:

We still generate calls to foo$stub, and stubs, on Darwin.  This is not
necessary when building with the Leopard (10.5) or later linker, as stubs are
generated by ld when necessary.  Parameterizing this based on the deployment
target (-mmacosx-version-min) is probably enough.  x86-32 does this right, see
its logic.

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...) bar();

have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

  fp = &bar;
  for (...) fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.
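
Written out as compilable C, the hoisted form looks like the sketch below. The
bar here is only a stand-in that counts its calls, and run_hoisted is a
hypothetical name. A C compiler may well turn this back into direct calls,
which is why the note says the real transformation belongs in the code
generator.

```c
static int bar_calls = 0;

/* Stand-in for an external function that would be reached through a stub. */
void bar(void) { ++bar_calls; }

/* Call bar n times through a pointer computed once, outside the loop. */
int run_hoisted(int n) {
  void (*fp)(void) = &bar;   /* fp = &bar, computed a single time */
  for (int i = 0; i < n; ++i)
    fp();
  return bar_calls;
}
```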

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

the Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Compile this:

int foo(int a) {
  int b = (a < 8);
  if (b) {
    return b * 3;     // ignore the fact that this is always 3.
  } else {
    return 2;
  }
}

into something not this:

_foo:
1)      cmpwi cr7, r3, 8
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
1)      cmpwi cr0, r3, 7
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
LBB1_1: ; then
        rlwinm r2, r2, 0, 31, 31
        mulli r3, r2, 3
        blr
LBB1_2: ; UnifiedReturnBlock
        li r3, 2
        blr

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists.  In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it.  That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops.  A simple example:

int foo(int a, int b) { return (a < b) << 4; }

compiles into:

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
        slwi r3, r2, 4
        blr

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses, so that
this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

which currently compiles to:

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

becomes:

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {
  signed char t = 0;
  if (b) t = *a;
  if (c) *a = t;
}

===-------------------------------------------------------------------------===

This:
int test(unsigned *P) { return *P >> 24; }

Should compile to:

_test:
        lbz r3,0(r3)
        blr

not:

_test:
        lwz r2, 0(r3)
        srwi r3, r2, 24
        blr
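
The two sequences agree because PowerPC here is big-endian: the most
significant byte of the word is the byte at offset 0. A small C sketch of that
equivalence, with the big-endian word load spelled out byte by byte so it
behaves the same on any host (helper names are hypothetical):

```c
#include <stdint.h>

/* Assemble a 32-bit value from memory in big-endian order, as lwz would. */
static uint32_t load_be32(const unsigned char *p) {
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* lwz r2, 0(r3) ; srwi r3, r2, 24 */
uint32_t top_byte_via_word(const unsigned char *p) {
  return load_be32(p) >> 24;
}

/* lbz r3, 0(r3) */
uint32_t top_byte_via_byte(const unsigned char *p) {
  return p[0];
}
```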

===-------------------------------------------------------------------------===

On the G5, logical CR operations are more expensive in their three
address form: ops that read/write the same register are half as expensive as
those that read from two registers that are different from their destination.

We should model this with two separate instructions.  The isel should generate
the "two address" form of the instructions.  When the register allocator
detects that it needs to insert a copy due to the two-address form of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress.  At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).
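
The GCC sequence is the classic branchless absolute value: build a mask that
is all ones when the difference is negative, then compute (d ^ mask) - mask.
A sketch in C, using unsigned arithmetic so the wraparound is well defined
(abs_diff_branchless is a hypothetical name):

```c
#include <stdint.h>

int32_t abs_diff_branchless(int32_t a, int32_t b) {
  uint32_t d = (uint32_t)a - (uint32_t)b;        /* subf: d = a - b */
  uint32_t mask = (d >> 31) ? 0xFFFFFFFFu : 0u;  /* srawi d,31: sign smeared */
  /* xor then subtract: negates d when mask is all ones, else leaves it */
  return (int32_t)((d ^ mask) - mask);
}
```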

===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
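
One simple way to strength reduce t / X and t % X inside a loop over t is to
carry the quotient and remainder along as extra induction variables. The
sketch below assumes X > 0 and t counting up from 0 by 1. It is an
illustration of the idea, not necessarily the scheme in the paper linked
above, and divrem_incremental is a hypothetical name.

```c
/* Fill q[t] = t / X and r[t] = t % X for t in [0, N) without dividing.
   Assumes X > 0. */
void divrem_incremental(int N, int X, int *q, int *r) {
  int quot = 0, rem = 0;
  for (int t = 0; t < N; ++t) {
    q[t] = quot;
    r[t] = rem;
    if (++rem == X) {  /* remainder wrapped: bump the quotient */
      rem = 0;
      ++quot;
    }
  }
}
```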

===-------------------------------------------------------------------------===

float foo(float X) { return (int)(X); }

Currently produces:

_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        extsw r2, r2
        std r2, -16(r1)
        lfd f0, -16(r1)
        fcfid f0, f0
        frsp f1, f0
        blr

We could use a target dag combine to turn the lwz/extsw into an lwa when the
lwz has a single use.  Since LWA is cracked anyway, this would be a codesize
win only.

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}

===-------------------------------------------------------------------------===

Complete the signed i32 to FP conversion code using 64-bit registers
transformation, good for PI.  See PPCISelLowering.cpp, this comment:

  // FIXME: disable this lowered code.  This generates 64-bit register values,
  // and we don't model the fact that the top part is clobbered by calls.  We
  // need to flag these together so that the value isn't live across a call.
  //setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);

Also, if the registers are spilled to the stack, we have to ensure that all
64-bits of them are save/restored, otherwise we will miscompile the code.  It
sounds like we need to get the 64-bit register classes going.

===-------------------------------------------------------------------------===

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
entry:
        %tmp = bitcast %struct.B* %b to i32* ; <uint*> [#uses=1]
        %tmp = load i32* %tmp ; <uint> [#uses=1]
        %tmp3 = bitcast %struct.B* %b to i32* ; <uint*> [#uses=1]
        %tmp4 = load i32* %tmp3 ; <uint> [#uses=1]
        %tmp8 = bitcast %struct.B* %b to i32* ; <uint*> [#uses=2]
        %tmp9 = load i32* %tmp8 ; <uint> [#uses=1]
        %tmp4.mask17 = shl i32 %tmp4, i8 1 ; <uint> [#uses=1]
        %tmp1415 = and i32 %tmp4.mask17, 2147483648 ; <uint> [#uses=1]
        %tmp.masked = and i32 %tmp, 2147483648 ; <uint> [#uses=1]
        %tmp11 = or i32 %tmp1415, %tmp.masked ; <uint> [#uses=1]
        %tmp12 = and i32 %tmp9, 2147483647 ; <uint> [#uses=1]
        %tmp13 = or i32 %tmp12, %tmp11 ; <uint> [#uses=1]
        store i32 %tmp13, i32* %tmp8
        ret void
}

We emit:

_foo:
        lwz r2, 0(r3)
        slwi r4, r2, 1
        or r4, r4, r2
        rlwimi r2, r4, 0, 0, 0
        stw r2, 0(r3)
        blr

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

_foo:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We compile:

unsigned test6(unsigned x) {
  return ((x & 0x00FF0000) >> 16) | ((x & 0x000000FF) << 16);
}

into:

_test6:
        lis r2, 255
        rlwinm r3, r3, 16, 0, 31
        ori r2, r2, 255
        and r3, r3, r2
        blr

GCC gets it down to:

_test6:
        rlwinm r0,r3,16,8,15
        rlwinm r3,r3,16,24,31
        or r3,r3,r0
        blr
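
To check that the GCC sequence really computes test6, rlwinm can be modelled
in C as a rotate left followed by an AND with a mask of ones from bit MB
through bit ME (IBM numbering, bit 0 is the most significant). This is a
sketch under that assumption; rotl32, ppc_mask and rlwinm below are
hypothetical helpers, not LLVM APIs.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n) {
  n &= 31;
  return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Mask with ones in IBM bits mb..me (bit 0 = MSB), wrapping if mb > me. */
static uint32_t ppc_mask(unsigned mb, unsigned me) {
  uint32_t from_mb = 0xFFFFFFFFu >> mb;
  uint32_t to_me = 0xFFFFFFFFu << (31 - me);
  return mb <= me ? (from_mb & to_me) : (from_mb | to_me);
}

/* rlwinm: rotate left by sh, then AND with the mb..me mask. */
static uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me) {
  return rotl32(rs, sh) & ppc_mask(mb, me);
}

/* The original C expression. */
uint32_t test6_ref(uint32_t x) {
  return ((x & 0x00FF0000u) >> 16) | ((x & 0x000000FFu) << 16);
}

/* The three-instruction GCC sequence. */
uint32_t test6_rlwinm(uint32_t x) {
  uint32_t r0 = rlwinm(x, 16, 8, 15);   /* (x & 0xFF) << 16 */
  uint32_t r3 = rlwinm(x, 16, 24, 31);  /* (x >> 16) & 0xFF */
  return r3 | r0;
}
```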


===-------------------------------------------------------------------------===

Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR register.
This ends up producing code like this:

_foo:
.LBB_foo_0: ; entry
        mflr r11
***     stw r11, 8(r1)
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
***     lwz r11, 8(r1)
        mtlr r11
        blr

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instrs): spilling it to a GPR is quite enough.

Implementing this will require some codegen improvements.  Nate writes:

"So basically what we need to support the "no stack frame save and restore" is a
generalization of the LR optimization to "callee-save regs".

Currently, we have LR marked as a callee-save reg.  The register allocator sees
that it's callee save, and spills it directly to the stack.

Ideally, something like this would happen:

LR would be in a separate register class from the GPRs. The class of LR would be
marked "unspillable".  When the register allocator came across an unspillable
reg, it would ask "what is the best class to copy this into that I can spill"
If it gets a class back, which it will in this case (the gprs), it grabs a free
register of that class.  If it is then later necessary to spill that reg, so be
it."

===-------------------------------------------------------------------------===

We compile this:
int test(_Bool X) {
  return X ? 524288 : 0;
}

to:
_test:
        cmplwi cr0, r3, 0
        lis r2, 8
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

instead of:
_test:
        addic r2,r3,-1
        subfe r0,r2,r3
        slwi r3,r0,19
        blr
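
The addic/subfe pair turns "X != 0" into a 0/1 value through the carry bit,
with no branch. A C model of the three instructions, assuming the usual
semantics (addic sets CA to the carry out of the add, subfe computes
~RA + RB + CA); select_524288 is a hypothetical name:

```c
#include <stdint.h>

uint32_t select_524288(uint32_t x) {
  /* addic r2,r3,-1: r2 = x + 0xFFFFFFFF, CA = carry out (set iff x != 0) */
  uint64_t wide = (uint64_t)x + 0xFFFFFFFFu;
  uint32_t r2 = (uint32_t)wide;
  uint32_t ca = (uint32_t)(wide >> 32);
  /* subfe r0,r2,r3: r0 = ~r2 + x + CA, which simplifies to CA */
  uint32_t r0 = ~r2 + x + ca;
  /* slwi r3,r0,19: 1 << 19 == 524288 */
  return r0 << 19;
}
```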

This sort of thing occurs a lot due to globalopt.

===-------------------------------------------------------------------------===

We compile:

define i32 @bar(i32 %x) nounwind readnone ssp {
entry:
        %0 = icmp eq i32 %x, 0 ; <i1> [#uses=1]
        %neg = sext i1 %0 to i32 ; <i32> [#uses=1]
        ret i32 %neg
}

to:

_bar:
        cntlzw r2, r3
        slwi r2, r2, 26
        srawi r3, r2, 31
        blr

it would be better to produce:

_bar:
        addic r3,r3,-1
        subfe r3,r3,r3
        blr

===-------------------------------------------------------------------------===

We currently compile 32-bit bswap:

declare i32 @llvm.bswap.i32(i32 %A)
define i32 @test(i32 %A) {
        %B = call i32 @llvm.bswap.i32(i32 %A)
        ret i32 %B
}

to:

_test:
        rlwinm r2, r3, 24, 16, 23
        slwi r4, r3, 24
        rlwimi r2, r3, 8, 24, 31
        rlwimi r4, r3, 8, 8, 15
        rlwimi r4, r2, 0, 16, 31
        mr r3, r4
        blr

it would be more efficient to produce:

_foo:   mr r0,r3
        rlwinm r3,r3,8,0xffffffff
        rlwimi r3,r0,24,0,7
        rlwimi r3,r0,24,16,23
        blr
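
The shorter sequence is one rotate by 8 plus two byte inserts taken from a
rotate by 24. A C sketch of the same data flow, modelling rlwimi as "replace
the masked bits of the destination with the rotated source" (the function
names are hypothetical):

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n) {
  n &= 31;
  return n ? (x << n) | (x >> (32 - n)) : x;
}

uint32_t bswap32_rot(uint32_t x) {
  uint32_t r = rotl32(x, 8);    /* rlwinm r3,r3,8,0xffffffff */
  uint32_t t = rotl32(x, 24);   /* rotated source used by both inserts */
  /* rlwimi r3,r0,24,0,7: insert IBM bits 0..7 (mask 0xFF000000) */
  r = (r & ~0xFF000000u) | (t & 0xFF000000u);
  /* rlwimi r3,r0,24,16,23: insert IBM bits 16..23 (mask 0x0000FF00) */
  r = (r & ~0x0000FF00u) | (t & 0x0000FF00u);
  return r;
}
```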

===-------------------------------------------------------------------------===

test/CodeGen/PowerPC/2007-03-24-cntlzd.ll compiles to:

__ZNK4llvm5APInt17countLeadingZerosEv:
        ld r2, 0(r3)
        cntlzd r2, r2
        or r2, r2, r2     <<-- silly.
        addi r3, r2, -64
        blr

The dead or is a 'truncate' from 64- to 32-bits.

===-------------------------------------------------------------------------===

We generate horrible ppc code for this:

#define N  2000000
double   a[N],c[N];
void simpleloop() {
  int j;
  for (j=0; j<N; j++)
    c[j] = a[j];
}

LBB1_1: ;bb
        lfdx f0, r3, r4
        addi r5, r5, 1 ;; Extra IV for the exit value compare.
        stfdx f0, r2, r4
        addi r4, r4, 8

        xoris r6, r5, 30 ;; This is due to a large immediate.
        cmplwi cr0, r6, 33920
        bne cr0, LBB1_1
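
The xoris/cmplwi pair exists because 2000000 is 0x1E8480, which does not fit
in a 16-bit compare immediate: the high half is 30 and the low half is 33920.
XORing the counter with 30 << 16 clears the high half exactly when it matches,
so one 16-bit compare finishes the test. A C sketch of that split
(is_loop_end is a hypothetical name):

```c
#include <stdint.h>

/* Equality test against 2000000, split the way the loop above does it. */
int is_loop_end(uint32_t j) {
  uint32_t t = j ^ (30u << 16);  /* xoris r6, r5, 30 */
  return t == 33920u;            /* cmplwi cr0, r6, 33920 */
}
```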

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

__Z11no_overflowjj:
        add r4,r3,r4
        subfc r3,r3,r4
        li r3,0
        adde r3,r3,r3
        blr

(or better) not:

__Z11no_overflowjj:
        add r2, r4, r3
        cmplw cr7, r2, r3
        mfcr r2
        rlwinm r2, r2, 29, 31, 31
        xori r3, r2, 1
        blr

//===---------------------------------------------------------------------===//

We compile some FP comparisons into an mfcr with two rlwinms and an or.  For
example:
#include <math.h>
int test(double x, double y) { return islessequal(x, y);}
int test2(double x, double y) {  return islessgreater(x, y);}
int test3(double x, double y) {  return !islessequal(x, y);}

Compiles into (all three are similar, but the bits differ):

_test:
        fcmpu cr7, f1, f2
        mfcr r2
        rlwinm r3, r2, 29, 31, 31
        rlwinm r2, r2, 31, 31, 31
        or r3, r2, r3
        blr

GCC compiles this into:

_test:
        fcmpu cr7,f1,f2
        cror 30,28,30
        mfcr r3
        rlwinm r3,r3,31,1
        blr

which is more efficient and can use mfocr.  See PR642 for some more context.

//===---------------------------------------------------------------------===//

void foo(float *data, float d) {
   long i;
   for (i = 0; i < 8000; i++)
      data[i] = d;
}
void foo2(float *data, float d) {
   long i;
   data--;
   for (i = 0; i < 8000; i++) {
      data[1] = d;
      data++;
   }
}

These compile to:

_foo:
        li r2, 0
LBB1_1: ; bb
        addi r4, r2, 4
        stfsx f1, r3, r2
        cmplwi cr0, r4, 32000
        mr r2, r4
        bne cr0, LBB1_1 ; bb
        blr
_foo2:
        li r2, 0
LBB2_1: ; bb
        addi r4, r2, 4
        stfsx f1, r3, r2
        cmplwi cr0, r4, 32000
        mr r2, r4
        bne cr0, LBB2_1 ; bb
        blr

The 'mr' could be eliminated by folding the add into the cmp better.

//===---------------------------------------------------------------------===//

Codegen for the following (low-probability) case deteriorated considerably
when the correctness fixes for unordered comparisons went in (PR 642, 58871).
It should be possible to recover the code quality described in the comments.

; RUN: llvm-as < %s | llc -march=ppc32 | grep or | count 3
; This should produce one 'or' or 'cror' instruction per function.

; RUN: llvm-as < %s | llc -march=ppc32 | grep mfcr | count 3
; PR2964

define i32 @test(double %x, double %y) nounwind {
entry:
        %tmp3 = fcmp ole double %x, %y ; <i1> [#uses=1]
        %tmp345 = zext i1 %tmp3 to i32 ; <i32> [#uses=1]
        ret i32 %tmp345
}

define i32 @test2(double %x, double %y) nounwind {
entry:
        %tmp3 = fcmp one double %x, %y ; <i1> [#uses=1]
        %tmp345 = zext i1 %tmp3 to i32 ; <i32> [#uses=1]
        ret i32 %tmp345
}

define i32 @test3(double %x, double %y) nounwind {
entry:
        %tmp3 = fcmp ugt double %x, %y ; <i1> [#uses=1]
        %tmp34 = zext i1 %tmp3 to i32 ; <i32> [#uses=1]
        ret i32 %tmp34
}
//===----------------------------------------------------------------------===//
; RUN: llvm-as < %s | llc -march=ppc32 | not grep fneg

; This could generate FSEL with appropriate flags (FSEL is not IEEE-safe, and
; should not be generated except with -enable-finite-only-fp-math or the like).
; With the correctness fixes for PR642 (58871) LowerSELECT_CC would need to
; recognize a more elaborate tree than a simple SETxx.

define double @test_FNEG_sel(double %A, double %B, double %C) {
        %D = sub double -0.000000e+00, %A ; <double> [#uses=1]
        %Cond = fcmp ugt double %D, -0.000000e+00 ; <i1> [#uses=1]
        %E = select i1 %Cond, double %B, double %C ; <double> [#uses=1]
        ret double %E
}
