//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform

===-------------------------------------------------------------------------===

Support 'update' load/store instructions.  These are cracked on the G5, but are
still a codesize win.

With preinc enabled, this:

long *%test4(long *%X, long *%dest) {
        %Y = getelementptr long* %X, int 4
        %A = load long* %Y
        store long %A, long* %dest
        ret long* %Y
}

compiles to:

_test4:
        mr r2, r3
        lwzu r5, 32(r2)
        lwz r3, 36(r3)
        stw r5, 0(r4)
        stw r3, 4(r4)
        mr r3, r2
        blr

with -sched=list-burr, I get:

_test4:
        lwz r2, 36(r3)
        lwzu r5, 32(r3)
        stw r2, 4(r4)
        stw r5, 0(r4)
        blr

===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        li r6, 0
        b LBB1_84       ;bb432.i
LBB1_83:        ;bb420.i
        lbzx r8, r5, r7
        addi r6, r7, 1
        stbx r8, r4, r7
LBB1_84:        ;bb432.i
        mr r7, r6
        cmplwi cr0, r7, 143
        bne cr0, LBB1_83        ;bb420.i

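Judging from the indexing and the 143-iteration trip count, the source loop is
presumably just a small byte copy, i.e. something like this sketch (function
and variable names invented):

void copy143(unsigned char *dst, const unsigned char *src) {
  int t;
  for (t = 0; t != 143; ++t)   /* the 143 comes from the cmplwi above */
    dst[t] = src[t];
}
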
The CBE manages to produce:

        li r0, 143
        mtctr r0
loop:
        lbzx r2, r2, r11
        stbx r0, r2, r9
        addi r2, r2, 1
        bdz later
        b loop

This could be much better (bdnz instead of bdz) but it still beats us.  If we
produced this with bdnz, the loop would be a single dispatch group.

===-------------------------------------------------------------------------===

Compile:

void foo(int *P) {
  if (P)  *P = 0;
}

into:

_foo:
        cmpwi cr0,r3,0
        beqlr cr0
        li r0,0
        stw r0,0(r3)
        blr

This is effectively a simple form of predication.

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

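In C terms, the idea is roughly this sketch (the compiler would do the
equivalent on the constant pool itself, not on the source):

double X2(double Y) {
  /* one pool object, addressed once */
  static const double CP[4] = { 1.23, 4.512, 2.34, 14.38 };
  const double *p = CP;                /* single address materialization */
  return (Y*p[0] + p[1])*p[2] + p[3];  /* the rest are immediate offsets */
}
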
===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
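
For example (a sketch of the transformation, not code from the note):

static int a, b, c;                /* three separate PIC address computations */

would become

static struct { int a, b, c; } g;  /* accesses become g.a, g.b, g.c */

so the single address of g can be materialized once and CSE'd across all three
accesses.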

===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implement divide as multiply by reciprocal when it has
more than one use.  Itanium will want this too.
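
A minimal sketch of the refinement math in C, assuming a reciprocal estimate
instruction (fres_estimate below stands in for the PPC fres instruction; the
name is illustrative, not a real API):

extern float fres_estimate(float);  /* stand-in for fres, ~5 bits correct */

float recip(float d) {
  float x = fres_estimate(d);
  x = x * (2.0f - d * x);   /* Newton-Raphson step: ~10 bits */
  x = x * (2.0f - d * x);   /* ~20 bits */
  x = x * (2.0f - d * x);   /* ~24 bits: full single precision */
  return x;
}

Each iteration roughly doubles the number of correct bits, so a few
multiply-adds replace a full divide when the reciprocal has multiple uses.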

===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240        ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy.  We make this currently:

_f1:
        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31
        or r3, r2, r2
        blr

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction.  The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two addr instr.

===-------------------------------------------------------------------------===

Compile offsets from allocas:

int *%test() {
        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1
        ret int* %Y
}

into a single add, not two:

_test:
        addi r2, r1, -8
        addi r3, r2, 4
        blr

--> important for C++.

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

Have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

  fp = &bar;
  for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

the Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10. That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5. The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

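A sketch of that shuffle on the example signature (register assignments follow
the ABI rules just described):

/* before: the double burns the bytes for r4/r5, so Z lands in r6 */
void foo (int X, double Y, int Z);   /* X->r3, Y->f1, Z->r6 */
/* after (only legal for functions we can internalize): */
void foo2(int X, int Z, double Y);   /* X->r3, Z->r4, Y->f1 */
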
===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

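An illustrative case (assumed from the ABI description above, not taken from
the linked thread):

struct S { double d; };
void f(struct S s);   /* s.d arrives split across r3/r4, not in an FPR */
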
===-------------------------------------------------------------------------===

Compile this:

int foo(int a) {
  int b = (a < 8);
  if (b) {
    return b * 3;     // ignore the fact that this is always 3.
  } else {
    return 2;
  }
}

into something not this:

_foo:
1)      cmpwi cr7, r3, 8
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
1)      cmpwi cr0, r3, 7
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
LBB1_1: ; then
        rlwinm r2, r2, 0, 31, 31
        mulli r3, r2, 3
        blr
LBB1_2: ; UnifiedReturnBlock
        li r3, 2
        blr

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists.  In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

The legalizer should lower this:

bool %test(ulong %x) {
  %tmp = setlt ulong %x, 4294967296
  ret bool %tmp
}

into "if x.high == 0", not:

_test:
        cntlzw r2, r3
        xori r3, r3, 1
        cmplwi cr0, r3, 0
        srwi r2, r2, 5
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

noticed in 2005-05-11-Popcount-ffs-fls.c.

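In C terms, the desired lowering is just a compare of the high word (a sketch):

int test(unsigned long long x) {
  return (unsigned)(x >> 32) == 0;   /* x < 2^32  <=>  x.high == 0 */
}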

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it.  That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops.  A simple example:

int foo(int a, int b) { return (a < b) << 4; }

compiles into:

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
        slwi r3, r2, 4
        blr

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

So that

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

Becomes

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {
  signed char t = 0;
  if (b) t = *a;
  if (c) *a = t;
}

===-------------------------------------------------------------------------===

This:
int test(unsigned *P) { return *P >> 24; }

Should compile to:

_test:
        lbz r3,0(r3)
        blr

not:

_test:
        lwz r2, 0(r3)
        srwi r3, r2, 24
        blr

===-------------------------------------------------------------------------===

On the G5, logical CR operations are more expensive in their three
address form: ops that read/write the same register are half as expensive as
those that read from two registers that are different from their destination.

We should model this with two separate instructions.  The isel should generate
the "two address" form of the instructions.  When the register allocator
detects that it needs to insert a copy due to the two-addressness of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress.  At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).

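The GCC sequence is the classic branchless abs(a-b) identity; in C:

void g2(int a, int b, int *P) {
  int d = a - b;
  int m = d >> 31;     /* arithmetic shift: m is -1 if d < 0, else 0 */
  *P = (d ^ m) - m;    /* negates d exactly when it was negative */
}
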
===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf

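A sketch of that strength reduction (assuming X > 0): carry the quotient and
remainder across iterations instead of recomputing them each trip.

int foo2(int N, int ***W, int **TK, int X) {
  int t, i, q = 0, r = 0;          /* q == t / X, r == t % X */

  for (t = 0; t < N; ++t) {
    for (i = 0; i < 4; ++i)
      W[q][i][r] = TK[i][t];
    if (++r == X) { r = 0; ++q; }  /* step the div/rem pair */
  }

  return 5;
}
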
===-------------------------------------------------------------------------===

float foo(float X) { return (int)(X); }

Currently produces:

_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        extsw r2, r2
        std r2, -16(r1)
        lfd f0, -16(r1)
        fcfid f0, f0
        frsp f1, f0
        blr

We could use a target dag combine to turn the lwz/extsw into an lwa when the
lwz has a single use.  Since LWA is cracked anyway, this would be a codesize
win only.

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}

===-------------------------------------------------------------------------===

Complete the signed i32 to FP conversion code using 64-bit registers
transformation, good for PI.  See PPCISelLowering.cpp, this comment:

  // FIXME: disable this lowered code.  This generates 64-bit register values,
  // and we don't model the fact that the top part is clobbered by calls.  We
  // need to flag these together so that the value isn't live across a call.
  //setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);

Also, if the registers are spilled to the stack, we have to ensure that all
64-bits of them are save/restored, otherwise we will miscompile the code.  It
sounds like we need to get the 64-bit register classes going.

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
entry:
        %tmp0 = bitcast %struct.B* %b to i32*           ; <i32*> [#uses=1]
        %tmp = load i32* %tmp0                          ; <i32> [#uses=1]
        %tmp3 = bitcast %struct.B* %b to i32*           ; <i32*> [#uses=1]
        %tmp4 = load i32* %tmp3                         ; <i32> [#uses=1]
        %tmp8 = bitcast %struct.B* %b to i32*           ; <i32*> [#uses=2]
        %tmp9 = load i32* %tmp8                         ; <i32> [#uses=1]
        %tmp4.mask17 = shl i32 %tmp4, i8 1              ; <i32> [#uses=1]
        %tmp1415 = and i32 %tmp4.mask17, 2147483648     ; <i32> [#uses=1]
        %tmp.masked = and i32 %tmp, 2147483648          ; <i32> [#uses=1]
        %tmp11 = or i32 %tmp1415, %tmp.masked           ; <i32> [#uses=1]
        %tmp12 = and i32 %tmp9, 2147483647              ; <i32> [#uses=1]
        %tmp13 = or i32 %tmp12, %tmp11                  ; <i32> [#uses=1]
        store i32 %tmp13, i32* %tmp8
        ret void
}

We emit:

_bar:
        lwz r2, 0(r3)
        slwi r4, r2, 1
        or r4, r4, r2
        rlwimi r2, r4, 0, 0, 0
        stw r2, 0(r3)
        blr

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

_bar:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We compile:

unsigned test6(unsigned x) {
  return ((x & 0x00FF0000) >> 16) | ((x & 0x000000FF) << 16);
}

into:

_test6:
        lis r2, 255
        rlwinm r3, r3, 16, 0, 31
        ori r2, r2, 255
        and r3, r3, r2
        blr

GCC gets it down to:

_test6:
        rlwinm r0,r3,16,8,15
        rlwinm r3,r3,16,24,31
        or r3,r3,r0
        blr

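In C terms (a reading of the GCC sequence; rlwinm rotates left, then ANDs with
the mask named by its MB..ME operands):

unsigned test6_gcc(unsigned x) {
  unsigned rot = (x << 16) | (x >> 16);
  unsigned hi  = rot & 0x00FF0000;  /* rlwinm r0,r3,16,8,15  */
  unsigned lo  = rot & 0x000000FF;  /* rlwinm r3,r3,16,24,31 */
  return hi | lo;                   /* or r3,r3,r0           */
}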

===-------------------------------------------------------------------------===

Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR
register.  This ends up producing code like this:

_foo:
.LBB_foo_0:     ; entry
        mflr r11
***     stw r11, 8(r1)
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
***     lwz r11, 8(r1)
        mtlr r11
        blr

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instrs): spilling it to a GPR is quite enough.

Implementing this will require some codegen improvements.  Nate writes:

582"So basically what we need to support the "no stack frame save and restore" is a
583generalization of the LR optimization to "callee-save regs".
584
585Currently, we have LR marked as a callee-save reg. The register allocator sees
586that it's callee save, and spills it directly to the stack.
587
588Ideally, something like this would happen:
589
590LR would be in a separate register class from the GPRs. The class of LR would be
591marked "unspillable". When the register allocator came across an unspillable
592reg, it would ask "what is the best class to copy this into that I *can* spill"
593If it gets a class back, which it will in this case (the gprs), it grabs a free
594register of that class. If it is then later necessary to spill that reg, so be
595it.
596
===-------------------------------------------------------------------------===

We compile this:
int test(_Bool X) {
  return X ? 524288 : 0;
}

to:
_test:
        cmplwi cr0, r3, 0
        lis r2, 8
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

instead of:
_test:
        addic r2,r3,-1
        subfe r0,r2,r3
        slwi r3,r0,19
        blr

(The addic/subfe pair materializes X as 0 or 1 in r0 without a branch, and the
slwi then scales it to 524288.)

This sort of thing occurs a lot due to globalopt.

===-------------------------------------------------------------------------===