//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform (see the sketch below)
* __builtin_return_address not supported on PPC
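
As a concrete illustration of the do-loop -> bdnz item (a hypothetical example,
not from the original notes): a loop whose trip count is known on entry should
run off the CTR register (mtctr, then bdnz) instead of keeping an explicit
induction variable and comparing it every iteration.

void saxpy(int n, float a, float *x, float *y) {
  /* The trip count n is loop-invariant, so codegen could emit mtctr/bdnz
     (guarded by an n > 0 check) with no per-iteration compare. */
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}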

===-------------------------------------------------------------------------===

Support 'update' load/store instructions.  These are cracked on the G5, but are
still a codesize win.

With preinc enabled, this:

long *%test4(long *%X, long *%dest) {
        %Y = getelementptr long* %X, int 4
        %A = load long* %Y
        store long %A, long* %dest
        ret long* %Y
}

compiles to:

_test4:
        mr r2, r3
        lwzu r5, 32(r2)
        lwz r3, 36(r3)
        stw r5, 0(r4)
        stw r3, 4(r4)
        mr r3, r2
        blr

with -sched=list-burr, I get:

_test4:
        lwz r2, 36(r3)
        lwzu r5, 32(r3)
        stw r2, 4(r4)
        stw r5, 0(r4)
        blr

===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        li r6, 0
        b LBB1_84       ;bb432.i
LBB1_83:                ;bb420.i
        lbzx r8, r5, r7
        addi r6, r7, 1
        stbx r8, r4, r7
LBB1_84:                ;bb432.i
        mr r7, r6
        cmplwi cr0, r7, 143
        bne cr0, LBB1_83        ;bb420.i

The CBE manages to produce:

        li r0, 143
        mtctr r0
loop:
        lbzx r2, r2, r11
        stbx r0, r2, r9
        addi r2, r2, 1
        bdz later
        b loop

This could be much better (bdnz instead of bdz) but it still beats us.  If we
produced this with bdnz, the loop would be a single dispatch group.

===-------------------------------------------------------------------------===

Compile:

void foo(int *P) {
  if (P) *P = 0;
}

into:

_foo:
        cmpwi cr0,r3,0
        beqlr cr0
        li r0,0
        stw r0,0(r3)
        blr

This is effectively a simple form of predication.

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

Here's another example (the sgn function):
double testf(double a) {
  return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

it produces a BB like this:
LBB1_1: ; cond_true
        lis r2, ha16(LCPI1_0)
        lfs f0, lo16(LCPI1_0)(r2)
        lis r2, ha16(LCPI1_1)
        lis r3, ha16(LCPI1_2)
        lfs f2, lo16(LCPI1_2)(r3)
        lfs f3, lo16(LCPI1_1)(r2)
        fsub f0, f0, f1
        fsel f1, f0, f2, f3
        blr

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).
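
A minimal sketch of the effect (a hand-written, hypothetical example, not the
output of any existing pass):

/* Before: each of a, b, c needs its own PIC/GOT address computation. */
static int a, b, c;
int sum_before(void) { return a + b + c; }

/* After squishing: one base address is materialized (and can be CSE'd);
   each field is a constant offset from it. */
static struct { int a, b, c; } g;
int sum_after(void) { return g.a + g.b + g.c; }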

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

Implement Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implementing divide as multiply by reciprocal when it has
more than one use.  Itanium will want this too.
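
A sketch of the refinement step itself (standard Newton-Raphson for the
reciprocal; the starting estimate r0 would come from a hardware estimate
instruction such as fres, and the step count here is an assumption):

/* One Newton-Raphson step for r ~= 1/d is r' = r * (2 - d * r); each step
   roughly doubles the number of correct bits, so a low-precision hardware
   estimate needs only a couple of steps for single precision. */
float refined_reciprocal(float d, float r0) {
  float r1 = r0 * (2.0f - d * r0);
  float r2 = r1 * (2.0f - d * r1);
  return r2;  /* a/d then becomes a * r2 when the divide has multiple uses */
}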

===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240        ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy.  We make this currently:

_f1:
        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31
        or r3, r2, r2
        blr

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction.  The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two-addr instr.

===-------------------------------------------------------------------------===

Compile offsets from allocas:

int *%test() {
        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1
        ret int* %Y
}

into a single add, not two:

_test:
        addi r2, r1, -8
        addi r3, r2, 4
        blr

--> important for C++.

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

  fp = &bar;
  for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int), you get r3, f1, r6, since the 64-bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.
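
A hypothetical before/after of the shuffle, with the register assignments from
the paragraph above as comments:

/* Before: X -> r3, Y -> f1 (shadowing r4-r5), Z -> r6. */
void foo(int X, double Y, int Z);

/* After internalizing and reordering: X -> r3, Z -> r4, Y -> f1 (now
   shadowing r5-r6); the GPRs fill up before any FP args shadow them. */
static void foo_shuffled(int X, int Z, double Y);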

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Compile this:

int foo(int a) {
  int b = (a < 8);
  if (b) {
    return b * 3;     // ignore the fact that this is always 3.
  } else {
    return 2;
  }
}

into something not this:

_foo:
1)      cmpwi cr7, r3, 8
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
1)      cmpwi cr0, r3, 7
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
LBB1_1: ; then
        rlwinm r2, r2, 0, 31, 31
        mulli r3, r2, 3
        blr
LBB1_2: ; UnifiedReturnBlock
        li r3, 2
        blr

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists.  In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it.  That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops.  A simple example:

int foo(int a, int b) { return (a < b) << 4; }

compiles into:

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
        slwi r3, r2, 4
        blr

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so that
this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

which currently compiles to:

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

becomes:

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {
  signed char t = 0;
  if (b) t = *a;
  if (c) *a = t;
}

===-------------------------------------------------------------------------===

This:
int test(unsigned *P) { return *P >> 24; }

Should compile to:

_test:
        lbz r3,0(r3)
        blr

not:

_test:
        lwz r2, 0(r3)
        srwi r3, r2, 24
        blr

(The single lbz works because the target is big-endian: the high byte of the
word is at offset 0.)

===-------------------------------------------------------------------------===

On the G5, logical CR operations are more expensive in their three
address form: ops that read/write the same register are half as expensive as
those that read from two registers that are different from their destination.

We should model this with two separate instructions.  The isel should generate
the "two address" form of the instructions.  When the register allocator
detects that it needs to insert a copy due to the two-addressness of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress.  At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer.
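
For reference (an explanatory aside, not part of the original note), GCC's
sequence is the classic branch-free abs idiom, which in C is:

int abs_diff(int a, int b) {
  int d = a - b;
  int t = d >> 31;      /* arithmetic shift: 0 if d >= 0, -1 if d < 0 */
  return (d ^ t) - t;   /* conditionally negates d without a branch */
}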

This theoretically may help improve twolf slightly (used in dimbox.c:142?).

===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
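
A hand-written sketch of that strength reduction (assuming X > 0, which the
original loop already needs for t / X and t % X to be meaningful):

int foo_reduced(int N, int ***W, int **TK, int X) {
  int q = 0, r = 0;                  /* q == t / X, r == t % X */
  for (int t = 0; t < N; ++t) {
    for (int i = 0; i < 4; ++i)
      W[q][i][r] = TK[i][t];
    if (++r == X) { r = 0; ++q; }    /* replaces the div and the rem */
  }
  return 5;
}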

===-------------------------------------------------------------------------===

float foo(float X) { return (int)(X); }

Currently produces:

_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        extsw r2, r2
        std r2, -16(r1)
        lfd f0, -16(r1)
        fcfid f0, f0
        frsp f1, f0
        blr

We could use a target dag combine to turn the lwz/extsw into an lwa when the
lwz has a single use.  Since LWA is cracked anyway, this would be a codesize
win only.

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}

===-------------------------------------------------------------------------===

Complete the signed i32 to FP conversion code using 64-bit registers
transformation, good for PI.  See PPCISelLowering.cpp, this comment:

     // FIXME: disable this lowered code.  This generates 64-bit register values,
     // and we don't model the fact that the top part is clobbered by calls.  We
     // need to flag these together so that the value isn't live across a call.
     //setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);

Also, if the registers are spilled to the stack, we have to ensure that all
64 bits of them are saved/restored, otherwise we will miscompile the code.  It
sounds like we need to get the 64-bit register classes going.

===-------------------------------------------------------------------------===

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
entry:
        %tmp = bitcast %struct.B* %b to i32*            ; <uint*> [#uses=1]
        %tmp = load i32* %tmp                           ; <uint> [#uses=1]
        %tmp3 = bitcast %struct.B* %b to i32*           ; <uint*> [#uses=1]
        %tmp4 = load i32* %tmp3                         ; <uint> [#uses=1]
        %tmp8 = bitcast %struct.B* %b to i32*           ; <uint*> [#uses=2]
        %tmp9 = load i32* %tmp8                         ; <uint> [#uses=1]
        %tmp4.mask17 = shl i32 %tmp4, i8 1              ; <uint> [#uses=1]
        %tmp1415 = and i32 %tmp4.mask17, 2147483648     ; <uint> [#uses=1]
        %tmp.masked = and i32 %tmp, 2147483648          ; <uint> [#uses=1]
        %tmp11 = or i32 %tmp1415, %tmp.masked           ; <uint> [#uses=1]
        %tmp12 = and i32 %tmp9, 2147483647              ; <uint> [#uses=1]
        %tmp13 = or i32 %tmp12, %tmp11                  ; <uint> [#uses=1]
        store i32 %tmp13, i32* %tmp8
        ret void
}

We emit:

_bar:
        lwz r2, 0(r3)
        slwi r4, r2, 1
        or r4, r4, r2
        rlwimi r2, r4, 0, 0, 0
        stw r2, 0(r3)
        blr

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

_bar:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We compile:

unsigned test6(unsigned x) {
  return ((x & 0x00FF0000) >> 16) | ((x & 0x000000FF) << 16);
}

into:

_test6:
        lis r2, 255
        rlwinm r3, r3, 16, 0, 31
        ori r2, r2, 255
        and r3, r3, r2
        blr

GCC gets it down to:

_test6:
        rlwinm r0,r3,16,8,15
        rlwinm r3,r3,16,24,31
        or r3,r3,r0
        blr

===-------------------------------------------------------------------------===

Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR register.
This ends up producing code like this:

_foo:
.LBB_foo_0:     ; entry
        mflr r11
***     stw r11, 8(r1)
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
***     lwz r11, 8(r1)
        mtlr r11
        blr

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instrs): spilling it to a GPR is quite enough.

Implementing this will require some codegen improvements.  Nate writes:

"So basically what we need to support the "no stack frame save and restore" is a
generalization of the LR optimization to "callee-save regs".

Currently, we have LR marked as a callee-save reg.  The register allocator sees
that it's callee save, and spills it directly to the stack.

Ideally, something like this would happen:

LR would be in a separate register class from the GPRs.  The class of LR would
be marked "unspillable".  When the register allocator came across an unspillable
reg, it would ask "what is the best class to copy this into that I *can* spill?"
If it gets a class back, which it will in this case (the gprs), it grabs a free
register of that class.  If it is then later necessary to spill that reg, so be
it."

===-------------------------------------------------------------------------===

We compile this:
int test(_Bool X) {
  return X ? 524288 : 0;
}

to:
_test:
        cmplwi cr0, r3, 0
        lis r2, 8
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

instead of:
_test:
        addic r2,r3,-1
        subfe r0,r2,r3
        slwi r3,r0,19
        blr

This sort of thing occurs a lot due to globalopt.
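
(An aside, not from the original note: since _Bool is already 0 or 1, the
branch-free form is just a shift at the C level, which is what the second
sequence computes.)

int test_branchless(_Bool X) {
  return (int)X << 19;   /* 524288 == 1 << 19 */
}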

===-------------------------------------------------------------------------===

We currently compile 32-bit bswap:

declare i32 @llvm.bswap.i32(i32 %A)
define i32 @test(i32 %A) {
        %B = call i32 @llvm.bswap.i32(i32 %A)
        ret i32 %B
}

to:

_test:
        rlwinm r2, r3, 24, 16, 23
        slwi r4, r3, 24
        rlwimi r2, r3, 8, 24, 31
        rlwimi r4, r3, 8, 8, 15
        rlwimi r4, r2, 0, 16, 31
        mr r3, r4
        blr

it would be more efficient to produce:

_test:  mr r0,r3
        rlwinm r3,r3,8,0xffffffff
        rlwimi r3,r0,24,0,7
        rlwimi r3,r0,24,16,23
        blr

===-------------------------------------------------------------------------===

test/CodeGen/PowerPC/2007-03-24-cntlzd.ll compiles to:

__ZNK4llvm5APInt17countLeadingZerosEv:
        ld r2, 0(r3)
        cntlzd r2, r2
        or r2, r2, r2     <<-- silly.
        addi r3, r2, -64
        blr

The dead or is a 'truncate' from 64- to 32-bits.

===-------------------------------------------------------------------------===

We generate horrible ppc code for this:

#define N  2000000
double   a[N],c[N];
void simpleloop() {
  int j;
  for (j=0; j<N; j++)
    c[j] = a[j];
}

LBB1_1: ;bb
        lfdx f0, r3, r4
        addi r5, r5, 1        ;; Extra IV for the exit value compare.
        stfdx f0, r2, r4
        addi r4, r4, 8

        xoris r6, r5, 30      ;; This is due to a large immediate.
        cmplwi cr0, r6, 33920
        bne cr0, LBB1_1

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

__Z11no_overflowjj:
        add r4,r3,r4
        subfc r3,r3,r4
        li r3,0
        adde r3,r3,r3
        blr

(or better) not:

__Z11no_overflowjj:
        add r2, r4, r3
        cmplw cr7, r2, r3
        mfcr r2
        rlwinm r2, r2, 29, 31, 31
        xori r3, r2, 1
        blr
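
(An explanatory aside, not part of the original note: at the source level the
check reduces to a single unsigned compare, which the desired sequence computes
exactly via the carry bit.)

int no_overflow_simple(unsigned a, unsigned b) {
  return a + b >= a;   /* a + b wraps around (carry out) iff a + b < a */
}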

//===---------------------------------------------------------------------===//