blob: 42d701ef01ba9210bf053dcbe6032eef90d30df1 [file] [log] [blame]
TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform
* implement powerpc-64 for darwin
Nate Begeman50fb3c42005-12-24 01:00:15 +00005
Nate Begemana63fee82006-02-03 05:17:06 +00006===-------------------------------------------------------------------------===
Nate Begeman50fb3c42005-12-24 01:00:15 +00007
Nate Begemana63fee82006-02-03 05:17:06 +00008Use the stfiwx instruction for:
Chris Lattnerb65975a2005-07-26 19:07:51 +00009
Nate Begemana63fee82006-02-03 05:17:06 +000010void foo(float a, int *b) { *b = a; }
11
12===-------------------------------------------------------------------------===
13
Nate Begeman5a014812005-08-14 01:17:16 +000014unsigned short foo(float a) { return a; }
Nate Begemana63fee82006-02-03 05:17:06 +000015should be:
Nate Begeman5a014812005-08-14 01:17:16 +000016_foo:
17 fctiwz f0,f1
18 stfd f0,-8(r1)
19 lhz r3,-2(r1)
20 blr
21not:
22_foo:
23 fctiwz f0, f1
24 stfd f0, -8(r1)
25 lwz r2, -4(r1)
26 rlwinm r3, r2, 0, 16, 31
27 blr
28
Nate Begemana63fee82006-02-03 05:17:06 +000029===-------------------------------------------------------------------------===
Chris Lattner6281ae42005-08-05 19:18:32 +000030
Nate Begemana63fee82006-02-03 05:17:06 +000031Support 'update' load/store instructions. These are cracked on the G5, but are
32still a codesize win.
33
34===-------------------------------------------------------------------------===
35
36Should hint to the branch select pass that it doesn't need to print the second
37unconditional branch, so we don't end up with things like:
        b .LBBl42__2E_expand_function_8_674     ; loopentry.24
        b .LBBl42__2E_expand_function_8_42      ; NewDefault
        b .LBBl42__2E_expand_function_8_42      ; NewDefault
Chris Lattner424dcbd2005-08-23 06:27:59 +000041
Chris Lattnera3c44542005-08-24 18:15:24 +000042===-------------------------------------------------------------------------===
43
Chris Lattner424dcbd2005-08-23 06:27:59 +000044* Codegen this:
45
46 void test2(int X) {
47 if (X == 0x12345678) bar();
48 }
49
50 as:
51
52 xoris r0,r3,0x1234
53 cmpwi cr0,r0,0x5678
54 beq cr0,L6
55
56 not:
57
58 lis r2, 4660
59 ori r2, r2, 22136
60 cmpw cr0, r3, r2
61 bne .LBB_test2_2
62
Chris Lattnera3c44542005-08-24 18:15:24 +000063===-------------------------------------------------------------------------===
64
65Lump the constant pool for each function into ONE pic object, and reference
66pieces of it as offsets from the start. For functions like this (contrived
67to have lots of constants obviously):
68
69double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }
70
71We generate:
72
73_X:
74 lis r2, ha16(.CPI_X_0)
75 lfd f0, lo16(.CPI_X_0)(r2)
76 lis r2, ha16(.CPI_X_1)
77 lfd f2, lo16(.CPI_X_1)(r2)
78 fmadd f0, f1, f0, f2
79 lis r2, ha16(.CPI_X_2)
80 lfd f1, lo16(.CPI_X_2)(r2)
81 lis r2, ha16(.CPI_X_3)
82 lfd f2, lo16(.CPI_X_3)(r2)
83 fmadd f1, f0, f1, f2
84 blr
85
86It would be better to materialize .CPI_X into a register, then use immediates
87off of the register to avoid the lis's. This is even more important in PIC
88mode.
89
Chris Lattner39b248b2006-02-02 23:50:22 +000090Note that this (and the static variable version) is discussed here for GCC:
91http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
92
Chris Lattnera3c44542005-08-24 18:15:24 +000093===-------------------------------------------------------------------------===
Nate Begeman92cce902005-09-06 15:30:48 +000094
Chris Lattner33c1dab2006-02-03 06:22:11 +000095PIC Code Gen IPO optimization:
96
97Squish small scalar globals together into a single global struct, allowing the
98address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
99of the GOT on targets with one).
100
101Note that this is discussed here for GCC:
102http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
103
104===-------------------------------------------------------------------------===
105
Implement Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implementing divide as multiply by reciprocal when it has
more than one use.  Itanium will want this too.
Nate Begeman21e463b2005-10-16 05:39:50 +0000109
110===-------------------------------------------------------------------------===
111
Nate Begeman5cd61ce2005-10-25 23:50:02 +0000112#define ARRAY_LENGTH 16
113
114union bitfield {
115 struct {
116#ifndef __ppc__
117 unsigned int field0 : 6;
118 unsigned int field1 : 6;
119 unsigned int field2 : 6;
120 unsigned int field3 : 6;
121 unsigned int field4 : 3;
122 unsigned int field5 : 4;
123 unsigned int field6 : 1;
124#else
125 unsigned int field6 : 1;
126 unsigned int field5 : 4;
127 unsigned int field4 : 3;
128 unsigned int field3 : 6;
129 unsigned int field2 : 6;
130 unsigned int field1 : 6;
131 unsigned int field0 : 6;
132#endif
133 } bitfields, bits;
134 unsigned int u32All;
135 signed int i32All;
136 float f32All;
137};
138
139
140typedef struct program_t {
141 union bitfield array[ARRAY_LENGTH];
142 int size;
143 int loaded;
144} program;
145
146
147void AdjustBitfields(program* prog, unsigned int fmt1)
148{
149 unsigned int shift = 0;
150 unsigned int texCount = 0;
151 unsigned int i;
152
153 for (i = 0; i < 8; i++)
154 {
155 prog->array[i].bitfields.field0 = texCount;
156 prog->array[i].bitfields.field1 = texCount + 1;
157 prog->array[i].bitfields.field2 = texCount + 2;
158 prog->array[i].bitfields.field3 = texCount + 3;
159
160 texCount += (fmt1 >> shift) & 0x7;
161 shift += 3;
162 }
163}
164
165In the loop above, the bitfield adds get generated as
166(add (shl bitfield, C1), (shl C2, C1)) where C2 is 1, 2 or 3.
167
168Since the input to the (or and, and) is an (add) rather than a (shl), the shift
169doesn't get folded into the rlwimi instruction. We should ideally see through
170things like this, rather than forcing llvm to generate the equivalent
171
172(shl (add bitfield, C2), C1) with some kind of mask.
Chris Lattner01959102005-10-28 00:20:45 +0000173
174===-------------------------------------------------------------------------===
175
Chris Lattnerae4664a2005-11-05 08:57:56 +0000176Compile this:
177
178int %f1(int %a, int %b) {
179 %tmp.1 = and int %a, 15 ; <int> [#uses=1]
180 %tmp.3 = and int %b, 240 ; <int> [#uses=1]
181 %tmp.4 = or int %tmp.3, %tmp.1 ; <int> [#uses=1]
182 ret int %tmp.4
183}
184
185without a copy. We make this currently:
186
187_f1:
188 rlwinm r2, r4, 0, 24, 27
189 rlwimi r2, r3, 0, 28, 31
190 or r3, r2, r2
191 blr
192
193The two-addr pass or RA needs to learn when it is profitable to commute an
194instruction to avoid a copy AFTER the 2-addr instruction. The 2-addr pass
195currently only commutes to avoid inserting a copy BEFORE the two addr instr.
196
Chris Lattner62c08dd2005-12-08 07:13:28 +0000197===-------------------------------------------------------------------------===
198
Nate Begemaneb20ed62006-01-28 01:22:10 +0000199176.gcc contains a bunch of code like this (this occurs dozens of times):
200
201int %test(uint %mode.0.i.0) {
202 %tmp.79 = cast uint %mode.0.i.0 to sbyte ; <sbyte> [#uses=1]
203 %tmp.80 = cast sbyte %tmp.79 to int ; <int> [#uses=1]
204 %tmp.81 = shl int %tmp.80, ubyte 16 ; <int> [#uses=1]
205 %tmp.82 = and int %tmp.81, 16711680
206 ret int %tmp.82
207}
208
209which we compile to:
210
211_test:
212 extsb r2, r3
213 rlwinm r3, r2, 16, 8, 15
214 blr
215
216The extsb is obviously dead. This can be handled by a future thing like
217MaskedValueIsZero that checks to see if bits are ever demanded (in this case,
218the sign bits are never used, so we can fold the sext_inreg to nothing).
219
220I'm seeing code like this:
221
222 srwi r3, r3, 16
223 extsb r3, r3
224 rlwimi r4, r3, 16, 8, 15
225
226in which the extsb is preventing the srwi from being nuked.
227
228===-------------------------------------------------------------------------===
229
230Another example that occurs is:
231
232uint %test(int %specbits.6.1) {
233 %tmp.2540 = shr int %specbits.6.1, ubyte 11 ; <int> [#uses=1]
234 %tmp.2541 = cast int %tmp.2540 to uint ; <uint> [#uses=1]
235 %tmp.2542 = shl uint %tmp.2541, ubyte 13 ; <uint> [#uses=1]
236 %tmp.2543 = and uint %tmp.2542, 8192 ; <uint> [#uses=1]
237 ret uint %tmp.2543
238}
239
240which we codegen as:
241
_test:
        srawi r2, r3, 11
        rlwinm r3, r2, 13, 18, 18
        blr
246
247the srawi can be nuked by turning the SAR into a logical SHR (the sext bits are
248dead), which I think can then be folded into the rlwinm.
249
250===-------------------------------------------------------------------------===
251
Chris Lattner62c08dd2005-12-08 07:13:28 +0000252Compile offsets from allocas:
253
254int *%test() {
255 %X = alloca { int, int }
256 %Y = getelementptr {int,int}* %X, int 0, uint 1
257 ret int* %Y
258}
259
260into a single add, not two:
261
262_test:
263 addi r2, r1, -8
264 addi r3, r2, 4
265 blr
266
267--> important for C++.
268
Chris Lattner39706e62005-12-22 17:19:28 +0000269===-------------------------------------------------------------------------===
270
271int test3(int a, int b) { return (a < 0) ? a : 0; }
272
273should be branch free code. LLVM is turning it into < 1 because of the RHS.
274
275===-------------------------------------------------------------------------===
276
Chris Lattner39706e62005-12-22 17:19:28 +0000277No loads or stores of the constants should be needed:
278
279struct foo { double X, Y; };
280void xxx(struct foo F);
281void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }
282
Chris Lattner1db4b4f2006-01-16 17:53:00 +0000283===-------------------------------------------------------------------------===
284
Chris Lattner98fbc2f2006-01-16 17:58:54 +0000285Darwin Stub LICM optimization:
286
287Loops like this:
288
289 for (...) bar();
290
291Have to go through an indirect stub if bar is external or linkonce. It would
292be better to compile it as:
293
294 fp = &bar;
295 for (...) fp();
296
297which only computes the address of bar once (instead of each time through the
298stub). This is Darwin specific and would have to be done in the code generator.
299Probably not a win on x86.
300
301===-------------------------------------------------------------------------===
302
303PowerPC i1/setcc stuff (depends on subreg stuff):
304
305Check out the PPC code we get for 'compare' in this testcase:
306http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19672
307
308oof. on top of not doing the logical crnand instead of (mfcr, mfcr,
309invert, invert, or), we then have to compare it against zero instead of
310using the value already in a CR!
311
312that should be something like
313 cmpw cr7, r8, r5
314 cmpw cr0, r7, r3
315 crnand cr0, cr0, cr7
316 bne cr0, LBB_compare_4
317
318instead of
319 cmpw cr7, r8, r5
320 cmpw cr0, r7, r3
321 mfcr r7, 1
322 mcrf cr7, cr0
323 mfcr r8, 1
324 rlwinm r7, r7, 30, 31, 31
325 rlwinm r8, r8, 30, 31, 31
326 xori r7, r7, 1
327 xori r8, r8, 1
328 addi r2, r2, 1
329 or r7, r8, r7
330 cmpwi cr0, r7, 0
331 bne cr0, LBB_compare_4 ; loopexit
332
333===-------------------------------------------------------------------------===
334
335Simple IPO for argument passing, change:
336 void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)
337
338the Darwin ABI specifies that any integer arguments in the first 32 bytes worth
339of arguments get assigned to r3 through r10. That is, if you have a function
340foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
341argument bytes for r4 and r5. The trick then would be to shuffle the argument
342order for functions we can internalize so that the maximum number of
343integers/pointers get passed in regs before you see any of the fp arguments.
344
345Instead of implementing this, it would actually probably be easier to just
346implement a PPC fastcc, where we could do whatever we wanted to the CC,
347including having this work sanely.
348
349===-------------------------------------------------------------------------===
350
351Fix Darwin FP-In-Integer Registers ABI
352
353Darwin passes doubles in structures in integer registers, which is very very
354bad. Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
355that percolates these things out of functions.
356
357Check out how horrible this is:
358http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html
359
360This is an extension of "interprocedural CC unmunging" that can't be done with
361just fastcc.
362
363===-------------------------------------------------------------------------===
364
Chris Lattner3cda14f2006-01-19 02:09:38 +0000365Generate lwbrx and other byteswapping load/store instructions when reasonable.
366
Chris Lattner96909792006-01-28 05:40:47 +0000367===-------------------------------------------------------------------------===
368
369Implement TargetConstantVec, and set up PPC to custom lower ConstantVec into
370TargetConstantVec's if it's one of the many forms that are algorithmically
371computable using the spiffy altivec instructions.
372
Chris Lattner56b69642006-01-31 02:55:28 +0000373===-------------------------------------------------------------------------===
374
375Compile this:
376
377double %test(double %X) {
378 %Y = cast double %X to long
379 %Z = cast long %Y to double
380 ret double %Z
381}
382
383to this:
384
385_test:
386 fctidz f0, f1
387 stfd f0, -8(r1)
388 lwz r2, -4(r1)
389 lwz r3, -8(r1)
390 stw r2, -12(r1)
391 stw r3, -16(r1)
392 lfd f0, -16(r1)
393 fcfid f1, f0
394 blr
395
396without the lwz/stw's.
397
Chris Lattner83e64ba2006-01-31 07:16:34 +0000398===-------------------------------------------------------------------------===
399
400Compile this:
401
402int foo(int a) {
403 int b = (a < 8);
404 if (b) {
405 return b * 3; // ignore the fact that this is always 3.
406 } else {
407 return 2;
408 }
409}
410
411into something not this:
412
413_foo:
4141) cmpwi cr7, r3, 8
415 mfcr r2, 1
416 rlwinm r2, r2, 29, 31, 31
4171) cmpwi cr0, r3, 7
418 bgt cr0, LBB1_2 ; UnifiedReturnBlock
419LBB1_1: ; then
420 rlwinm r2, r2, 0, 31, 31
421 mulli r3, r2, 3
422 blr
423LBB1_2: ; UnifiedReturnBlock
424 li r3, 2
425 blr
426
427In particular, the two compares (marked 1) could be shared by reversing one.
428This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
429same operands (but backwards) exists. In this case, this wouldn't save us
430anything though, because the compares still wouldn't be shared.
Chris Lattner0ddc1802006-02-01 00:28:12 +0000431
Chris Lattner5a7efc92006-02-01 17:54:23 +0000432===-------------------------------------------------------------------------===
433
434The legalizer should lower this:
435
436bool %test(ulong %x) {
437 %tmp = setlt ulong %x, 4294967296
438 ret bool %tmp
439}
440
441into "if x.high == 0", not:
442
443_test:
444 addi r2, r3, -1
445 cntlzw r2, r2
446 cntlzw r3, r3
447 srwi r2, r2, 5
Nate Begeman93c740b2006-02-02 07:27:56 +0000448 srwi r4, r3, 5
449 li r3, 0
Chris Lattner5a7efc92006-02-01 17:54:23 +0000450 cmpwi cr0, r2, 0
451 bne cr0, LBB1_2 ;
452LBB1_1:
Nate Begeman93c740b2006-02-02 07:27:56 +0000453 or r3, r4, r4
Chris Lattner5a7efc92006-02-01 17:54:23 +0000454LBB1_2:
Chris Lattner5a7efc92006-02-01 17:54:23 +0000455 blr
456
457noticed in 2005-05-11-Popcount-ffs-fls.c.
Chris Lattner275b8842006-02-02 07:37:11 +0000458
459
460===-------------------------------------------------------------------------===
461
462We should custom expand setcc instead of pretending that we have it. That
463would allow us to expose the access of the crbit after the mfcr, allowing
464that access to be trivially folded into other ops. A simple example:
465
466int foo(int a, int b) { return (a < b) << 4; }
467
468compiles into:
469
470_foo:
471 cmpw cr7, r3, r4
472 mfcr r2, 1
473 rlwinm r2, r2, 29, 31, 31
474 slwi r3, r2, 4
475 blr
476
Chris Lattnerd463f7f2006-02-03 01:49:49 +0000477===-------------------------------------------------------------------------===
478
Nate Begemana63fee82006-02-03 05:17:06 +0000479Fold add and sub with constant into non-extern, non-weak addresses so this:
480
481static int a;
482void bar(int b) { a = b; }
483void foo(unsigned char *c) {
484 *c = a;
485}
486
487So that
488
489_foo:
490 lis r2, ha16(_a)
491 la r2, lo16(_a)(r2)
492 lbz r2, 3(r2)
493 stb r2, 0(r3)
494 blr
495
496Becomes
497
498_foo:
499 lis r2, ha16(_a+3)
500 lbz r2, lo16(_a+3)(r2)
501 stb r2, 0(r3)
502 blr