TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform (see the sketch below)
* implement powerpc-64 for darwin
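
A rough sketch of the bdnz transform (hypothetical loop; register choices made
up): a counted loop such as "for (i = 0; i < 100; ++i) body();" is typically
lowered with an explicit induction variable (addi/cmpw/bne), but could instead
put the trip count in CTR and use the fused decrement-and-branch instruction:

        li r2, 100
        mtctr r2
LBB_loop:
        ...                     ; loop body
        bdnz LBB_loop           ; decrement CTR, branch if CTR != 0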

===-------------------------------------------------------------------------===

Use the stfiwx instruction for:

void foo(float a, int *b) { *b = a; }
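
A sketch of the desired output (assuming the usual Darwin argument registers,
f1 = a, r4 = b); stfiwx stores the integer word from the FPR directly, with no
stack round-trip:

_foo:
        fctiwz f0, f1           ; convert to int with round-toward-zero
        stfiwx f0, 0, r4        ; store the low 32 bits of f0 to *b
        blr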

===-------------------------------------------------------------------------===

unsigned short foo(float a) { return a; }
should be:
_foo:
        fctiwz f0,f1
        stfd f0,-8(r1)
        lhz r3,-2(r1)
        blr
not:
_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        rlwinm r3, r2, 0, 16, 31
        blr

===-------------------------------------------------------------------------===

Support 'update' load/store instructions.  These are cracked on the G5, but are
still a codesize win.
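
A small sketch of the win (made-up registers): a pointer-bumping loop that
emits the pair

        lwz r2, 4(r3)
        addi r3, r3, 4

each iteration could instead use the update form, which folds the add into
the load:

        lwzu r2, 4(r3)          ; loads from r3+4 and writes r3+4 back to r3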

===-------------------------------------------------------------------------===

Should hint to the branch select pass that it doesn't need to print the second
unconditional branch, so we don't end up with things like:
        b .LBBl42__2E_expand_function_8_674     ; loopentry.24
        b .LBBl42__2E_expand_function_8_42      ; NewDefault
        b .LBBl42__2E_expand_function_8_42      ; NewDefault

This occurs in SPASS.

===-------------------------------------------------------------------------===

* Codegen this:

   void test2(int X) {
     if (X == 0x12345678) bar();
   }

    as:

        xoris r0,r3,0x1234
        cmpwi cr0,r0,0x5678
        beq cr0,L6

    not:

        lis r2, 4660
        ori r2, r2, 22136
        cmpw cr0, r3, r2
        bne .LBB_test2_2

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.
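
A sketch of the idea (the .CPI_X label and the 8-byte offsets here are
illustrative, not what the compiler emits today):

_X:
        lis r2, ha16(.CPI_X)
        la r2, lo16(.CPI_X)(r2)
        lfd f0, 0(r2)
        lfd f2, 8(r2)
        fmadd f0, f1, f0, f2
        lfd f1, 16(r2)
        lfd f2, 24(r2)
        fmadd f1, f0, f1, f2
        blr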

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).
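
Roughly (names invented for illustration):

/* before: each global needs its own PIC/GOT address computation */
static int a, b, c;

/* after (conceptually): one base address, CSE'd, plus small constant offsets */
static struct { int a, b, c; } merged;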

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implement divide as multiply by reciprocal when the
reciprocal has more than one use.  Itanium will want this too.
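
A minimal sketch of the refinement step (plain C, not tied to any particular
estimate instruction): starting from an estimate x0 of 1/d, each Newton-Raphson
iteration roughly doubles the number of correct bits:

float refine_recip(float d, float x0) {
  /* x1 = x0 * (2 - d*x0); apply once or twice after the hardware estimate */
  return x0 * (2.0f - d * x0);
}

a/b is then a * recip(b), which pays off when the same reciprocal feeds
several divides.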

===-------------------------------------------------------------------------===

#define ARRAY_LENGTH 16

union bitfield {
    struct {
#ifndef __ppc__
        unsigned int field0 : 6;
        unsigned int field1 : 6;
        unsigned int field2 : 6;
        unsigned int field3 : 6;
        unsigned int field4 : 3;
        unsigned int field5 : 4;
        unsigned int field6 : 1;
#else
        unsigned int field6 : 1;
        unsigned int field5 : 4;
        unsigned int field4 : 3;
        unsigned int field3 : 6;
        unsigned int field2 : 6;
        unsigned int field1 : 6;
        unsigned int field0 : 6;
#endif
    } bitfields, bits;
    unsigned int u32All;
    signed int i32All;
    float f32All;
};


typedef struct program_t {
    union bitfield array[ARRAY_LENGTH];
    int size;
    int loaded;
} program;


void AdjustBitfields(program* prog, unsigned int fmt1)
{
    unsigned int shift = 0;
    unsigned int texCount = 0;
    unsigned int i;

    for (i = 0; i < 8; i++)
    {
        prog->array[i].bitfields.field0 = texCount;
        prog->array[i].bitfields.field1 = texCount + 1;
        prog->array[i].bitfields.field2 = texCount + 2;
        prog->array[i].bitfields.field3 = texCount + 3;

        texCount += (fmt1 >> shift) & 0x7;
        shift += 3;
    }
}

In the loop above, the bitfield adds get generated as
(add (shl bitfield, C1), (shl C2, C1)) where C2 is 1, 2 or 3.

Since the input to the (or and, and) is an (add) rather than a (shl), the shift
doesn't get folded into the rlwimi instruction.  We should ideally see through
things like this, rather than forcing llvm to generate the equivalent

(shl (add bitfield, C2), C1) with some kind of mask.

===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240        ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy.  We make this currently:

_f1:
        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31
        or r3, r2, r2
        blr

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction.  The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two addr instr.

===-------------------------------------------------------------------------===

176.gcc contains a bunch of code like this (this occurs dozens of times):

int %test(uint %mode.0.i.0) {
        %tmp.79 = cast uint %mode.0.i.0 to sbyte        ; <sbyte> [#uses=1]
        %tmp.80 = cast sbyte %tmp.79 to int             ; <int> [#uses=1]
        %tmp.81 = shl int %tmp.80, ubyte 16             ; <int> [#uses=1]
        %tmp.82 = and int %tmp.81, 16711680
        ret int %tmp.82
}

which we compile to:

_test:
        extsb r2, r3
        rlwinm r3, r2, 16, 8, 15
        blr

The extsb is obviously dead.  This can be handled by a future thing like
MaskedValueIsZero that checks to see if bits are ever demanded (in this case,
the sign bits are never used, so we can fold the sext_inreg to nothing).

I'm seeing code like this:

        srwi r3, r3, 16
        extsb r3, r3
        rlwimi r4, r3, 16, 8, 15

in which the extsb is preventing the srwi from being nuked.

===-------------------------------------------------------------------------===

Another example that occurs is:

uint %test(int %specbits.6.1) {
        %tmp.2540 = shr int %specbits.6.1, ubyte 11     ; <int> [#uses=1]
        %tmp.2541 = cast int %tmp.2540 to uint          ; <uint> [#uses=1]
        %tmp.2542 = shl uint %tmp.2541, ubyte 13        ; <uint> [#uses=1]
        %tmp.2543 = and uint %tmp.2542, 8192            ; <uint> [#uses=1]
        ret uint %tmp.2543
}

which we codegen as:

l1_test:
        srawi r2, r3, 11
        rlwinm r3, r2, 13, 18, 18
        blr

the srawi can be nuked by turning the SAR into a logical SHR (the sext bits are
dead), which I think can then be folded into the rlwinm.

===-------------------------------------------------------------------------===
253
Chris Lattner62c08dd2005-12-08 07:13:28 +0000254Compile offsets from allocas:
255
256int *%test() {
257 %X = alloca { int, int }
258 %Y = getelementptr {int,int}* %X, int 0, uint 1
259 ret int* %Y
260}
261
262into a single add, not two:
263
264_test:
265 addi r2, r1, -8
266 addi r3, r2, 4
267 blr
268
269--> important for C++.
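
Presumably something like:

_test:
        addi r3, r1, -4
        blr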

===-------------------------------------------------------------------------===

int test3(int a, int b) { return (a < 0) ? a : 0; }

should be branch free code.  LLVM is turning it into < 1 because of the RHS.
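
A branch-free sketch (assuming a arrives in r3): (a < 0) ? a : 0 is just
a & (a >> 31), since the arithmetic shift yields all ones when a is negative
and zero otherwise:

_test3:
        srawi r2, r3, 31        ; r2 = (a < 0) ? -1 : 0
        and r3, r3, r2
        blr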

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

Have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

  fp = &bar;
  for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

PowerPC i1/setcc stuff (depends on subreg stuff):

Check out the PPC code we get for 'compare' in this testcase:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19672

oof.  on top of not doing the logical crnand instead of (mfcr, mfcr,
invert, invert, or), we then have to compare it against zero instead of
using the value already in a CR!

that should be something like
        cmpw cr7, r8, r5
        cmpw cr0, r7, r3
        crnand cr0, cr0, cr7
        bne cr0, LBB_compare_4

instead of
        cmpw cr7, r8, r5
        cmpw cr0, r7, r3
        mfcr r7, 1
        mcrf cr7, cr0
        mfcr r8, 1
        rlwinm r7, r7, 30, 31, 31
        rlwinm r8, r8, 30, 31, 31
        xori r7, r7, 1
        xori r8, r8, 1
        addi r2, r2, 1
        or r7, r8, r7
        cmpwi cr0, r7, 0
        bne cr0, LBB_compare_4  ; loopexit

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Generate lwbrx and other byteswapping load/store instructions when reasonable.
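
For example (a generic byte-swapping load, not taken from this file):

unsigned int load_le32(unsigned int *p) {
  unsigned int x = *p;
  return (x >> 24) | ((x >> 8) & 0xff00) |
         ((x << 8) & 0xff0000) | (x << 24);
}

could plausibly compile to:

_load_le32:
        lwbrx r3, 0, r3         ; byte-reversed load
        blr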
368
Chris Lattner96909792006-01-28 05:40:47 +0000369===-------------------------------------------------------------------------===
370
371Implement TargetConstantVec, and set up PPC to custom lower ConstantVec into
372TargetConstantVec's if it's one of the many forms that are algorithmically
373computable using the spiffy altivec instructions.
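
For instance (illustrative only), a v4i32 constant of all 1's or all 0's never
needs a constant pool load:

        vspltisw v2, 1          ; v2 = <1, 1, 1, 1>
        vxor v3, v3, v3         ; v3 = <0, 0, 0, 0>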

===-------------------------------------------------------------------------===

Compile this:

double %test(double %X) {
        %Y = cast double %X to long
        %Z = cast long %Y to double
        ret double %Z
}

to this:

_test:
        fctidz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        lwz r3, -8(r1)
        stw r2, -12(r1)
        stw r3, -16(r1)
        lfd f0, -16(r1)
        fcfid f1, f0
        blr

without the lwz/stw's.

===-------------------------------------------------------------------------===

Compile this:

int foo(int a) {
  int b = (a < 8);
  if (b) {
    return b * 3;     // ignore the fact that this is always 3.
  } else {
    return 2;
  }
}

into something not this:

_foo:
1)      cmpwi cr7, r3, 8
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
1)      cmpwi cr0, r3, 7
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
LBB1_1: ; then
        rlwinm r2, r2, 0, 31, 31
        mulli r3, r2, 3
        blr
LBB1_2: ; UnifiedReturnBlock
        li r3, 2
        blr

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists.  In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

The legalizer should lower this:

bool %test(ulong %x) {
  %tmp = setlt ulong %x, 4294967296
  ret bool %tmp
}

into "if x.high == 0", not:

_test:
        addi r2, r3, -1
        cntlzw r2, r2
        cntlzw r3, r3
        srwi r2, r2, 5
        srwi r4, r3, 5
        li r3, 0
        cmpwi cr0, r2, 0
        bne cr0, LBB1_2 ;
LBB1_1:
        or r3, r4, r4
LBB1_2:
        blr

noticed in 2005-05-11-Popcount-ffs-fls.c.
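
A sketch of the branch-free form (assuming the high word of %x arrives in r3):
x < 2^32 iff the high word is zero, and "high == 0" is cntlzw(high) >> 5:

_test:
        cntlzw r2, r3
        srwi r3, r2, 5
        blr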

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it.  That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops.  A simple example:

int foo(int a, int b) { return (a < b) << 4; }

compiles into:

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
        slwi r3, r2, 4
        blr

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

So that

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

Becomes

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr