Blame - llvm/lib/Target/PowerPC/README.txt - toolchain/llvm-project

TODO:
* gpr0 allocation
* implement do-loop -> bdnz transform
* implement powerpc-64 for darwin

===-------------------------------------------------------------------------===

Use the stfiwx instruction for:

void foo(float a, int *b) { *b = a; }

===-------------------------------------------------------------------------===

unsigned short foo(float a) { return a; }
should be:
_foo:
        fctiwz f0,f1
        stfd f0,-8(r1)
        lhz r3,-2(r1)
        blr
not:
_foo:
        fctiwz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        rlwinm r3, r2, 0, 16, 31
        blr

===-------------------------------------------------------------------------===

Support 'update' load/store instructions. These are cracked on the G5, but are
still a codesize win.

===-------------------------------------------------------------------------===

Should hint to the branch select pass that it doesn't need to print the second
unconditional branch, so we don't end up with things like:
        b .LBBl42__2E_expand_function_8_674 ; loopentry.24
        b .LBBl42__2E_expand_function_8_42 ; NewDefault
        b .LBBl42__2E_expand_function_8_42 ; NewDefault

This occurs in SPASS.

===-------------------------------------------------------------------------===

* Codegen this:

   void test2(int X) {
     if (X == 0x12345678) bar();
   }

    as:

       xoris r0,r3,0x1234
       cmpwi cr0,r0,0x5678
       beq cr0,L6

    not:

        lis r2, 4660
        ori r2, r2, 22136
        cmpw cr0, r3, r2
        bne .LBB_test2_2
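
The xoris/cmpwi sequence relies on X == 0x12345678 being equivalent to
(X ^ 0x12340000) == 0x5678. A minimal C sketch of that identity follows; the
helper name is hypothetical and not part of LLVM. It works directly here
because 0x5678 fits in the signed 16-bit immediate of cmpwi.

```c
#include <stdint.h>

/* Hypothetical helper: models the xoris + cmpwi sequence in C. */
int eq_const_via_xor(uint32_t x) {
    uint32_t t = x ^ 0x12340000u;   /* xoris r0, r3, 0x1234 */
    return t == 0x5678u;            /* cmpwi cr0, r0, 0x5678 */
}
```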

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start. For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's. This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
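
A source-level sketch of the same idea, assuming a hypothetical pool named
X_pool: all four constants live in one object, the base address is formed
once, and each constant is then a fixed offset from it. This only illustrates
the intended layout; it is not how the code generator would express it.

```c
/* Hypothetical layout: all of X's constants in a single pool object. */
static const double X_pool[4] = { 1.23, 4.512, 2.34, 14.38 };

double X(double Y) {
    const double *cp = X_pool;   /* base address materialized once */
    /* Each constant is a load at a fixed offset (0, 8, 16, 24) from cp. */
    return (Y * cp[0] + cp[1]) * cp[2] + cp[3];
}
```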

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
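
What the transformed program would look like at the source level, as a rough
sketch. The struct, its fields, and the function are all made up for
illustration; the point is that one address computation serves every field.

```c
/* Hypothetical globals merged into one struct. */
struct merged_globals { int counter; int limit; short flags; };
static struct merged_globals G = { 0, 10, 0 };

int bump(void) {
    struct merged_globals *g = &G;   /* single PIC address computation */
    if (g->counter < g->limit)       /* fields are offsets from g */
        g->counter++;
    g->flags |= 1;
    return g->counter;
}
```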

===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for refining estimate instructions to the
correct accuracy, and implement divide as multiply by reciprocal when the
reciprocal has more than one use. Itanium will want this too.

===-------------------------------------------------------------------------===

#define ARRAY_LENGTH 16

union bitfield {
  struct {
#ifndef __ppc__
    unsigned int field0 : 6;
    unsigned int field1 : 6;
    unsigned int field2 : 6;
    unsigned int field3 : 6;
    unsigned int field4 : 3;
    unsigned int field5 : 4;
    unsigned int field6 : 1;
#else
    unsigned int field6 : 1;
    unsigned int field5 : 4;
    unsigned int field4 : 3;
    unsigned int field3 : 6;
    unsigned int field2 : 6;
    unsigned int field1 : 6;
    unsigned int field0 : 6;
#endif
  } bitfields, bits;
  unsigned int u32All;
  signed int i32All;
  float f32All;
};


typedef struct program_t {
  union bitfield array[ARRAY_LENGTH];
  int size;
  int loaded;
} program;


void AdjustBitfields(program* prog, unsigned int fmt1)
{
  unsigned int shift = 0;
  unsigned int texCount = 0;
  unsigned int i;

  for (i = 0; i < 8; i++)
  {
    prog->array[i].bitfields.field0 = texCount;
    prog->array[i].bitfields.field1 = texCount + 1;
    prog->array[i].bitfields.field2 = texCount + 2;
    prog->array[i].bitfields.field3 = texCount + 3;

    texCount += (fmt1 >> shift) & 0x7;
    shift += 3;
  }
}

In the loop above, the bitfield adds get generated as
(add (shl bitfield, C1), (shl C2, C1)) where C2 is 1, 2 or 3.

Since the input to the (or and, and) is an (add) rather than a (shl), the shift
doesn't get folded into the rlwimi instruction. We should ideally see through
things like this, rather than forcing llvm to generate the equivalent

(shl (add bitfield, C2), C1) with some kind of mask.
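
The two DAG forms compute the same value in 32-bit modular arithmetic, which
is why seeing through the add would be safe. A small C sketch of the
equivalence, with hypothetical function names and assuming c1 < 32:

```c
#include <stdint.h>

/* Form currently in the DAG: (add (shl x, C1), (shl C2, C1)). */
uint32_t add_of_shifts(uint32_t x, unsigned c1, uint32_t c2) {
    return (x << c1) + (c2 << c1);
}

/* Form whose shift can fold into rlwimi: (shl (add x, C2), C1). */
uint32_t shift_of_add(uint32_t x, unsigned c1, uint32_t c2) {
    return (x + c2) << c1;
}
```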

===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15 ; <int> [#uses=1]
        %tmp.3 = and int %b, 240 ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1 ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy. We currently generate:

_f1:
        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31
        or r3, r2, r2
        blr

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction. The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two addr instr.

===-------------------------------------------------------------------------===

Compile offsets from allocas:

int *%test() {
        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1
        ret int* %Y
}

into a single add, not two:

_test:
        addi r2, r1, -8
        addi r3, r2, 4
        blr

--> important for C++.

===-------------------------------------------------------------------------===

int test3(int a, int b) { return (a < 0) ? a : 0; }

should be branch free code. LLVM is turning it into < 1 because of the RHS.
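
One branch-free formulation, sketched in C under the assumption of two's
complement integers: build an all-ones mask when a is negative and AND it
with a. The function name is hypothetical; on PowerPC the mask could
presumably come from an arithmetic shift right by 31.

```c
/* Branch-free form of (a < 0) ? a : 0. */
int test3_branchless(int a) {
    int mask = -(a < 0);   /* -1 (all ones) when a is negative, else 0 */
    return a & mask;       /* keeps a when negative, yields 0 otherwise */
}
```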

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...) bar();

have to go through an indirect stub if bar is external or linkonce. It would
be better to compile it as:

  fp = &bar;
  for (...) fp();

which only computes the address of bar once (instead of each time through the
stub). This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.
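
A compilable version of the hoisted form, as a sketch only. Here bar is
defined locally so the example is self-contained; in the real case it would
be external or linkonce. The counter and helper names are hypothetical.

```c
static int calls;              /* counts invocations, for the example only */

void bar(void) { calls++; }    /* stands in for an external function */

void call_n_times(int n) {
    void (*fp)(void) = &bar;   /* address computed once, before the loop */
    for (int i = 0; i < n; i++)
        fp();                  /* indirect call through the hoisted pointer */
}

int call_count(void) { return calls; }
```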

===-------------------------------------------------------------------------===

PowerPC i1/setcc stuff (depends on subreg stuff):

Check out the PPC code we get for 'compare' in this testcase:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19672

Oof. On top of not doing the logical crnand instead of (mfcr, mfcr,
invert, invert, or), we then have to compare it against zero instead of
using the value already in a CR!

That should be something like:
        cmpw cr7, r8, r5
        cmpw cr0, r7, r3
        crnand cr0, cr0, cr7
        bne cr0, LBB_compare_4

instead of:
        cmpw cr7, r8, r5
        cmpw cr0, r7, r3
        mfcr r7, 1
        mcrf cr7, cr0
        mfcr r8, 1
        rlwinm r7, r7, 30, 31, 31
        rlwinm r8, r8, 30, 31, 31
        xori r7, r7, 1
        xori r8, r8, 1
        addi r2, r2, 1
        or r7, r8, r7
        cmpwi cr0, r7, 0
        bne cr0, LBB_compare_4 ; loopexit

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10. That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5. The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad. Add something like a BIT_CONVERT to LLVM, then do an interprocedural
transformation that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Generate lwbrx and other byteswapping load/store instructions when reasonable.
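
A typical source pattern that could map onto lwbrx, sketched in C with a
hypothetical helper name: a 32-bit little-endian load assembled from bytes.
On big-endian PowerPC this is a load plus byteswap, which lwbrx performs in
one instruction.

```c
#include <stdint.h>

/* Reads a 32-bit little-endian value from memory on any host. */
uint32_t load_le32(const unsigned char *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```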

===-------------------------------------------------------------------------===

Implement TargetConstantVec, and set up PPC to custom lower ConstantVec into
TargetConstantVec's if it's one of the many forms that are algorithmically
computable using the spiffy altivec instructions.

===-------------------------------------------------------------------------===

Compile this:

double %test(double %X) {
        %Y = cast double %X to long
        %Z = cast long %Y to double
        ret double %Z
}

to this:

_test:
        fctidz f0, f1
        stfd f0, -8(r1)
        lwz r2, -4(r1)
        lwz r3, -8(r1)
        stw r2, -12(r1)
        stw r3, -16(r1)
        lfd f0, -16(r1)
        fcfid f1, f0
        blr

without the lwz/stw's.

===-------------------------------------------------------------------------===

Compile this:

int foo(int a) {
  int b = (a < 8);
  if (b) {
    return b * 3; // ignore the fact that this is always 3.
  } else {
    return 2;
  }
}

into something other than this:

_foo:
1)      cmpwi cr7, r3, 8
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
1)      cmpwi cr0, r3, 7
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
LBB1_1: ; then
        rlwinm r2, r2, 0, 31, 31
        mulli r3, r2, 3
        blr
LBB1_2: ; UnifiedReturnBlock
        li r3, 2
        blr

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists. In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

The legalizer should lower this:

bool %test(ulong %x) {
        %tmp = setlt ulong %x, 4294967296
        ret bool %tmp
}

into "if x.high == 0", not:

_test:
        addi r2, r3, -1
        cntlzw r2, r2
        cntlzw r3, r3
        srwi r2, r2, 5
        srwi r4, r3, 5
        li r3, 0
        cmpwi cr0, r2, 0
        bne cr0, LBB1_2 ;
LBB1_1:
        or r3, r4, r4
LBB1_2:
        blr

Noticed in 2005-05-11-Popcount-ffs-fls.c.
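
The desired lowering rests on x < 2^32 being true exactly when the high word
of x is zero. A small C sketch of that check, with a hypothetical function
name:

```c
#include <stdint.h>

/* Equivalent to (x < 4294967296) for an unsigned 64-bit x. */
int below_2_32(uint64_t x) {
    uint32_t high = (uint32_t)(x >> 32);   /* x.high */
    return high == 0;
}
```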

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it. That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops. A simple example:

int foo(int a, int b) { return (a < b) << 4; }

compiles into:

_foo:
        cmpw cr7, r3, r4
        mfcr r2, 1
        rlwinm r2, r2, 29, 31, 31
        slwi r3, r2, 4
        blr

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses. For this
code:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

the current output:

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

should become:

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {
  signed char t = 0;
  if (b) t = *a;
  if (c) *a = t;
}