Blame - llvm/lib/Target/PowerPC/README.txt - toolchain/llvm-project

Chris Lattner	22ec3e7	2006-03-27 07:04:16 +0000	[diff] [blame]	1	//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//
				2
Nate Begeman	63be70d	2004-08-10 20:42:36 +0000	[diff] [blame]	3	TODO:
Nate Begeman	3090b0f	2008-02-11 04:16:09 +0000	[diff] [blame]	4	* lmw/stmw pass a la arm load store optimizer for prolog/epilog
Nate Begeman	9aea6e4	2005-12-24 01:00:15 +0000	[diff] [blame]	5
Nate Begeman	fc567d8	2006-02-03 05:17:06 +0000	[diff] [blame]	6	===-------------------------------------------------------------------------===
Nate Begeman	9aea6e4	2005-12-24 01:00:15 +0000	[diff] [blame]	7
Chris Lattner	2e040eb	2010-09-19 00:34:58 +0000	[diff] [blame]	8	This code:
				9
				10	unsigned add32carry(unsigned sum, unsigned x) {
				11	unsigned z = sum + x;
				12	if (sum + x < x)
				13	z++;
				14	return z;
				15	}
				16
				17	Should compile to something like:
				18
				19	addc r3,r3,r4
				20	addze r3,r3
				21
				22	instead we get:
				23
				24	add r3, r4, r3
				25	cmplw cr7, r3, r4
				26	mfcr r4 ; 1
				27	rlwinm r4, r4, 29, 31, 31
				28	add r3, r3, r4
				29
				30	Ick.
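
As a rough C model of what the two-instruction sequence computes (a sketch only:
the helper names are made up here, and fixed-width uint32_t stands in for the
32-bit registers):

```c
#include <assert.h>
#include <stdint.h>

/* Models "addc r3,r3,r4": a wrapping 32-bit add that also reports carry-out. */
static uint32_t add_with_carry_out(uint32_t a, uint32_t b, uint32_t *carry) {
    uint32_t sum = a + b;   /* wraps modulo 2^32 */
    *carry = sum < b;       /* 1 exactly when the add overflowed */
    return sum;
}

/* Same result as add32carry, with the carry made explicit. */
uint32_t add32carry_explicit(uint32_t sum, uint32_t x) {
    uint32_t carry;
    uint32_t z = add_with_carry_out(sum, x, &carry);
    return z + carry;       /* models "addze r3,r3": add the carry bit back in */
}
```

The compare in the source is just a carry test, which is why no cmplw/mfcr
should be needed.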
Chris Lattner	9e4e45a	2010-01-07 17:53:10 +0000	[diff] [blame]	31
				32	===-------------------------------------------------------------------------===
				33
Chris Lattner	be7033b	2006-11-07 18:30:21 +0000	[diff] [blame]	34	We compile the hottest inner loop of viterbi to:
				35
				36	li r6, 0
				37	b LBB1_84 ;bb432.i
				38	LBB1_83: ;bb420.i
				39	lbzx r8, r5, r7
				40	addi r6, r7, 1
				41	stbx r8, r4, r7
				42	LBB1_84: ;bb432.i
				43	mr r7, r6
				44	cmplwi cr0, r7, 143
				45	bne cr0, LBB1_83 ;bb420.i
				46
				47	The CBE manages to produce:
				48
				49	li r0, 143
				50	mtctr r0
				51	loop:
				52	lbzx r2, r2, r11
				53	stbx r0, r2, r9
				54	addi r2, r2, 1
				55	bdz later
				56	b loop
				57
				58	This could be much better (bdnz instead of bdz) but it still beats us. If we
				59	produced this with bdnz, the loop would be a single dispatch group.
				60
				61	===-------------------------------------------------------------------------===
				62
Chris Lattner	1e98a33	2005-08-24 18:15:24 +0000	[diff] [blame]	63	Lump the constant pool for each function into ONE pic object, and reference
				64	pieces of it as offsets from the start. For functions like this (contrived
				65	to have lots of constants obviously):
				66
				67	double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }
				68
				69	We generate:
				70
				71	_X:
				72	lis r2, ha16(.CPI_X_0)
				73	lfd f0, lo16(.CPI_X_0)(r2)
				74	lis r2, ha16(.CPI_X_1)
				75	lfd f2, lo16(.CPI_X_1)(r2)
				76	fmadd f0, f1, f0, f2
				77	lis r2, ha16(.CPI_X_2)
				78	lfd f1, lo16(.CPI_X_2)(r2)
				79	lis r2, ha16(.CPI_X_3)
				80	lfd f2, lo16(.CPI_X_3)(r2)
				81	fmadd f1, f0, f1, f2
				82	blr
				83
				84	It would be better to materialize .CPI_X into a register, then use immediates
				85	off of the register to avoid the lis's. This is even more important in PIC
				86	mode.
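
At the source level the idea looks roughly like this. It is only a sketch:
X_pool and X_pooled are hypothetical names, and the real change would be made
by the code generator when it lays out the constant pool.

```c
#include <assert.h>

/* Hypothetical merged constant pool for X(): one object, four entries. */
static const double X_pool[4] = { 1.23, 4.512, 2.34, 14.38 };

double X_pooled(double Y) {
    const double *cp = X_pool;   /* base address materialized once */
    /* Each constant is then a fixed offset (0, 8, 16, 24) from that base. */
    return (Y * cp[0] + cp[1]) * cp[2] + cp[3];
}
```

With one base register, each lfd can use an immediate displacement and the
repeated lis instructions go away.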
				87
Chris Lattner	9b178ce	2006-02-02 23:50:22 +0000	[diff] [blame]	88	Note that this (and the static variable version) is discussed here for GCC:
				89	http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
				90
Chris Lattner	92c6a65	2007-08-23 15:16:03 +0000	[diff] [blame]	91	Here's another example (the sgn function):
				92	double testf(double a) {
				93	return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
				94	}
				95
				96	it produces a BB like this:
				97	LBB1_1: ; cond_true
				98	lis r2, ha16(LCPI1_0)
				99	lfs f0, lo16(LCPI1_0)(r2)
				100	lis r2, ha16(LCPI1_1)
				101	lis r3, ha16(LCPI1_2)
				102	lfs f2, lo16(LCPI1_2)(r3)
				103	lfs f3, lo16(LCPI1_1)(r2)
				104	fsub f0, f0, f1
				105	fsel f1, f0, f2, f3
				106	blr
				107
Chris Lattner	1e98a33	2005-08-24 18:15:24 +0000	[diff] [blame]	108	===-------------------------------------------------------------------------===
Nate Begeman	e9e2c6d	2005-09-06 15:30:48 +0000	[diff] [blame]	109
Chris Lattner	a23b04a	2006-02-03 06:22:11 +0000	[diff] [blame]	110	PIC Code Gen IPO optimization:
				111
				112	Squish small scalar globals together into a single global struct, allowing the
				113	address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
				114	of the GOT on targets with one).
				115
				116	Note that this is discussed here for GCC:
				117	http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
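
A hand-written version of what the transformation would produce (sketch only;
the struct and field names are invented for illustration):

```c
#include <assert.h>

/* Hypothetical: three small scalar globals merged into one struct. */
static struct {
    int   counter;
    short flags;
    char  mode;
} merged_globals;

int touch_all(void) {
    /* One base address covers all three accesses; each field is a fixed
       offset from it, so the address computation can be shared. */
    merged_globals.counter += 1;
    merged_globals.flags |= 4;
    merged_globals.mode = 'a';
    return merged_globals.counter + merged_globals.flags + merged_globals.mode;
}
```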
				118
				119	===-------------------------------------------------------------------------===
				120
Dale Johannesen	4e6044c	2009-07-01 23:36:02 +0000	[diff] [blame]	121	Darwin Stub removal:
				122
				123	We still generate calls to foo$stub, and stubs, on Darwin. This is not
Chris Lattner	4ec83ea	2009-07-02 01:24:34 +0000	[diff] [blame]	124	necessary when building with the Leopard (10.5) or later linker, as stubs are
				125	generated by ld when necessary. Parameterizing this based on the deployment
				126	target (-mmacosx-version-min) is probably enough. x86-32 does this right, see
				127	its logic.
Dale Johannesen	4e6044c	2009-07-01 23:36:02 +0000	[diff] [blame]	128
				129	===-------------------------------------------------------------------------===
				130
Chris Lattner	7c76290	2006-01-16 17:58:54 +0000	[diff] [blame]	131	Darwin Stub LICM optimization:
				132
				133	Loops like this:
				134
				135	for (...) bar();
				136
				137	Have to go through an indirect stub if bar is external or linkonce. It would
				138	be better to compile it as:
				139
				140	fp = &bar;
				141	for (...) fp();
				142
				143	which only computes the address of bar once (instead of each time through the
				144	stub). This is Darwin specific and would have to be done in the code generator.
				145	Probably not a win on x86.
				146
				147	===-------------------------------------------------------------------------===
				148
Chris Lattner	7c76290	2006-01-16 17:58:54 +0000	[diff] [blame]	149	Simple IPO for argument passing, change:
				150	void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)
				151
				152	the Darwin ABI specifies that any integer arguments in the first 32 bytes worth
				153	of arguments get assigned to r3 through r10. That is, if you have a function
				154	foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
				155	argument bytes for r4 and r5. The trick then would be to shuffle the argument
				156	order for functions we can internalize so that the maximum number of
				157	integers/pointers get passed in regs before you see any of the fp arguments.
				158
				159	Instead of implementing this, it would actually probably be easier to just
				160	implement a PPC fastcc, where we could do whatever we wanted to the CC,
				161	including having this work sanely.
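
A small model of the register assignment rule described above, to make the
effect of reordering concrete. This is a sketch under stated assumptions: it
only handles int and double arguments, it assumes a double takes 8 bytes of
the argument area with no extra padding (which matches the r3, f1, r6 example),
and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum arg_kind { ARG_INT, ARG_DOUBLE };

/* foo(int, double, int) and the reordered foo(int, int, double). */
static const enum arg_kind foo_before[3] = { ARG_INT, ARG_DOUBLE, ARG_INT };
static const enum arg_kind foo_after[3]  = { ARG_INT, ARG_INT, ARG_DOUBLE };

/*
 * Returns the GPR number (3..10) that holds integer argument 'index', or -1
 * if that argument is a double or falls outside the first 32 bytes.
 */
int gpr_for_arg(const enum arg_kind *args, size_t n, size_t index) {
    assert(index < n);
    int offset = 0;   /* byte offset of the argument in the argument area */
    for (size_t i = 0; i < index; i++)
        offset += (args[i] == ARG_DOUBLE) ? 8 : 4;   /* a double shadows two GPRs */
    if (args[index] != ARG_INT || offset >= 32)
        return -1;
    return 3 + offset / 4;
}
```

In the original order the last int lands in r6; after the reorder the two ints
take r3 and r4, leaving r5 through r10 free.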
				162
				163	===-------------------------------------------------------------------------===
				164
				165	Fix Darwin FP-In-Integer Registers ABI
				166
				167	Darwin passes doubles in structures in integer registers, which is very very
Wesley Peck	527da1b	2010-11-23 03:31:01 +0000	[diff] [blame]	168	bad. Add something like a BITCAST to LLVM, then do an i-p transformation that
				169	percolates these things out of functions.
Chris Lattner	7c76290	2006-01-16 17:58:54 +0000	[diff] [blame]	170
				171	Check out how horrible this is:
				172	http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html
				173
				174	This is an extension of "interprocedural CC unmunging" that can't be done with
				175	just fastcc.
				176
				177	===-------------------------------------------------------------------------===
				178
Nate Begeman	fc567d8	2006-02-03 05:17:06 +0000	[diff] [blame]	179	Fold add and sub with constant into non-extern, non-weak addresses so this:
				180
				181	static int a;
				182	void bar(int b) { a = b; }
				183	void foo(unsigned char *c) {
				184	*c = a;
				185	}
				186
				187	So that
				188
				189	_foo:
				190	lis r2, ha16(_a)
				191	la r2, lo16(_a)(r2)
				192	lbz r2, 3(r2)
				193	stb r2, 0(r3)
				194	blr
				195
				196	Becomes
				197
				198	_foo:
				199	lis r2, ha16(_a+3)
				200	lbz r2, lo16(_a+3)(r2)
				201	stb r2, 0(r3)
				202	blr
Chris Lattner	c0e48c6	2006-02-05 05:27:35 +0000	[diff] [blame]	203
				204	===-------------------------------------------------------------------------===
				205
Chris Lattner	a8dd636	2006-03-08 00:25:47 +0000	[diff] [blame]	206	We should compile these two functions to the same thing:
				207
				208	#include <stdlib.h>
				209	void f(int a, int b, int *P) {
				210	*P = (a-b)>=0?(a-b):(b-a);
				211	}
				212	void g(int a, int b, int *P) {
				213	*P = abs(a-b);
				214	}
				215
				216	Further, they should compile to something better than:
				217
				218	_g:
				219	subf r2, r4, r3
				220	subfic r3, r2, 0
				221	cmpwi cr0, r2, -1
				222	bgt cr0, LBB2_2 ; entry
				223	LBB2_1: ; entry
				224	mr r2, r3
				225	LBB2_2: ; entry
				226	stw r2, 0(r5)
				227	blr
				228
				229	GCC produces:
				230
				231	_g:
				232	subf r4,r4,r3
				233	srawi r2,r4,31
				234	xor r0,r2,r4
				235	subf r0,r2,r0
				236	stw r0,0(r5)
				237	blr
				238
				239	... which is much nicer.
				240
				241	This theoretically may help improve twolf slightly (used in dimbox.c:142?).
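
For reference, the GCC sequence corresponds to this branchless form in C. This
is a sketch: the function name is made up, and unsigned arithmetic is used so
the shift and wraparound are well defined.

```c
#include <assert.h>
#include <stdint.h>

int32_t abs_diff_branchless(int32_t a, int32_t b) {
    uint32_t d = (uint32_t)a - (uint32_t)b;   /* subf: a - b, wrapping */
    uint32_t m = 0u - (d >> 31);              /* srawi d,31: all ones if d < 0, else 0 */
    return (int32_t)((d ^ m) - m);            /* xor, then subf: conditional negate */
}
```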
				242
				243	===-------------------------------------------------------------------------===
				244
Chris Lattner	e0359b4	2010-01-24 02:27:03 +0000	[diff] [blame]	245	PR5945: This:
				246	define i32 @clamp0g(i32 %a) {
				247	entry:
				248	%cmp = icmp slt i32 %a, 0
				249	%sel = select i1 %cmp, i32 0, i32 %a
				250	ret i32 %sel
				251	}
				252
				253	Is compiled to this with the PowerPC (32-bit) backend:
				254
				255	_clamp0g:
				256	cmpwi cr0, r3, 0
				257	li r2, 0
				258	blt cr0, LBB1_2
				259	; BB#1: ; %entry
				260	mr r2, r3
				261	LBB1_2: ; %entry
				262	mr r3, r2
				263	blr
				264
				265	This could be reduced to the much simpler:
				266
				267	_clamp0g:
				268	srawi r2, r3, 31
				269	andc r3, r3, r2
				270	blr
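
The same trick in C, as a sketch (hypothetical function name; the sign mask is
built with unsigned arithmetic to stay portable):

```c
#include <assert.h>
#include <stdint.h>

int32_t clamp0_branchless(int32_t a) {
    uint32_t u = (uint32_t)a;
    uint32_t m = 0u - (u >> 31);   /* srawi r2,r3,31: all ones if a < 0, else 0 */
    return (int32_t)(u & ~m);      /* andc r3,r3,r2: clears everything when a < 0 */
}
```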
				271
				272	===-------------------------------------------------------------------------===
				273
Nate Begeman	32e73f9	2006-03-16 18:50:44 +0000	[diff] [blame]	274	int foo(int N, int ***W, int **TK, int X) {
				275	int t, i;
				276
				277	for (t = 0; t < N; ++t)
				278	for (i = 0; i < 4; ++i)
				279	W[t / X][i][t % X] = TK[i][t];
				280
				281	return 5;
				282	}
				283
Chris Lattner	325bb46	2006-03-16 22:25:55 +0000	[diff] [blame]	284	We generate relatively atrocious code for this loop compared to gcc.
				285
Chris Lattner	f194834	2006-03-21 00:47:09 +0000	[diff] [blame]	286	We could also strength reduce the rem and the div:
				287	http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
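
One common way to strength reduce t / X and t % X in a loop like this is to
carry both values in counters. The sketch below (hypothetical name, not
necessarily the method of the paper above) maintains the counters and checks
them against the real operators:

```c
#include <assert.h>

/*
 * Walks t = 0..N-1 keeping t / X and t % X in two counters, and returns how
 * many iterations disagree with the real '/' and '%'. Requires X > 0.
 */
int div_rem_mismatches(int N, int X) {
    assert(X > 0);
    int quot = 0, rem = 0, bad = 0;
    for (int t = 0; t < N; ++t) {
        if (quot != t / X || rem != t % X)
            ++bad;
        if (++rem == X) {   /* remainder wraps, quotient steps */
            rem = 0;
            ++quot;
        }
    }
    return bad;
}
```

In the loop above, quot and rem would replace t / X and t % X as subscripts.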
				288
Chris Lattner	ea64687	2006-03-19 05:33:30 +0000	[diff] [blame]	289	===-------------------------------------------------------------------------===
Chris Lattner	325bb46	2006-03-16 22:25:55 +0000	[diff] [blame]	290
Chris Lattner	9f9b611	2006-03-24 20:04:27 +0000	[diff] [blame]	291	We generate ugly code for this:
				292
				293	void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
				294	unsigned code = 0;
				295	if(dx < -dw) code |= 1;
				296	if(dx > dw) code |= 2;
				297	if(dy < -dw) code |= 4;
				298	if(dy > dw) code |= 8;
				299	if(dz < -dw) code |= 16;
				300	if(dz > dw) code |= 32;
				301	*ret = code;
				302	}
				303
Chris Lattner	5d70a7c	2006-03-25 06:47:10 +0000	[diff] [blame]	304	===-------------------------------------------------------------------------===
				305
Nate Begeman	17f2500	2007-01-29 21:21:22 +0000	[diff] [blame]	306	%struct.B = type { i8, [3 x i8] }
Nate Begeman	ce6646c	2006-05-08 20:54:02 +0000	[diff] [blame]	307
Nate Begeman	17f2500	2007-01-29 21:21:22 +0000	[diff] [blame]	308	define void @bar(%struct.B* %b) {
Nate Begeman	ce6646c	2006-05-08 20:54:02 +0000	[diff] [blame]	309	entry:
Nate Begeman	17f2500	2007-01-29 21:21:22 +0000	[diff] [blame]	310	%tmp = bitcast %struct.B* %b to i32* ; <uint*> [#uses=1]
				311	%tmp = load i32* %tmp ; <uint> [#uses=1]
				312	%tmp3 = bitcast %struct.B* %b to i32* ; <uint*> [#uses=1]
				313	%tmp4 = load i32* %tmp3 ; <uint> [#uses=1]
				314	%tmp8 = bitcast %struct.B* %b to i32* ; <uint*> [#uses=2]
				315	%tmp9 = load i32* %tmp8 ; <uint> [#uses=1]
				316	%tmp4.mask17 = shl i32 %tmp4, i8 1 ; <uint> [#uses=1]
				317	%tmp1415 = and i32 %tmp4.mask17, 2147483648 ; <uint> [#uses=1]
				318	%tmp.masked = and i32 %tmp, 2147483648 ; <uint> [#uses=1]
				319	%tmp11 = or i32 %tmp1415, %tmp.masked ; <uint> [#uses=1]
				320	%tmp12 = and i32 %tmp9, 2147483647 ; <uint> [#uses=1]
				321	%tmp13 = or i32 %tmp12, %tmp11 ; <uint> [#uses=1]
				322	store i32 %tmp13, i32* %tmp8
Chris Lattner	304bbf3	2006-05-05 05:36:15 +0000	[diff] [blame]	323	ret void
				324	}
				325
				326	We emit:
				327
				328	_foo:
				329	lwz r2, 0(r3)
Nate Begeman	ce6646c	2006-05-08 20:54:02 +0000	[diff] [blame]	330	slwi r4, r2, 1
				331	or r4, r4, r2
				332	rlwimi r2, r4, 0, 0, 0
Nate Begeman	9b6d4c2	2006-05-08 17:38:32 +0000	[diff] [blame]	333	stw r2, 0(r3)
Chris Lattner	304bbf3	2006-05-05 05:36:15 +0000	[diff] [blame]	334	blr
				335
Nate Begeman	ce6646c	2006-05-08 20:54:02 +0000	[diff] [blame]	336	We could collapse a bunch of those ORs and ANDs and generate the following
				337	equivalent code:
Chris Lattner	304bbf3	2006-05-05 05:36:15 +0000	[diff] [blame]	338
Nate Begeman	9b6d4c2	2006-05-08 17:38:32 +0000	[diff] [blame]	339	_foo:
				340	lwz r2, 0(r3)
Nate Begeman	0eb8f2e	2006-05-08 19:09:24 +0000	[diff] [blame]	341	rlwinm r4, r2, 1, 0, 0
Nate Begeman	9b6d4c2	2006-05-08 17:38:32 +0000	[diff] [blame]	342	or r2, r2, r4
				343	stw r2, 0(r3)
				344	blr
Chris Lattner	950dffa	2006-07-14 04:07:29 +0000	[diff] [blame]	345
				346	===-------------------------------------------------------------------------===
				347
Chris Lattner	889d934	2007-01-18 07:34:57 +0000	[diff] [blame]	348	Consider a function like this:
				349
				350	float foo(float X) { return X + 1234.4123f; }
				351
				352	The FP constant ends up in the constant pool, so we need to get the LR register.
				353	This ends up producing code like this:
				354
				355	_foo:
				356	.LBB_foo_0: ; entry
				357	mflr r11
				358	*** stw r11, 8(r1)
				359	bl "L00000$pb"
				360	"L00000$pb":
				361	mflr r2
				362	addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
				363	lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
				364	fadds f1, f1, f0
				365	*** lwz r11, 8(r1)
				366	mtlr r11
				367	blr
				368
				369	This is functional, but there is no reason to spill the LR register all the way
				370	to the stack (the two marked instrs): spilling it to a GPR is quite enough.
				371
				372	Implementing this will require some codegen improvements. Nate writes:
				373
				374	"So basically what we need to support the "no stack frame save and restore" is a
				375	generalization of the LR optimization to "callee-save regs".
				376
				377	Currently, we have LR marked as a callee-save reg. The register allocator sees
				378	that it's callee save, and spills it directly to the stack.
				379
				380	Ideally, something like this would happen:
				381
				382	LR would be in a separate register class from the GPRs. The class of LR would be
				383	marked "unspillable". When the register allocator came across an unspillable
				384	reg, it would ask "what is the best class to copy this into that I can spill?"
				385	If it gets a class back, which it will in this case (the gprs), it grabs a free
				386	register of that class. If it is then later necessary to spill that reg, so be
				387	it."
				388
				389	===-------------------------------------------------------------------------===
Chris Lattner	37ebf93	2007-01-31 19:49:20 +0000	[diff] [blame]	390
				391	We compile this:
				392	int test(_Bool X) {
				393	return X ? 524288 : 0;
				394	}
				395
				396	to:
				397	_test:
				398	cmplwi cr0, r3, 0
				399	lis r2, 8
				400	li r3, 0
				401	beq cr0, LBB1_2 ;entry
				402	LBB1_1: ;entry
				403	mr r3, r2
				404	LBB1_2: ;entry
				405	blr
				406
				407	instead of:
				408	_test:
				409	addic r2,r3,-1
				410	subfe r0,r2,r3
				411	slwi r3,r0,19
				412	blr
				413
				414	This sort of thing occurs a lot due to globalopt.
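
A C model of why the three-instruction version works (sketch only; the function
name is made up and the carry bit CA is tracked by hand):

```c
#include <assert.h>
#include <stdint.h>

uint32_t select_pow2(uint32_t x) {
    assert(x <= 1u);               /* x comes from a _Bool */
    uint32_t t  = x - 1u;          /* addic r2,r3,-1 */
    uint32_t ca = (x != 0u);       /* carry out of x + 0xFFFFFFFF */
    uint32_t nz = ~t + x + ca;     /* subfe r0,r2,r3: ~r2 + r3 + CA, i.e. x != 0 */
    return nz << 19;               /* slwi r3,r0,19: 1 << 19 == 524288 */
}
```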
				415
				416	===-------------------------------------------------------------------------===
Chris Lattner	c9088b4	2007-02-09 17:38:01 +0000	[diff] [blame]	417
Chris Lattner	97331ae	2010-01-23 18:42:37 +0000	[diff] [blame]	418	We compile:
				419
				420	define i32 @bar(i32 %x) nounwind readnone ssp {
				421	entry:
				422	%0 = icmp eq i32 %x, 0 ; <i1> [#uses=1]
Chris Lattner	1b35bbe	2010-01-24 00:09:49 +0000	[diff] [blame]	423	%neg = sext i1 %0 to i32 ; <i32> [#uses=1]
Chris Lattner	97331ae	2010-01-23 18:42:37 +0000	[diff] [blame]	424	ret i32 %neg
				425	}
				426
				427	to:
				428
				429	_bar:
Chris Lattner	1b35bbe	2010-01-24 00:09:49 +0000	[diff] [blame]	430	cntlzw r2, r3
				431	slwi r2, r2, 26
				432	srawi r3, r2, 31
Chris Lattner	97331ae	2010-01-23 18:42:37 +0000	[diff] [blame]	433	blr
				434
Chris Lattner	1b35bbe	2010-01-24 00:09:49 +0000	[diff] [blame]	435	it would be better to produce:
Chris Lattner	97331ae	2010-01-23 18:42:37 +0000	[diff] [blame]	436
				437	_bar:
				438	addic r3,r3,-1
				439	subfe r3,r3,r3
				440	blr
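
The addic/subfe pair can be modeled in C like this (a sketch with a
hypothetical name; CA is the carry bit set by addic):

```c
#include <assert.h>
#include <stdint.h>

uint32_t sext_is_zero(uint32_t x) {
    uint32_t t  = x - 1u;      /* addic r3,r3,-1 */
    uint32_t ca = (x != 0u);   /* carry out of x + 0xFFFFFFFF */
    return ~t + t + ca;        /* subfe r3,r3,r3: 0xFFFFFFFF + CA */
}
```

So the result is all ones when x is zero and zero otherwise, which is the
sign-extended compare.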
				441
				442	===-------------------------------------------------------------------------===
				443
Chris Lattner	075b4db	2007-03-31 07:06:25 +0000	[diff] [blame]	444	We generate horrible ppc code for this:
				445
				446	#define N 2000000
				447	double a[N],c[N];
				448	void simpleloop() {
				449	int j;
				450	for (j=0; j<N; j++)
				451	c[j] = a[j];
				452	}
				453
				454	LBB1_1: ;bb
				455	lfdx f0, r3, r4
				456	addi r5, r5, 1 ;; Extra IV for the exit value compare.
				457	stfdx f0, r2, r4
				458	addi r4, r4, 8
				459
				460	xoris r6, r5, 30 ;; This is due to a large immediate.
				461	cmplwi cr0, r6, 33920
				462	bne cr0, LBB1_1
				463
Chris Lattner	6777b72	2007-09-10 21:43:18 +0000	[diff] [blame]	464	//===---------------------------------------------------------------------===//
				465
				466	This:
				467	#include <algorithm>
				468	inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
				469	{ return std::make_pair(a + b, a + b < a); }
				470	bool no_overflow(unsigned a, unsigned b)
				471	{ return !full_add(a, b).second; }
				472
				473	Should compile to:
				474
				475	__Z11no_overflowjj:
				476	add r4,r3,r4
				477	subfc r3,r3,r4
				478	li r3,0
				479	adde r3,r3,r3
				480	blr
				481
				482	(or better) not:
				483
				484	__Z11no_overflowjj:
				485	add r2, r4, r3
				486	cmplw cr7, r2, r3
				487	mfcr r2
				488	rlwinm r2, r2, 29, 31, 31
				489	xori r3, r2, 1
				490	blr
				491
				492	//===---------------------------------------------------------------------===//
Chris Lattner	075b4db	2007-03-31 07:06:25 +0000	[diff] [blame]	493
Chris Lattner	89f36e6	2008-01-08 06:46:30 +0000	[diff] [blame]	494	We compile some FP comparisons into an mfcr with two rlwinms and an or. For
				495	example:
				496	#include <math.h>
				497	int test(double x, double y) { return islessequal(x, y);}
				498	int test2(double x, double y) { return islessgreater(x, y);}
				499	int test3(double x, double y) { return !islessequal(x, y);}
				500
				501	Compiles into (all three are similar, but the bits differ):
				502
				503	_test:
				504	fcmpu cr7, f1, f2
				505	mfcr r2
				506	rlwinm r3, r2, 29, 31, 31
				507	rlwinm r2, r2, 31, 31, 31
				508	or r3, r2, r3
				509	blr
				510
				511	GCC compiles this into:
				512
				513	_test:
				514	fcmpu cr7,f1,f2
				515	cror 30,28,30
				516	mfcr r3
				517	rlwinm r3,r3,31,1
				518	blr
				519
				520	which is more efficient and can use mfocr. See PR642 for some more context.
				521
				522	//===---------------------------------------------------------------------===//
Chris Lattner	6b0a189	2008-03-02 19:27:34 +0000	[diff] [blame]	523
				524	void foo(float *data, float d) {
				525	long i;
				526	for (i = 0; i < 8000; i++)
				527	data[i] = d;
				528	}
				529	void foo2(float *data, float d) {
				530	long i;
				531	data--;
				532	for (i = 0; i < 8000; i++) {
				533	data[1] = d;
				534	data++;
				535	}
				536	}
				537
				538	These compile to:
				539
				540	_foo:
				541	li r2, 0
				542	LBB1_1: ; bb
				543	addi r4, r2, 4
				544	stfsx f1, r3, r2
				545	cmplwi cr0, r4, 32000
				546	mr r2, r4
				547	bne cr0, LBB1_1 ; bb
				548	blr
				549	_foo2:
				550	li r2, 0
				551	LBB2_1: ; bb
				552	addi r4, r2, 4
				553	stfsx f1, r3, r2
				554	cmplwi cr0, r4, 32000
				555	mr r2, r4
				556	bne cr0, LBB2_1 ; bb
				557	blr
				558
				559	The 'mr' could be eliminated by folding the add into the cmp better.
				560
				561	//===---------------------------------------------------------------------===//
Dale Johannesen	aae3a4f	2008-11-17 18:56:34 +0000	[diff] [blame]	562	Codegen for the following (low-probability) case deteriorated considerably
				563	when the correctness fixes for unordered comparisons went in (PR 642, 58871).
				564	It should be possible to recover the code quality described in the comments.
				565
				566	; RUN: llvm-as < %s | llc -march=ppc32 | grep or | count 3
				567	; This should produce one 'or' or 'cror' instruction per function.
				568
				569	; RUN: llvm-as < %s | llc -march=ppc32 | grep mfcr | count 3
				570	; PR2964
				571
				572	define i32 @test(double %x, double %y) nounwind {
				573	entry:
				574	%tmp3 = fcmp ole double %x, %y ; <i1> [#uses=1]
				575	%tmp345 = zext i1 %tmp3 to i32 ; <i32> [#uses=1]
				576	ret i32 %tmp345
				577	}
				578
				579	define i32 @test2(double %x, double %y) nounwind {
				580	entry:
				581	%tmp3 = fcmp one double %x, %y ; <i1> [#uses=1]
				582	%tmp345 = zext i1 %tmp3 to i32 ; <i32> [#uses=1]
				583	ret i32 %tmp345
				584	}
				585
				586	define i32 @test3(double %x, double %y) nounwind {
				587	entry:
				588	%tmp3 = fcmp ugt double %x, %y ; <i1> [#uses=1]
				589	%tmp34 = zext i1 %tmp3 to i32 ; <i32> [#uses=1]
				590	ret i32 %tmp34
				591	}
Ehsan Amiri	631ed04	2016-03-18 04:02:25 +0000	[diff] [blame]	592
				593	//===---------------------------------------------------------------------===//
				594	for the following code:
				595
				596	void foo (float *__restrict__ a, int *__restrict__ b, int n) {
				597	a[n] = b[n] * 2.321;
				598	}
				599
				600	we load b[n] into a GPR, then move it to a VSX register and convert it to
				601	float. We should use VSX scalar integer load instructions to avoid direct moves.
				602
Dale Johannesen	aae3a4f	2008-11-17 18:56:34 +0000	[diff] [blame]	603	//===----------------------------------------------------------------------===//
				604	; RUN: llvm-as < %s | llc -march=ppc32 | not grep fneg
				605
				606	; This could generate FSEL with appropriate flags (FSEL is not IEEE-safe, and
				607	; should not be generated except with -enable-finite-only-fp-math or the like).
				608	; With the correctness fixes for PR642 (58871) LowerSELECT_CC would need to
				609	; recognize a more elaborate tree than a simple SETxx.
				610
				611	define double @test_FNEG_sel(double %A, double %B, double %C) {
Dan Gohman	6f34abd	2010-03-02 01:11:08 +0000	[diff] [blame]	612	%D = fsub double -0.000000e+00, %A ; <double> [#uses=1]
Dale Johannesen	aae3a4f	2008-11-17 18:56:34 +0000	[diff] [blame]	613	%Cond = fcmp ugt double %D, -0.000000e+00 ; <i1> [#uses=1]
				614	%E = select i1 %Cond, double %B, double %C ; <double> [#uses=1]
				615	ret double %E
				616	}
				617
Dale Johannesen	626b79d	2010-02-12 23:16:24 +0000	[diff] [blame]	618	//===----------------------------------------------------------------------===//
				619	The save/restore sequence for CR in prolog/epilog is terrible:
				620	- Each CR subreg is saved individually, rather than doing one save as a unit.
				621	- On Darwin, the save is done after the decrement of SP, which means the offset
				622	from SP of the save slot can be too big for a store instruction, which means we
				623	need an additional register (currently hacked in 96015+96020; the solution there
				624	is correct, but poor).
				625	- On SVR4 the same thing can happen, and I don't think saving before the SP
				626	decrement is safe on that target, as there is no red zone. This is currently
				627	broken AFAIK, although it's not a target I can exercise.
				628	The following demonstrates the problem:
				629	extern void bar(char *p);
				630	void foo() {
				631	char x[100000];
				632	bar(x);
				633	__asm__("" ::: "cr2");
				634	}
Kit Barton	116e18b	2015-03-11 17:43:43 +0000	[diff] [blame]	635
Nemanja Ivanovic	c090479	2015-04-09 23:54:37 +0000	[diff] [blame]	636	//===-------------------------------------------------------------------------===
				637	Naming convention for instruction formats is very haphazard.
				638	We have agreed on a naming scheme as follows:
				639
				640	<INST_form>{_<OP_type><OP_len>}+
				641
				642	Where:
				643	INST_form is the instruction format (X-form, etc.)
				644	OP_type is the operand type - one of OPC (opcode), RD (register destination),
				645	RS (register source),
				646	RDp (destination register pair),
				647	RSp (source register pair), IM (immediate),
				648	XO (extended opcode)
				649	OP_len is the length of the operand in bits
				650
				651	VSX register operands would be of length 6 (split across two fields),
				652	condition register fields of length 3.
				653	We would not need to denote reserved fields in names of instruction formats.
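
To make the scheme concrete, here is a sketch that builds a name from a format
and a list of operand fields. The helper names and the example operand list
are hypothetical; no such name is claimed to exist in the backend.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct operand_field { const char *type; int bits; };

/* Writes "<INST_form>{_<OP_type><OP_len>}+" into out; returns 0 on success,
   -1 if the result does not fit in cap bytes. */
int format_name(char *out, size_t cap, const char *form,
                const struct operand_field *ops, size_t n) {
    assert(out != NULL && cap > 0);
    int used = snprintf(out, cap, "%s", form);
    if (used < 0 || (size_t)used >= cap)
        return -1;
    for (size_t i = 0; i < n; i++) {
        size_t left = cap - (size_t)used;
        int w = snprintf(out + used, left, "_%s%d", ops[i].type, ops[i].bits);
        if (w < 0 || (size_t)w >= left)
            return -1;
        used += w;
    }
    return 0;
}

/* Hypothetical X-form layout: opcode, one destination, two sources, XO. */
const char *example_xform_name(void) {
    static char buf[64];
    static const struct operand_field ops[] = {
        { "OPC", 6 }, { "RD", 5 }, { "RS", 5 }, { "RS", 5 }, { "XO", 10 }
    };
    return format_name(buf, sizeof buf, "X", ops, 5) == 0 ? buf : "";
}
```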
				654
Kit Barton	116e18b	2015-03-11 17:43:43 +0000	[diff] [blame]	655	//===----------------------------------------------------------------------===//
				656
				657	Instruction fusion was introduced in ISA 2.06 and more opportunities added in
				658	ISA 2.07. LLVM needs to add infrastructure to recognize fusion opportunities
				659	and force instruction pairs to be scheduled together.
				660