Blame - lib/Target/X86/README.txt - fp2-dev/platform/external/llvm

Chris Lattner	1171ff4	2005-10-23 19:52:42 +0000	[diff] [blame]	1	//===---------------------------------------------------------------------===//
				2	// Random ideas for the X86 backend.
				3	//===---------------------------------------------------------------------===//
				4
				5	Add MUL2U and MUL2S nodes to represent a multiply that returns both the
				6	Hi and Lo parts (combination of MUL and MULH[SU] into one node). Add this to
				7	X86, & make the dag combiner produce it when needed. This will eliminate one
				8	imul from the code generated for:
				9
				10	long long test(long long X, long long Y) { return X*Y; }
				11
				12	by using the EAX result from the mul. We should add a similar node for
				13	DIVREM.
				14
Chris Lattner	865874c	2005-12-02 00:11:20 +0000	[diff] [blame]	15	Another case is:
				16
				17	long long test(int X, int Y) { return (long long)X*Y; }
				18
				19	... which should only be one imul instruction.
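
As a rough illustration of what such a combined node would compute, the C
sketch below returns both halves of a 32x32->64 product from one widening
multiply. The names mul2_result, mul2u and mul2s are hypothetical and exist
only for this example; they are not LLVM APIs.

```c
#include <stdint.h>

/* Hypothetical result pair: the Lo and Hi halves of a 64-bit product. */
typedef struct { uint32_t lo; uint32_t hi; } mul2_result;

/* Unsigned 32x32->64 multiply. On X86 a single MUL leaves Lo in EAX and
   Hi in EDX, so both halves come from one instruction. */
static mul2_result mul2u(uint32_t x, uint32_t y) {
  uint64_t p = (uint64_t)x * y;
  mul2_result r = { (uint32_t)p, (uint32_t)(p >> 32) };
  return r;
}

/* Signed variant, corresponding to the one-operand form of IMUL. */
static mul2_result mul2s(int32_t x, int32_t y) {
  int64_t p = (int64_t)x * y;
  mul2_result r = { (uint32_t)(uint64_t)p, (uint32_t)((uint64_t)p >> 32) };
  return r;
}
```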
				20
Chris Lattner	1171ff4	2005-10-23 19:52:42 +0000	[diff] [blame]	21	//===---------------------------------------------------------------------===//
				22
				23	This should be one DIV/IDIV instruction, not a libcall:
				24
				25	unsigned test(unsigned long long X, unsigned Y) {
				26	return X/Y;
				27	}
				28
				29	This can be done trivially with a custom legalizer. What about overflow
				30	though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
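
The overflow concern can be made concrete. A 32-bit DIV divides EDX:EAX by
its operand and faults when the quotient does not fit in 32 bits, whereas the
C function above simply truncates. The sketch below (helper names are
hypothetical) shows the condition under which a single DIV would be safe.

```c
#include <stdint.h>

/* Returns 1 if X / Y fits in 32 bits, so a single 32-bit DIV would not
   fault. The quotient fits exactly when the high half of X is below Y. */
static int div64by32_fits(uint64_t x, uint32_t y) {
  return y != 0 && (uint32_t)(x >> 32) < y;
}

/* Reference semantics of test(): compute the full 64-bit quotient and
   truncate it to 32 bits. */
static uint32_t div_ref(uint64_t x, uint32_t y) {
  return (uint32_t)(x / y);
}
```

For inputs where div64by32_fits returns 0 (for example X = 1ULL << 32,
Y = 1), the truncated C result is still well defined, so a lowering to one
DIV would need a guard or a fallback path.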
				31
				32	//===---------------------------------------------------------------------===//
				33
Chris Lattner	1171ff4	2005-10-23 19:52:42 +0000	[diff] [blame]	34	Some targets (e.g. athlons) prefer freep to fstp ST(0):
				35	http://gcc.gnu.org/ml/gcc-patches/2004-04/msg00659.html
				36
				37	//===---------------------------------------------------------------------===//
				38
Evan Cheng	a3195e8	2006-01-12 22:54:21 +0000	[diff] [blame]	39	This should use fiadd on chips where it is profitable:
Chris Lattner	1171ff4	2005-10-23 19:52:42 +0000	[diff] [blame]	40	double foo(double P, int I) { return P+I; }
				41
				42	//===---------------------------------------------------------------------===//
				43
				44	The FP stackifier needs to be global. Also, it should handle simple permutations
				45	to reduce the number of shuffle instructions, e.g. turning:
				46
				47	fld P    ->    fld Q
				48	fld Q          fld P
				49	fxch
				50
				51	or:
				52
				53	fxch     ->    fucomi
				54	fucomi         jl X
				55	jg X
				56
Chris Lattner	1db4b4f	2006-01-16 17:53:00 +0000	[diff] [blame]	57	Ideas:
				58	http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02410.html
				59
				60
Chris Lattner	1171ff4	2005-10-23 19:52:42 +0000	[diff] [blame]	61	//===---------------------------------------------------------------------===//
				62
				63	Improvements to the multiply -> shift/add algorithm:
				64	http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
				65
				66	//===---------------------------------------------------------------------===//
				67
				68	Improve code like this (occurs fairly frequently, e.g. in LLVM):
				69	long long foo(int x) { return 1LL << x; }
				70
				71	http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
				72	http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
				73	http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html
				74
				75	Another useful one would be ~0ULL >> X and ~0ULL << X.
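
For reference, one way to express 1LL << x with 32-bit operations only is
sketched below. It assumes 0 <= x < 64 and returns an unsigned 64-bit value
to avoid signed-overflow issues in C; the function name is hypothetical.

```c
#include <stdint.h>

/* 1 << x for 0 <= x < 64, built from 32-bit halves. */
static uint64_t one_shl(unsigned x) {
  uint32_t bit = 1u << (x & 31);     /* position of the bit within a half */
  uint32_t lo = (x & 32) ? 0 : bit;  /* bit lands in the low half if x < 32 */
  uint32_t hi = (x & 32) ? bit : 0;  /* otherwise it lands in the high half */
  return ((uint64_t)hi << 32) | lo;
}
```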
				76
Chris Lattner	ffff617	2005-10-23 21:44:59 +0000	[diff] [blame]	77	//===---------------------------------------------------------------------===//
				78
				79	Should support emission of the bswap instruction, probably by adding a new
				80	DAG node for byte swapping. Also useful on PPC which has byte-swapping loads.
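
The kind of source pattern a byte-swap node would represent looks like the
following C sketch (the function name is hypothetical). Recognizing this
shift-and-mask idiom is what would allow a single bswap to be emitted.

```c
#include <stdint.h>

/* Reverse the four bytes of a 32-bit value. */
static uint32_t bswap32(uint32_t v) {
  return (v >> 24) |
         ((v >> 8) & 0x0000FF00u) |
         ((v << 8) & 0x00FF0000u) |
         (v << 24);
}
```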
				81
Chris Lattner	1e4ed93	2005-11-28 04:52:39 +0000	[diff] [blame]	82	//===---------------------------------------------------------------------===//
				83
				84	Compile this:
				85	_Bool f(_Bool a) { return a!=1; }
				86
				87	into:
				88	movzbl %dil, %eax
				89	xorl $1, %eax
				90	ret
Evan Cheng	8dee8cc	2005-12-17 01:25:19 +0000	[diff] [blame]	91
				92	//===---------------------------------------------------------------------===//
				93
				94	Some isel ideas:
				95
				96	1. Dynamic programming based approach when compile time is not an
				97	issue.
				98	2. Code duplication (addressing mode) during isel.
				99	3. Other ideas from "Register-Sensitive Selection, Duplication, and
				100	Sequencing of Instructions".
				101
				102	//===---------------------------------------------------------------------===//
				103
				104	Should we promote i16 to i32 to avoid partial register update stalls?
Evan Cheng	98abbfb	2005-12-17 06:54:43 +0000	[diff] [blame]	105
				106	//===---------------------------------------------------------------------===//
				107
				108	Leave any_extend as a pseudo instruction and pass a hint to the register
				109	allocator. Delay codegen until after register allocation.
Evan Cheng	a3195e8	2006-01-12 22:54:21 +0000	[diff] [blame]	110
				111	//===---------------------------------------------------------------------===//
				112
				113	Add a target specific hook to DAG combiner to handle SINT_TO_FP and
				114	FP_TO_SINT when the source operand is already in memory.
				115
				116	//===---------------------------------------------------------------------===//
				117
				118	Check if load folding would add a cycle in the dag.
Evan Cheng	e08c270	2006-01-13 01:20:42 +0000	[diff] [blame]	119
				120	//===---------------------------------------------------------------------===//
				121
				122	Model X86 EFLAGS as a real register to avoid redundant cmp / test. e.g.
				123
				124	cmpl $1, %eax
				125	setg %al
				126	testb %al, %al # unnecessary
				127	jne .BB7
Chris Lattner	1db4b4f	2006-01-16 17:53:00 +0000	[diff] [blame]	128
				129	//===---------------------------------------------------------------------===//
				130
				131	Count leading zeros and count trailing zeros:
				132
				133	int clz(int X) { return __builtin_clz(X); }
				134	int ctz(int X) { return __builtin_ctz(X); }
				135
				136	$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
				137	clz:
				138	bsr %eax, DWORD PTR [%esp+4]
				139	xor %eax, 31
				140	ret
				141	ctz:
				142	bsf %eax, DWORD PTR [%esp+4]
				143	ret
				144
				145	However, check that these are defined for 0 and 32. Our intrinsics are, GCC's
				146	aren't.
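
A sketch of wrappers that are defined for a zero input is shown below. It
assumes a GCC-compatible compiler, a 32-bit unsigned int, and that the
desired result for zero is the bit width (32); the wrapper names are
hypothetical.

```c
/* __builtin_clz and __builtin_ctz are undefined for 0 in GCC, so the
   zero case is handled explicitly here. */
static int clz32(unsigned x) { return x ? __builtin_clz(x) : 32; }
static int ctz32(unsigned x) { return x ? __builtin_ctz(x) : 32; }
```

Any lowering to bsr/bsf would need an equivalent check, since those
instructions leave the destination undefined when the source is zero.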
				147
				148	//===---------------------------------------------------------------------===//
				149
				150	Use push/pop instructions in prolog/epilog sequences instead of stores off
				151	ESP (certain code size win, perf win on some [which?] processors).
				152
				153	//===---------------------------------------------------------------------===//
				154
				155	Only use inc/neg/not instructions on processors where they are faster than
				156	add/sub/xor. They are slower on the P4 due to only updating some processor
				157	flags.
				158
				159	//===---------------------------------------------------------------------===//
				160
				161	Open code rint,floor,ceil,trunc:
				162	http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02006.html
				163	http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02011.html
				164
				165	//===---------------------------------------------------------------------===//
				166
				167	Combine: a = sin(x), b = cos(x) into a,b = sincos(x).
				168
Evan Cheng	e826a01	2006-01-27 22:11:01 +0000	[diff] [blame]	169	//===---------------------------------------------------------------------===//
				170
Reid Spencer	2ce5b26	2006-01-29 06:48:25 +0000	[diff] [blame]	171	For all targets, not just X86:
				172	When llvm.memcpy, llvm.memset, or llvm.memmove are lowered, they should be
				173	optimized to a few store instructions if the source is constant and the length
				174	is smallish (< 8). This will greatly help some tests like Shootout/strcat.c
				175
				176	//===---------------------------------------------------------------------===//
				177
Evan Cheng	e826a01	2006-01-27 22:11:01 +0000	[diff] [blame]	178	Solve this DAG isel folding deficiency:
				179
				180	int X, Y;
				181
				182	void fn1(void)
				183	{
				184	X = X | (Y << 3);
				185	}
				186
				187	compiles to
				188
				189	fn1:
				190	movl Y, %eax
				191	shll $3, %eax
				192	orl X, %eax
				193	movl %eax, X
				194	ret
				195
				196	The problem is the store's chain operand is not the load X but rather
Evan Cheng	d41e9e5	2006-01-27 22:54:32 +0000	[diff] [blame]	197	a TokenFactor of the load X and load Y, which prevents the folding.
				198
				199	There are two ways to fix this:
				200
				201	1. The dag combiner can start using alias analysis to realize that X and Y
				202	don't alias, making the store to X not dependent on the load from Y.
				203	2. The generated isel could be made smarter in the case it can't
				204	disambiguate the pointers.
				205
				206	Number 1 is the preferred solution.
Chris Lattner	b638cd8	2006-01-29 09:08:15 +0000	[diff] [blame]	207
				208	//===---------------------------------------------------------------------===//
				209
				210	The instruction selector sometimes misses folding a load into a compare. The
				211	pattern is written as (cmp reg, (load p)). Because the compare isn't
				212	commutative, it is not matched with the load on both sides. The dag combiner
				213	should be made smart enough to canonicalize the load into the RHS of a compare
				214	when it can invert the result of the compare for free.
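
The sketch below illustrates the idea in C, reading "invert for free" as
picking the condition code that matches the swapped operands. The enum and
function names are hypothetical and not taken from LLVM.

```c
/* Hypothetical condition codes for a signed integer compare. */
typedef enum { CC_EQ, CC_NE, CC_LT, CC_LE, CC_GT, CC_GE } cond_code;

/* Condition to use after swapping the two compare operands:
   (a < b) is the same test as (b > a), and so on. */
static cond_code swap_operands(cond_code cc) {
  switch (cc) {
  case CC_LT: return CC_GT;
  case CC_LE: return CC_GE;
  case CC_GT: return CC_LT;
  case CC_GE: return CC_LE;
  default:    return cc;      /* EQ and NE are symmetric */
  }
}

/* Evaluate "a <cc> b". */
static int eval_cmp(cond_code cc, int a, int b) {
  switch (cc) {
  case CC_EQ: return a == b;
  case CC_NE: return a != b;
  case CC_LT: return a < b;
  case CC_LE: return a <= b;
  case CC_GT: return a > b;
  default:    return a >= b;  /* CC_GE */
  }
}
```

With this mapping, a compare whose load is on the LHS can be rewritten as
eval_cmp(swap_operands(cc), reg, load), which matches the (cmp reg, (load p))
pattern.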
				215
Chris Lattner	6a28456	2006-01-29 09:14:47 +0000	[diff] [blame]	216	//===---------------------------------------------------------------------===//
				217
Chris Lattner	5164a31	2006-01-29 09:42:20 +0000	[diff] [blame]	218	LSR should be turned on for the X86 backend and tuned to take advantage of its
				219	addressing modes.
				220
Chris Lattner	c7097af	2006-01-29 09:46:06 +0000	[diff] [blame]	221	//===---------------------------------------------------------------------===//
				222
				223	When compiled with unsafemath enabled, "main" should enable SSE DAZ mode and
				224	other fast SSE modes.
Chris Lattner	bdde465	2006-01-31 00:20:38 +0000	[diff] [blame]	225
				226	//===---------------------------------------------------------------------===//
				227
Chris Lattner	594086d	2006-01-31 00:45:37 +0000	[diff] [blame]	228	Think about doing i64 math in SSE regs.
				229
Chris Lattner	8e38ae6	2006-01-31 02:10:06 +0000	[diff] [blame]	230	//===---------------------------------------------------------------------===//
				231
				232	The DAG Isel doesn't fold the loads into the adds in this testcase. The
				233	pattern selector does. This is because the chain value of the load gets
				234	selected first, and the loads aren't checking to see if they are only used by
				235	an add.
				236
				237	.ll:
				238
				239	int %test(int* %x, int* %y, int* %z) {
				240	%X = load int* %x
				241	%Y = load int* %y
				242	%Z = load int* %z
				243	%a = add int %X, %Y
				244	%b = add int %a, %Z
				245	ret int %b
				246	}
				247
				248	dag isel:
				249
				250	_test:
				251	movl 4(%esp), %eax
				252	movl (%eax), %eax
				253	movl 8(%esp), %ecx
				254	movl (%ecx), %ecx
				255	addl %ecx, %eax
				256	movl 12(%esp), %ecx
				257	movl (%ecx), %ecx
				258	addl %ecx, %eax
				259	ret
				260
				261	pattern isel:
				262
				263	_test:
				264	movl 12(%esp), %ecx
				265	movl 4(%esp), %edx
				266	movl 8(%esp), %eax
				267	movl (%eax), %eax
				268	addl (%edx), %eax
				269	addl (%ecx), %eax
				270	ret
				271
				272	This is bad for register pressure, though the dag isel is producing a
				273	better schedule. :)
Chris Lattner	3e1d5e5	2006-02-01 01:44:25 +0000	[diff] [blame]	274
				275	//===---------------------------------------------------------------------===//
				276
				277	This testcase should have no SSE instructions in it, and only one load from
				278	a constant pool:
				279
				280	double %test3(bool %B) {
				281	%C = select bool %B, double 123.412, double 523.01123123
				282	ret double %C
				283	}
				284
				285	Currently, the select is being lowered, which prevents the dag combiner from
				286	turning 'select (load CPI1), (load CPI2)' -> 'load (select CPI1, CPI2)'
				287
				288	The pattern isel got this one right.
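
In C terms, the desired transformation selects between the two constant
addresses and then performs a single load. The sketch below uses
hypothetical names (cpi1, cpi2) to stand in for the constant pool entries.

```c
/* Stand-ins for the two constant pool entries. */
static const double cpi1 = 123.412;
static const double cpi2 = 523.01123123;

/* 'select (load CPI1), (load CPI2)': both values are materialized. */
static double test3_two_loads(int b) {
  double x = cpi1, y = cpi2;
  return b ? x : y;
}

/* 'load (select CPI1, CPI2)': pick the address, then load once. */
static double test3_one_load(int b) {
  const double *p = b ? &cpi1 : &cpi2;
  return *p;
}
```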
				289
Chris Lattner	1f7c630	2006-02-01 06:40:32 +0000	[diff] [blame]	290	//===---------------------------------------------------------------------===//
				291
Chris Lattner	3e2b94a	2006-02-01 21:44:48 +0000	[diff] [blame]	292	We need to lower switch statements to tablejumps when appropriate instead of
				293	always into binary branch trees.
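
As a source-level analogy, a dense switch can be replaced by one range check
plus an indexed jump through a table. The C sketch below shows both forms;
all names are hypothetical and the handlers are placeholders.

```c
static int op_add(int a, int b) { return a + b; }
static int op_sub(int a, int b) { return a - b; }
static int op_mul(int a, int b) { return a * b; }
static int op_and(int a, int b) { return a & b; }

/* Lowered as a branch tree, this takes several compares per dispatch. */
static int dispatch_switch(unsigned op, int a, int b) {
  switch (op) {
  case 0: return op_add(a, b);
  case 1: return op_sub(a, b);
  case 2: return op_mul(a, b);
  case 3: return op_and(a, b);
  default: return 0;
  }
}

/* Table form: one bounds check, then an indirect call through the table. */
static int dispatch_table(unsigned op, int a, int b) {
  static int (*const table[4])(int, int) = { op_add, op_sub, op_mul, op_and };
  if (op >= 4) return 0;  /* default case */
  return table[op](a, b);
}
```

The table form only pays off when the case values are dense, which is
presumably what "when appropriate" refers to.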
Chris Lattner	4d7db40	2006-02-01 23:38:08 +0000	[diff] [blame]	294
				295	//===---------------------------------------------------------------------===//
				296
				297	SSE doesn't have [mem] op= reg instructions. If we have an SSE instruction
				298	like this:
				299
				300	X += Y
				301
				302	and the register allocator decides to spill X, it is cheaper to emit this as:
				303
				304	Y += [xslot]
				305	store Y -> [xslot]
				306
				307	than as:
				308
				309	tmp = [xslot]
				310	tmp += Y
				311	store tmp -> [xslot]
				312
				313	...and this uses one fewer register (so this should be done at load folding
				314	time, not at spiller time). Note however that this can only be done
				315	if Y is dead. Here's a testcase:
				316
				317	%.str_3 = external global [15 x sbyte] ; <[15 x sbyte]*> [#uses=0]
				318	implementation ; Functions:
				319	declare void %printf(int, ...)
				320	void %main() {
				321	build_tree.exit:
				322	br label %no_exit.i7
				323	no_exit.i7: ; preds = %no_exit.i7, %build_tree.exit
				324	%tmp.0.1.0.i9 = phi double [ 0.000000e+00, %build_tree.exit ], [ %tmp.34.i18, %no_exit.i7 ] ; <double> [#uses=1]
				325	%tmp.0.0.0.i10 = phi double [ 0.000000e+00, %build_tree.exit ], [ %tmp.28.i16, %no_exit.i7 ] ; <double> [#uses=1]
				326	%tmp.28.i16 = add double %tmp.0.0.0.i10, 0.000000e+00
				327	%tmp.34.i18 = add double %tmp.0.1.0.i9, 0.000000e+00
				328	br bool false, label %Compute_Tree.exit23, label %no_exit.i7
				329	Compute_Tree.exit23: ; preds = %no_exit.i7
				330	tail call void (int, ...)* %printf( int 0 )
				331	store double %tmp.34.i18, double* null
				332	ret void
				333	}
				334
				335	We currently emit:
				336
				337	.BBmain_1:
				338	xorpd %XMM1, %XMM1
				339	addsd %XMM0, %XMM1
				340	*** movsd %XMM2, QWORD PTR [%ESP + 8]
				341	*** addsd %XMM2, %XMM1
				342	*** movsd QWORD PTR [%ESP + 8], %XMM2
				343	jmp .BBmain_1 # no_exit.i7
				344
				345	This is a bugpoint reduced testcase, which is why the testcase doesn't make
				346	much sense (e.g. it's an infinite loop). :)
				347
Evan Cheng	8b6e4e6	2006-02-02 02:40:17 +0000	[diff] [blame]	348	//===---------------------------------------------------------------------===//
				349
				350	None of the FPStack instructions are handled in
				351	X86RegisterInfo::foldMemoryOperand, which prevents the spiller from
				352	folding spill code into the instructions.
Chris Lattner	9acddcd	2006-02-02 19:16:34 +0000	[diff] [blame^]	353
				354	//===---------------------------------------------------------------------===//
				355
				356	In many cases, LLVM generates code like this:
				357
				358	_test:
				359	movl 8(%esp), %eax
				360	cmpl %eax, 4(%esp)
				361	setl %al
				362	movzbl %al, %eax
				363	ret
				364
				365	On some processors (which ones?), it is more efficient to do this:
				366
				367	_test:
				368	movl 8(%esp), %ebx
				369	xor %eax, %eax
				370	cmpl %ebx, 4(%esp)
				371	setl %al
				372	ret
				373
				374	Doing this correctly is tricky though, as the xor clobbers the flags.
				375
				376