//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

Add MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (combination of MUL and MULH[SU] into one node). Add this to
X86, and make the dag combiner produce it when needed. This will eliminate one
imul from the code generated for:

long long test(long long X, long long Y) { return X*Y; }

by using the EAX result from the mul. We should add a similar node for
DIVREM.

Another case is:

long long test(int X, int Y) { return (long long)X*Y; }

... which should only be one imul instruction.

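For the DIVREM node mentioned above, a source pattern like this (a
hypothetical sketch, not from the original notes) should compile down to a
single idivl, since that instruction already produces the quotient in EAX and
the remainder in EDX:

void divrem(int a, int b, int *q, int *r) {
  /* one idivl yields both results (EAX = quotient, EDX = remainder);
     today this is lowered as two separate divisions */
  *q = a / b;
  *r = a % b;
}
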
//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
        return X/Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224

//===---------------------------------------------------------------------===//

Some targets (e.g. Athlons) prefer ffreep to fstp ST(0):
http://gcc.gnu.org/ml/gcc-patches/2004-04/msg00659.html

//===---------------------------------------------------------------------===//

This should use fiadd on chips where it is profitable:
double foo(double P, int *I) { return P+*I; }

//===---------------------------------------------------------------------===//

The FP stackifier needs to be global. Also, it should handle simple
permutations to reduce the number of shuffle instructions, e.g. turning:

fld P    ->     fld Q
fld Q           fld P
fxch

or:

fxch     ->     fucomi
fucomi          jl X
jg X

Ideas:
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02410.html


//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html

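For example (a made-up case, not from the notes), a multiply by a small
constant should become an LEA or a short shift/add sequence instead of an
imul:

int mul_by_9(int x) {
  /* 9*x fits in one LEA: leal (%eax,%eax,8), %eax */
  return x * 9;
}
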
//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.

//===---------------------------------------------------------------------===//

Should support emission of the bswap instruction, probably by adding a new
DAG node for byte swapping. Also useful on PPC which has byte-swapping loads.

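A typical source idiom such a node would match (a hypothetical example) is
the classic open-coded 32-bit swap:

unsigned int byteswap32(unsigned int x) {
  /* should collapse to a single bswap instruction */
  return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
         ((x << 8) & 0x00ff0000u) | (x << 24);
}
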
//===---------------------------------------------------------------------===//

Compile this:
_Bool f(_Bool a) { return a!=1; }

into:
        movzbl  %dil, %eax
        xorl    $1, %eax
        ret

//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?

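For instance (a hypothetical kernel), 16-bit arithmetic like this writes only
%ax, so a later read of the full %eax can stall; doing the math in i32 and
truncating the result avoids the partial-register update:

short add16(short a, short b) {
  /* an addw writes only %ax; an addl plus truncate avoids the stall */
  return a + b;
}
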
//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and hint to the register
allocator. Delay codegen until post register allocation.

//===---------------------------------------------------------------------===//

Add a target-specific hook to the DAG combiner to handle SINT_TO_FP and
FP_TO_SINT when the source operand is already in memory.

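e.g. (a hypothetical case), when the integer operand already lives in memory
the conversion could use fild on the memory operand directly instead of a
separate load followed by a conversion:

double load_and_convert(int *p) {
  /* the int source is already in memory: fildl (mem) folds the load
     into the int -> fp conversion */
  return (double)*p;
}
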
//===---------------------------------------------------------------------===//

Check if load folding would add a cycle in the dag.

//===---------------------------------------------------------------------===//

Model X86 EFLAGS as a real register to avoid redundant cmp / test. e.g.

        cmpl $1, %eax
        setg %al
        testb %al, %al   # unnecessary
        jne .BB7

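A source fragment along these lines (my guess at the kind of input) produces
that sequence today; with EFLAGS modeled as a register, the branch could
consume the flags from the cmp directly:

extern void bar(void);

void foo(int x) {
  /* the cmpl already sets the flags the branch needs; the setg + testb
     above are redundant */
  if (x > 1)
    bar();
}
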
//===---------------------------------------------------------------------===//

Count leading zeros and count trailing zeros:

int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }

$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
clz:
        bsr     %eax, DWORD PTR [%esp+4]
        xor     %eax, 31
        ret
ctz:
        bsf     %eax, DWORD PTR [%esp+4]
        ret

however, check that these are defined for 0 and 32. Our intrinsics are, GCC's
aren't.

//===---------------------------------------------------------------------===//

Use push/pop instructions in prolog/epilog sequences instead of stores off
ESP (certain code size win, perf win on some [which?] processors).

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 due to updating only some of the
processor flags.

//===---------------------------------------------------------------------===//

Open code rint, floor, ceil, trunc:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02006.html
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02011.html

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

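e.g. (a hypothetical kernel), both calls here take the same argument and
could be folded into a single sincos call:

#include <math.h>

void polar_to_cart(double r, double t, double *x, double *y) {
  *x = r * cos(t);   /* same argument as the sin below ... */
  *y = r * sin(t);   /* ... so one sincos could feed both */
}
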
//===---------------------------------------------------------------------===//

For all targets, not just X86:
When llvm.memcpy, llvm.memset, or llvm.memmove are lowered, they should be
optimized to a few store instructions if the source is constant and the length
is smallish (< 8). This will greatly help some tests like Shootout/strcat.c.

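For example (a hypothetical case), a small constant-length copy like the one
below should lower to a couple of 32-bit stores rather than a call to memcpy:

#include <string.h>

void init(char *p) {
  /* 8 constant bytes: two movl stores, no library call */
  memcpy(p, "abcdefg", 8);
}
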
//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that X and Y
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.

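A hypothetical source that produces the non-commuted form:

int cmp_mem_lhs(int *p, int x) {
  /* written with the load on the LHS: (setlt (load p), x); canonicalizing
     it to (setgt x, (load p)) lets the cmp fold the load */
  return *p < x;
}
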
//===---------------------------------------------------------------------===//

The code generated for 'abs' is truly awful:

float %foo(float %tmp.38) {
        %tmp.39 = setgt float %tmp.38, 0.000000e+00
        %tmp.45 = sub float -0.000000e+00, %tmp.38
        %mem_tmp.0.0 = select bool %tmp.39, float %tmp.38, float %tmp.45
        ret float %mem_tmp.0.0
}

_foo:
        subl $4, %esp
        movss LCPI1_0, %xmm0
        movss 8(%esp), %xmm1
        subss %xmm1, %xmm0
        xorps %xmm2, %xmm2
        ucomiss %xmm2, %xmm1
        setp %al
        seta %cl
        orb %cl, %al
        testb %al, %al
        jne LBB_foo_2   #
LBB_foo_1:      #
        movss %xmm0, %xmm1
LBB_foo_2:      #
        movss %xmm1, (%esp)
        flds (%esp)
        addl $4, %esp
        ret

This should be a high priority to fix. With the fp-stack, this is a single
instruction. With SSE it could be far better than this. Why is the sequence
above using 'setp'? It shouldn't care about NaNs.

//===---------------------------------------------------------------------===//

Is there a better way to implement Y = -X (fneg) than the literal code:

float %test(float %X) {
        %Y = sub float -0.0, %X
        ret float %Y
}

        movss LCPI1_0, %xmm0    ;; load -0.0
        subss 8(%esp), %xmm0    ;; subtract

//===---------------------------------------------------------------------===//

None of the SSE instructions are handled in X86RegisterInfo::foldMemoryOperand,
which prevents the spiller from folding spill code into the instructions.

This leads to code like this:

        movl 8(%esp), %eax
        cvtsi2sd %eax, %xmm0
instead of:
        cvtsi2sd 8(%esp), %xmm0

//===---------------------------------------------------------------------===//

The instruction selector selects 'int X = 0' as 'mov Reg, 0' instead of
'xor Reg, Reg'. This is bigger and slower.

//===---------------------------------------------------------------------===//

LSR should be turned on for the X86 backend and tuned to take advantage of its
addressing modes.

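For example (a hypothetical loop), with LSR tuned for X86 the address
arithmetic can live in the reg+reg*scale addressing mode instead of being
recomputed from the induction variable on every iteration:

void scale(int *a, int n) {
  int i;
  /* a[i] can be addressed as (%base,%index,4), e.g. addl $4, (%ecx,%eax,4),
     with no separate address computation in the loop body */
  for (i = 0; i != n; ++i)
    a[i] += 4;
}
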
//===---------------------------------------------------------------------===//

When compiled with unsafe math enabled, "main" should enable SSE DAZ mode and
other fast SSE modes.

//===---------------------------------------------------------------------===//

cd Regression/CodeGen/X86
llvm-as < setuge.ll | llc -march=x86 -mcpu=yonah -enable-x86-sse

_cmp:
        subl $4, %esp
1)      leal 20(%esp), %eax
        movss 12(%esp), %xmm0
1)      leal 16(%esp), %ecx
        ucomiss 8(%esp), %xmm0
        cmovb %ecx, %eax
2)      movss (%eax), %xmm0
2)      movss %xmm0, (%esp)
        flds (%esp)
        addl $4, %esp
        ret


1) These LEAs should be adds. This is tricky because they are FrameIndexes
   before prolog-epilog rewriting.
2) We shouldn't load a value into an XMM reg only to store it back out.

//===---------------------------------------------------------------------===//

Think about doing i64 math in SSE regs.

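e.g. (a hypothetical case), a simple 64-bit add is currently a 32-bit add/adc
pair; with SSE2 a single paddq could do it, at the cost of moving the
operands into XMM registers:

long long add64(long long a, long long b) {
  /* addl + adcl today; one paddq if the math is done in an XMM register */
  return a + b;
}
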
//===---------------------------------------------------------------------===//

The DAG isel doesn't fold the loads into the adds in this testcase. The
pattern selector does. This is because the chain value of the load gets
selected first, and the loads aren't checked to see if they are only used by
an add.

.ll:

int %test(int* %x, int* %y, int* %z) {
        %X = load int* %x
        %Y = load int* %y
        %Z = load int* %z
        %a = add int %X, %Y
        %b = add int %a, %Z
        ret int %b
}

dag isel:

_test:
        movl 4(%esp), %eax
        movl (%eax), %eax
        movl 8(%esp), %ecx
        movl (%ecx), %ecx
        addl %ecx, %eax
        movl 12(%esp), %ecx
        movl (%ecx), %ecx
        addl %ecx, %eax
        ret

pattern isel:

_test:
        movl 12(%esp), %ecx
        movl 4(%esp), %edx
        movl 8(%esp), %eax
        movl (%eax), %eax
        addl (%edx), %eax
        addl (%ecx), %eax
        ret

This is bad for register pressure, though the dag isel is producing a
better schedule. :)
