//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

Add MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (a combination of MUL and MULH[SU] in one node). Add this to
X86 and make the DAG combiner produce it when needed. This will eliminate one
imul from the code generated for:

long long test(long long X, long long Y) { return X*Y; }

by using the EAX result from the mul. We should add a similar node for
DIVREM.
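
As a hedged sketch (the function is ours, for illustration): a single
hardware div already produces both results, so one DIVREM node would cover
code like the following with a single idiv:

/* One x86 idiv leaves the quotient in EAX and the remainder in EDX, so a
   combined DIVREM node would emit exactly one divide here. */
void divrem(int A, int B, int *Q, int *R) { *Q = A / B; *R = A % B; }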

Another case is:

long long test(int X, int Y) { return (long long)X*Y; }

... which should only be one imul instruction.

//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
        return X/Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224

//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
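
As an illustration of the kind of decomposition such an algorithm picks (a
sketch of ours, not code from that patch):

/* Multiplies by constants rewritten as shifts plus adds/subs:
   x*9  = (x << 3) + x
   x*30 = ((x << 4) - x) << 1, i.e. (x*15)*2  */
unsigned mul9(unsigned x)  { return (x << 3) + x; }
unsigned mul30(unsigned x) { return ((x << 4) - x) << 1; }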

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.
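
For reference, a minimal sketch of those two patterns:

/* Variable all-ones masks; both want the same shift lowering as 1LL << x. */
unsigned long long lomask(unsigned X) { return ~0ULL >> X; }
unsigned long long himask(unsigned X) { return ~0ULL << X; }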

//===---------------------------------------------------------------------===//

Compile this:
_Bool f(_Bool a) { return a!=1; }

into:
        movzbl  %dil, %eax
        xorl    $1, %eax
        ret

//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure. E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?
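
As a hedged illustration, 16-bit arithmetic like the following writes only
%ax, leaving the upper bits of %eax stale, which is what triggers the stall:

/* The addw generated here updates just %ax; a later read of the full %eax
   pays a partial register update penalty on some processors. */
short add16(short a, short b) { return a + b; }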

//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and hint to the register
allocator. Delay codegen until post register allocation.

//===---------------------------------------------------------------------===//

Model X86 EFLAGS as a real register to avoid redundant cmp / test, e.g.:

        cmpl $1, %eax
        setg %al
        testb %al, %al   # unnecessary
        jne .BB7

//===---------------------------------------------------------------------===//

Count leading zeros and count trailing zeros:

int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }

$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
clz:
        bsr     %eax, DWORD PTR [%esp+4]
        xor     %eax, 31
        ret
ctz:
        bsf     %eax, DWORD PTR [%esp+4]
        ret

However, check that these are defined for 0 and 32. Our intrinsics are, GCC's
aren't.
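
For illustration, matching the defined-at-zero semantics with the GCC
builtins takes an explicit guard (these wrapper names are hypothetical):

/* __builtin_clz/__builtin_ctz are undefined for a zero input, so guard it. */
int clz_total(unsigned X) { return X ? __builtin_clz(X) : 32; }
int ctz_total(unsigned X) { return X ? __builtin_ctz(X) : 32; }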

//===---------------------------------------------------------------------===//

Use push/pop instructions in prolog/epilog sequences instead of stores off
ESP (certain code size win, perf win on some [which?] processors).
Also, it appears icc uses push for parameter passing. Need to investigate.

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 because they update only some of the
processor flags.

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
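
A minimal case that exercises this (our example): the load sits on the LHS
of the compare, so folding it requires swapping the operands and inverting
the condition:

/* (cmp (load p), reg) only folds if canonicalized to (cmp reg, (load p))
   with the condition flipped from '<' to '>'. */
int lhs_load(int *P, int X) { return *P < X; }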

How about intrinsics? An example:
*res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));

compiles to:
        pmuludq (%eax), %xmm0
        movl 8(%esp), %eax
        movdqa (%eax), %xmm1
        pmulhuw %xmm0, %xmm1

The transformation probably requires an X86-specific pass or a DAG combiner
target-specific hook.
//===---------------------------------------------------------------------===//

The DAG Isel doesn't fold the loads into the adds in this testcase. The
pattern selector does. This is because the chain value of the load gets
selected first, and the loads aren't checking to see if they are only used by
an add.

.ll:

int %test(int* %x, int* %y, int* %z) {
        %X = load int* %x
        %Y = load int* %y
        %Z = load int* %z
        %a = add int %X, %Y
        %b = add int %a, %Z
        ret int %b
}

dag isel:

_test:
        movl 4(%esp), %eax
        movl (%eax), %eax
        movl 8(%esp), %ecx
        movl (%ecx), %ecx
        addl %ecx, %eax
        movl 12(%esp), %ecx
        movl (%ecx), %ecx
        addl %ecx, %eax
        ret

pattern isel:

_test:
        movl 12(%esp), %ecx
        movl 4(%esp), %edx
        movl 8(%esp), %eax
        movl (%eax), %eax
        addl (%edx), %eax
        addl (%ecx), %eax
        ret

This is bad for register pressure, though the dag isel is producing a
better schedule. :)

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setl %al
        movzbl %al, %eax
        ret

On some processors (which ones?), it is more efficient to do this:

_test:
        movl 8(%esp), %ebx
        xor %eax, %eax
        cmpl %ebx, 4(%esp)
        setl %al
        ret

Doing this correctly is tricky though, as the xor clobbers the flags.

//===---------------------------------------------------------------------===//

We should generate 'test' instead of 'cmp' in various cases, e.g.:

bool %test(int %X) {
        %Y = shl int %X, ubyte 1
        %C = seteq int %Y, 0
        ret bool %C
}
bool %test(int %X) {
        %Y = and int %X, 8
        %C = seteq int %Y, 0
        ret bool %C
}

This may just be a matter of using 'test' to write bigger patterns for X86cmp.

An important case is comparison against zero:

if (X == 0) ...

instead of:

        cmpl $0, %eax
        je LBB4_2       # cond_next

use:

        test %eax, %eax
        jz LBB4_2

which is smaller.

//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:

void setbit(int *target, int bit) {
        *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
        *target &= ~(1 << bit);
}

//===---------------------------------------------------------------------===//

Instead of the following for memset(char*, 1, 10):

        movl $16843009, 4(%edx)
        movl $16843009, (%edx)
        movw $257, 8(%edx)

it might be better to generate

        movl $16843009, %eax
        movl %eax, 4(%edx)
        movl %eax, (%edx)
        movw %ax, 8(%edx)

when we can spare a register. It reduces code size.

//===---------------------------------------------------------------------===//

Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
get this:

int %test1(int %X) {
        %Y = div int %X, 8
        ret int %Y
}

_test1:
        movl 4(%esp), %eax
        movl %eax, %ecx
        sarl $31, %ecx
        shrl $29, %ecx
        addl %ecx, %eax
        sarl $3, %eax
        ret

GCC knows several different ways to codegen it, one of which is this:

_test1:
        movl 4(%esp), %eax
        cmpl $-1, %eax
        leal 7(%eax), %ecx
        cmovle %ecx, %eax
        sarl $3, %eax
        ret

which is probably slower, but it's interesting at least :)

//===---------------------------------------------------------------------===//

We should generate min/max for code like:

void minf(float a, float b, float *X) {
        *X = a <= b ? a : b;
}

Make use of floating point min / max instructions. Perhaps introduce ISD::FMIN
and ISD::FMAX node types?

//===---------------------------------------------------------------------===//

The first BB of this code:

declare bool %foo()
int %bar() {
        %V = call bool %foo()
        br bool %V, label %T, label %F
T:
        ret int 1
F:
        call bool %foo()
        ret int 12
}

compiles to:

_bar:
        subl $12, %esp
        call L_foo$stub
        xorb $1, %al
        testb %al, %al
        jne LBB_bar_2   # F

It would be better to emit "cmp %al, 1" than a xor and test.

//===---------------------------------------------------------------------===//

Enable X86InstrInfo::convertToThreeAddress().

//===---------------------------------------------------------------------===//

Investigate whether it is better to codegen the following

        %tmp.1 = mul int %x, 9

to

        movl    4(%esp), %eax
        leal    (%eax,%eax,8), %eax

as opposed to what llc is currently generating:

        imull $9, 4(%esp), %eax

Currently the load-folding imull has a higher complexity than the LEA32
pattern.

//===---------------------------------------------------------------------===//

We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl. We should leave these as libcalls for everything over a much lower
threshold, since libc is hand-tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc.).

//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
x * copysign(1.0, y) * copysign(1.0, z)
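
A sketch of "something reasonable" (ours; assumes IEEE doubles and ignores
the sign of NaN products): the copysign factors can only flip the sign of x,
so the whole expression reduces to xoring x's sign bit with the sign bits of
y and z:

#include <string.h>

/* x * copysign(1.0, y) * copysign(1.0, z): the magnitude stays |x|; the
   result's sign bit is sign(x) ^ sign(y) ^ sign(z). */
double fold(double x, double y, double z) {
  unsigned long long xi, yi, zi;
  memcpy(&xi, &x, 8); memcpy(&yi, &y, 8); memcpy(&zi, &z, 8);
  xi ^= (yi ^ zi) & 0x8000000000000000ULL;
  memcpy(&x, &xi, 8);
  return x;
}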

//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.
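
A hedged sketch for single precision (the integer load of *y grabs just the
sign bit, so no FP load of y is needed):

#include <string.h>

/* copysign(x, *y): keep x's magnitude bits, take the sign bit from an
   integer load of *y. */
float copysign_load(float x, const float *y) {
  unsigned xi, yi;
  memcpy(&xi, &x, 4);
  memcpy(&yi, y, 4);
  xi = (xi & 0x7fffffffu) | (yi & 0x80000000u);
  memcpy(&x, &xi, 4);
  return x;
}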

//===---------------------------------------------------------------------===//

%X = weak global int 0

void %foo(int %N) {
        %N = cast int %N to uint
        %tmp.24 = setgt int %N, 0
        br bool %tmp.24, label %no_exit, label %return

no_exit:
        %indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
        %i.0.0 = cast uint %indvar to int
        volatile store int %i.0.0, int* %X
        %indvar.next = add uint %indvar, 1
        %exitcond = seteq uint %indvar.next, %N
        br bool %exitcond, label %return, label %no_exit

return:
        ret void
}

compiles into:

        .text
        .align 4
        .globl _foo
_foo:
        movl 4(%esp), %eax
        cmpl $1, %eax
        jl LBB_foo_4    # return
LBB_foo_1:      # no_exit.preheader
        xorl %ecx, %ecx
LBB_foo_2:      # no_exit
        movl L_X$non_lazy_ptr, %edx
        movl %ecx, (%edx)
        incl %ecx
        cmpl %eax, %ecx
        jne LBB_foo_2   # no_exit
LBB_foo_3:      # return.loopexit
LBB_foo_4:      # return
        ret

We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
rematerialization is implemented. This can be accomplished with 1) a target
dependent LICM pass or 2) making the SelectionDAG represent the whole function.

//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.

//===---------------------------------------------------------------------===//

Teach the coalescer to coalesce vregs of different register classes, e.g.
FR32 / FR64 to VR128.

//===---------------------------------------------------------------------===//

        mov $reg, 48(%esp)
        ...
        leal 48(%esp), %eax
        mov %eax, (%esp)
        call _foo

Obviously it would have been better for the first mov (or any op) to store
directly to (%esp) if there are no other uses.

//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))
    return 3;
  else
    return 4;
}

_test:
        movl 8(%esp), %eax
        movaps (%eax), %xmm0
        movl 4(%esp), %eax
        movaps (%eax), %xmm1
        comiss %xmm0, %xmm1
        setae %al
        movzbl %al, %ecx
        movl $3, %eax
        movl $4, %edx
        cmpl $0, %ecx
        cmove %edx, %eax
        ret

Note that the setae, movzbl, cmpl, and cmove can be replaced with a single
cmovae. There are a number of issues: 1) we are introducing a setcc between
the result of the intrinsic call and the select; 2) the intrinsic is expected
to produce an i32 value, so an any_extend (which becomes a zero_extend) is
added.

We probably need some kind of target DAG combine hook to fix this.

//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case where we do worse on PPC.

//===---------------------------------------------------------------------===//

If shorter, we should use things like:

        movzwl %ax, %eax

instead of:

        andl $65535, %eax

The former can also be used when the two-address nature of the 'and' would
require a copy to be inserted (in X86InstrInfo::convertToThreeAddress).

//===---------------------------------------------------------------------===//

This testcase generates ugly code, probably because the costs are off:

void %test(float* %P, <4 x float>* %P2 ) {
  %xFloat0.688 = load float* %P
  %loadVector37.712 = load <4 x float>* %P2
  %inFloat3.713 = insertelement <4 x float> %loadVector37.712, float 0.000000e+00, uint 3
  store <4 x float> %inFloat3.713, <4 x float>* %P2
  ret void
}

Generates:

_test:
        pxor %xmm0, %xmm0
        movd %xmm0, %eax        ;; EAX = 0!
        movl 8(%esp), %ecx
        movaps (%ecx), %xmm0
        pinsrw $6, %eax, %xmm0
        shrl $16, %eax          ;; EAX = 0 again!
        pinsrw $7, %eax, %xmm0
        movaps %xmm0, (%ecx)
        ret

It would be better to generate:

_test:
        movl 8(%esp), %ecx
        movaps (%ecx), %xmm0
        xor %eax, %eax
        pinsrw $6, %eax, %xmm0
        pinsrw $7, %eax, %xmm0
        movaps %xmm0, (%ecx)
        ret

or use pxor (to make a zero vector) and a shuffle (to insert it).

//===---------------------------------------------------------------------===//

Bad codegen:

char foo(int x) { return x; }

_foo:
        movl 4(%esp), %eax
        shll $24, %eax
        sarl $24, %eax
        ret

SIGN_EXTEND_INREG can be implemented as (sext (trunc)) to take advantage of
sub-registers.

//===---------------------------------------------------------------------===//

Consider this:

typedef struct pair { float A, B; } pair;
void pairtest(pair P, float *FP) {
        *FP = P.A+P.B;
}

We currently generate this code with llvmgcc4:

_pairtest:
        subl $12, %esp
        movl 20(%esp), %eax
        movl %eax, 4(%esp)
        movl 16(%esp), %eax
        movl %eax, (%esp)
        movss (%esp), %xmm0
        addss 4(%esp), %xmm0
        movl 24(%esp), %eax
        movss %xmm0, (%eax)
        addl $12, %esp
        ret

We should be able to generate:

_pairtest:
        movss 4(%esp), %xmm0
        movl 12(%esp), %eax
        addss 8(%esp), %xmm0
        movss %xmm0, (%eax)
        ret

The issue is that llvmgcc4 is forcing the struct to memory, then passing it as
integer chunks. It does this so that structs like {short,short} are passed in
a single 32-bit integer stack slot. We should handle safe cases like the one
above much more nicely, while still handling the hard cases.

//===---------------------------------------------------------------------===//

Some ideas for instruction selection code simplification:

1. A pre-pass to determine which chain-producing nodes can or cannot be
   folded. The generated isel code would then use that information.
2. The same pre-pass can force the ordering of TokenFactor operands to allow
   load / store folding.
3. During isel, instead of recursively walking up the chain operands, mark
   the chain operand as available and put it on a worklist. Select the other
   nodes in the normal manner; the chain operands are selected after all
   other nodes, and uses of chain nodes are fixed up after instruction
   selection completes.

//===---------------------------------------------------------------------===//

Another instruction selector deficiency:

void %bar() {
        %tmp = load int (int)** %foo
        %tmp = tail call int %tmp( int 3 )
        ret void
}

_bar:
        subl $12, %esp
        movl L_foo$non_lazy_ptr, %eax
        movl (%eax), %eax
        call *%eax
        addl $12, %esp
        ret

The current isel scheme will not allow the load to be folded into the call
since the load's chain result is read by the callseq_start.

//===---------------------------------------------------------------------===//

Don't forget to find a way to squash noop truncates in the JIT environment.

//===---------------------------------------------------------------------===//

Implement anyext in the same manner as truncate so that it can be
eliminated.

//===---------------------------------------------------------------------===//

How about implementing truncate / anyext as a property of a machine
instruction operand? i.e. print as a 32-bit super-class register / 16-bit
sub-class register. Do this for the cases where a truncate / anyext is
guaranteed to be eliminated. For IA32 that is truncate from 32 to 16 and
anyext from 16 to 32.

//===---------------------------------------------------------------------===//

For this:

int test(int a)
{
  return a * 3;
}

we currently emit:

        imull $3, 4(%esp), %eax

Perhaps what we should really generate instead is the following (is imull
three or four cycles?). Note that ICC generates:

        movl 4(%esp), %eax
        leal (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity. The former is
more "complex" because it folds a load, so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
should always try to match LEA first since the LEA matching code does some
estimation to determine whether the match is profitable.

However, if we care more about code size, then imull is better. It's two bytes
shorter than movl + leal.

//===---------------------------------------------------------------------===//

Implement CTTZ and CTLZ with bsf and bsr.

//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.

//===---------------------------------------------------------------------===//

We should handle __attribute__ ((__visibility__ ("hidden"))).

//===---------------------------------------------------------------------===//

Consider:

int foo(int *a, int t) {
  int x;
  for (x=0; x<40; ++x)
    t = t + a[x] + x;
  return t;
}

We generate:

LBB1_1: #cond_true
        movl %ecx, %esi
        movl (%edx,%eax,4), %edi
        movl %esi, %ecx
        addl %edi, %ecx
        addl %eax, %ecx
        incl %eax
        cmpl $40, %eax
        jne LBB1_1      #cond_true

GCC generates:

L2:
        addl (%ecx,%edx,4), %eax
        addl %edx, %eax
        addl $1, %edx
        cmpl $40, %edx
        jne L2

Smells like a register coalescing/reassociation issue.

//===---------------------------------------------------------------------===//

Use cpuid to auto-detect CPU features such as SSE, SSE2, and SSE3.
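
A hedged sketch of the detection (CPUID leaf 1: EDX bit 25 = SSE, EDX bit 26
= SSE2, ECX bit 0 = SSE3; assumes GCC-style inline asm and a CPU that has
CPUID):

/* Read the leaf-1 feature words and test the SSE family bits. */
static void cpuid1(unsigned *ecx, unsigned *edx) {
  unsigned eax, ebx;
  __asm__("cpuid" : "=a"(eax), "=b"(ebx), "=c"(*ecx), "=d"(*edx) : "a"(1));
}
int has_sse(void)  { unsigned c, d; cpuid1(&c, &d); return (d >> 25) & 1; }
int has_sse2(void) { unsigned c, d; cpuid1(&c, &d); return (d >> 26) & 1; }
int has_sse3(void) { unsigned c, d; cpuid1(&c, &d); return (c >> 0) & 1; }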