Blame - lib/Target/X86/README.txt - fp2-dev/platform/external/llvm


Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	1	//===---------------------------------------------------------------------===//
				2	// Random ideas for the X86 backend.
				3	//===---------------------------------------------------------------------===//
				4
				5	Missing features:
				6	- Support for SSE4: http://www.intel.com/software/penryn
				7	http://softwarecommunity.intel.com/isn/Downloads/Intel%20SSE4%20Programming%20Reference.pdf
				8	- support for 3DNow!
				9	- weird abis?
				10
				11	//===---------------------------------------------------------------------===//
				12
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	13	CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move. The X86
				14	backend knows how to three-addressify this shift, but it appears the register
				15	allocator isn't even asking it to do so in this case. We should investigate
				16	why this isn't happening; it could have a significant impact on other important
				17	cases for X86 as well.
				18
				19	//===---------------------------------------------------------------------===//
				20
				21	This should be one DIV/IDIV instruction, not a libcall:
				22
				23	unsigned test(unsigned long long X, unsigned Y) {
				24	return X/Y;
				25	}
				26
				27	This can be done trivially with a custom legalizer. What about overflow
				28	though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
				29
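On the overflow question: DIV with a 32-bit operand divides the 64-bit value in
EDX:EAX and raises a divide error when the quotient does not fit in 32 bits, so
a single DIV is only safe when the high half of X is smaller than Y. A minimal
C sketch of that guard, assuming a hypothetical helper name (div_fits_in_32)
that is not part of the backend:

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 when x / y fits in 32 bits, i.e. when a
   single 32-bit DIV on EDX:EAX would not raise a divide error. */
int div_fits_in_32(uint64_t x, uint32_t y) {
    uint32_t hi = (uint32_t)(x >> 32);  /* the half that would sit in EDX */
    return y != 0 && hi < y;            /* quotient < 2^32 iff hi < y */
}
```

Since test() truncates X/Y to unsigned, the cases where this returns 0 are the
ones that would still need the libcall (or a two-step division).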
				30	//===---------------------------------------------------------------------===//
				31
				32	Improvements to the multiply -> shift/add algorithm:
				33	http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
				34
				35	//===---------------------------------------------------------------------===//
				36
				37	Improve code like this (occurs fairly frequently, e.g. in LLVM):
				38	long long foo(int x) { return 1LL << x; }
				39
				40	http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
				41	http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
				42	http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html
				43
				44	Another useful one would be ~0ULL >> X and ~0ULL << X.
				45
				46	One better solution for 1LL << x is:
				47	xorl %eax, %eax
				48	xorl %edx, %edx
				49	testb $32, %cl
				50	sete %al
				51	setne %dl
				52	sall %cl, %eax
				53	sall %cl, %edx
				54
				55	But that requires good 8-bit subreg support.
				56
				57	64-bit shifts (in general) expand to really bad code. Instead of using
				58	cmovs, we should expand to a conditional branch like GCC produces.
				59
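The sete/setne sequence above can be modelled in C as follows. This is only a
sketch of the idea (the name shl_one is hypothetical); it relies on the fact
that a 32-bit sall uses only the low five bits of %cl:

```c
#include <stdint.h>

/* Models the xor/test/sete/setne/sall sequence for 1LL << x, 0 <= x < 64. */
uint64_t shl_one(unsigned x) {
    uint32_t lo = (x & 32) == 0;   /* sete  %al: the bit lands in the low word  */
    uint32_t hi = (x & 32) != 0;   /* setne %dl: the bit lands in the high word */
    lo <<= (x & 31);               /* sall %cl, %eax (count taken mod 32) */
    hi <<= (x & 31);               /* sall %cl, %edx */
    return ((uint64_t)hi << 32) | lo;   /* result in %edx:%eax */
}
```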
				60	//===---------------------------------------------------------------------===//
				61
				62	Compile this:
				63	_Bool f(_Bool a) { return a!=1; }
				64
				65	into:
				66	movzbl %dil, %eax
				67	xorl $1, %eax
				68	ret
				69
				70	//===---------------------------------------------------------------------===//
				71
				72	Some isel ideas:
				73
				74	1. Dynamic programming based approach when compile time is not an
				75	issue.
				76	2. Code duplication (addressing mode) during isel.
				77	3. Other ideas from "Register-Sensitive Selection, Duplication, and
				78	Sequencing of Instructions".
				79	4. Scheduling for reduced register pressure. E.g. "Minimum Register
				80	Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
				81	and other related papers.
				82	http://citeseer.ist.psu.edu/govindarajan01minimum.html
				83
				84	//===---------------------------------------------------------------------===//
				85
				86	Should we promote i16 to i32 to avoid partial register update stalls?
				87
				88	//===---------------------------------------------------------------------===//
				89
				90	Leave any_extend as a pseudo instruction and hint to the register
				91	allocator. Delay codegen until post register allocation.
Evan Cheng	fdbb667	2007-10-12 18:22:55 +0000	[diff] [blame]	92	Note. any_extend is now turned into an INSERT_SUBREG. We still need to teach
				93	the coalescer how to deal with it though.
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	94
				95	//===---------------------------------------------------------------------===//
				96
				97	Count leading zeros and count trailing zeros:
				98
				99	int clz(int X) { return __builtin_clz(X); }
				100	int ctz(int X) { return __builtin_ctz(X); }
				101
				102	$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
				103	clz:
				104	bsr %eax, DWORD PTR [%esp+4]
				105	xor %eax, 31
				106	ret
				107	ctz:
				108	bsf %eax, DWORD PTR [%esp+4]
				109	ret
				110
				111	however, check that these are defined for 0 and 32. Our intrinsics are, GCC's
				112	aren't.
				113
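The "xor 31" works because bsr yields the index of the highest set bit, and
31 - index equals 31 ^ index for any index in 0..31. A software sketch of that
identity (bsr32 and clz32 are hypothetical names; returning 32 for a zero input
is an assumption about the semantics we want, since bsr itself leaves its
destination undefined for zero):

```c
#include <stdint.h>

/* Software model of bsr: index of the highest set bit; x must be nonzero. */
static unsigned bsr32(uint32_t x) {
    unsigned i = 0;
    while (x >>= 1)
        i++;
    return i;
}

/* clz built the way the bsr/xor sequence does it, with zero handled apart. */
unsigned clz32(uint32_t x) {
    return x ? (bsr32(x) ^ 31u) : 32u;
}
```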
				114	Another example (use predsimplify to eliminate a select):
				115
				116	int foo (unsigned long j) {
				117	if (j)
				118	return __builtin_ffs (j) - 1;
				119	else
				120	return 0;
				121	}
				122
				123	//===---------------------------------------------------------------------===//
				124
				125	It appears icc uses push for parameter passing. Need to investigate.
				126
				127	//===---------------------------------------------------------------------===//
				128
				129	Only use inc/neg/not instructions on processors where they are faster than
				130	add/sub/xor. They are slower on the P4 due to only updating some processor
				131	flags.
				132
				133	//===---------------------------------------------------------------------===//
				134
				135	The instruction selector sometimes misses folding a load into a compare. The
				136	pattern is written as (cmp reg, (load p)). Because the compare isn't
				137	commutative, it is not matched with the load on both sides. The dag combiner
				138	should be made smart enough to canonicalize the load into the RHS of a compare
				139	when it can invert the result of the compare for free.
				140
				141	//===---------------------------------------------------------------------===//
				142
				143	How about intrinsics? An example is:
				144	res = _mm_mulhi_epu16(A, _mm_mul_epu32(B, C));
				145
				146	compiles to
				147	pmuludq (%eax), %xmm0
				148	movl 8(%esp), %eax
				149	movdqa (%eax), %xmm1
				150	pmulhuw %xmm0, %xmm1
				151
				152	The transformation probably requires an X86-specific pass or a DAG combiner
				153	target-specific hook.
				154
				155	//===---------------------------------------------------------------------===//
				156
				157	In many cases, LLVM generates code like this:
				158
				159	_test:
				160	movl 8(%esp), %eax
				161	cmpl %eax, 4(%esp)
				162	setl %al
				163	movzbl %al, %eax
				164	ret
				165
				166	on some processors (which ones?), it is more efficient to do this:
				167
				168	_test:
				169	movl 8(%esp), %ebx
				170	xor %eax, %eax
				171	cmpl %ebx, 4(%esp)
				172	setl %al
				173	ret
				174
				175	Doing this correctly is tricky though, as the xor clobbers the flags.
				176
				177	//===---------------------------------------------------------------------===//
				178
				179	We should generate bts/btr/etc instructions on targets where they are cheap or
				180	when codesize is important. e.g., for:
				181
				182	void setbit(int *target, int bit) {
				183	*target |= (1 << bit);
				184	}
				185	void clearbit(int *target, int bit) {
				186	*target &= ~(1 << bit);
				187	}
				188
				189	//===---------------------------------------------------------------------===//
				190
				191	Instead of the following for memset char*, 1, 10:
				192
				193	movl $16843009, 4(%edx)
				194	movl $16843009, (%edx)
				195	movw $257, 8(%edx)
				196
				197	It might be better to generate
				198
				199	movl $16843009, %eax
				200	movl %eax, 4(%edx)
				201	movl %eax, (%edx)
				202	movw %ax, 8(%edx)
				203
				204	when we can spare a register. It reduces code size.
				205
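For reference, the immediates above are just the fill byte replicated:
16843009 is 0x01010101 and 257 is 0x0101, which is why the low 16 bits of the
register can serve the final 2-byte store. A small sketch (function names are
hypothetical):

```c
#include <stdint.h>

/* Replicate a fill byte across a 32-bit or 16-bit store value. */
uint32_t splat_byte32(uint8_t c) { return (uint32_t)c * 0x01010101u; }
uint16_t splat_byte16(uint8_t c) { return (uint16_t)(c * 0x0101u); }
```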
				206	//===---------------------------------------------------------------------===//
				207
				208	Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
				209	get this:
				210
				211	int %test1(int %X) {
				212	%Y = div int %X, 8
				213	ret int %Y
				214	}
				215
				216	_test1:
				217	movl 4(%esp), %eax
				218	movl %eax, %ecx
				219	sarl $31, %ecx
				220	shrl $29, %ecx
				221	addl %ecx, %eax
				222	sarl $3, %eax
				223	ret
				224
				225	GCC knows several different ways to codegen it, one of which is this:
				226
				227	_test1:
				228	movl 4(%esp), %eax
				229	cmpl $-1, %eax
				230	leal 7(%eax), %ecx
				231	cmovle %ecx, %eax
				232	sarl $3, %eax
				233	ret
				234
				235	which is probably slower, but it's interesting at least :)
				236
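Both sequences add a bias of 7 to negative inputs so that the arithmetic shift
rounds toward zero instead of toward negative infinity. A C sketch of the two
variants, using hypothetical names and an explicit asr3 helper so that it does
not depend on implementation-defined right shifts of negative values:

```c
#include <stdint.h>

/* Arithmetic shift right by 3 (floor division by 8). */
static int32_t asr3(int32_t v) {
    return v >= 0 ? (v >> 3) : ~((~v) >> 3);
}

/* Models sarl $31 / shrl $29 / addl / sarl $3. */
int32_t sdiv8_shift(int32_t x) {
    uint32_t sign = (x < 0) ? 0xFFFFFFFFu : 0u;  /* sarl $31 */
    uint32_t bias = sign >> 29;                  /* shrl $29: 7 or 0 */
    return asr3(x + (int32_t)bias);
}

/* Models cmpl $-1 / leal 7(%eax) / cmovle / sarl $3. The asm computes x + 7
   unconditionally; here it is only evaluated for negative x to avoid signed
   overflow in C. */
int32_t sdiv8_cmov(int32_t x) {
    int32_t biased = (x <= -1) ? x + 7 : x;
    return asr3(biased);
}
```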
				237	//===---------------------------------------------------------------------===//
				238
				239	The first BB of this code:
				240
				241	declare bool %foo()
				242	int %bar() {
				243	%V = call bool %foo()
				244	br bool %V, label %T, label %F
				245	T:
				246	ret int 1
				247	F:
				248	call bool %foo()
				249	ret int 12
				250	}
				251
				252	compiles to:
				253
				254	_bar:
				255	subl $12, %esp
				256	call L_foo$stub
				257	xorb $1, %al
				258	testb %al, %al
				259	jne LBB_bar_2 # F
				260
				261	It would be better to emit "cmp %al, 1" than a xor and test.
				262
				263	//===---------------------------------------------------------------------===//
				264
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	265	We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and rep/movsl.
				266	We should leave these as libcalls for everything over a much lower threshold,
				267	since libc is hand tuned for medium and large mem ops (avoiding RFO for large
				268	stores, TLB preheating, etc.).
				269
				270	//===---------------------------------------------------------------------===//
				271
				272	Optimize this into something reasonable:
				273	x * copysign(1.0, y) * copysign(1.0, z)
				274
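One candidate for "something reasonable" is to flip the sign bit of x by the
XOR of the sign bits of y and z, since multiplying by +/-1.0 only changes the
sign. The sketch below is an assumption about the target form, not an existing
transform; it assumes IEEE-754 binary64 doubles and ignores the question of
NaN sign bits. The name mul_signs is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* x * copysign(1.0, y) * copysign(1.0, z), done as integer sign-bit math. */
double mul_signs(double x, double y, double z) {
    uint64_t xb, yb, zb;
    memcpy(&xb, &x, sizeof xb);
    memcpy(&yb, &y, sizeof yb);
    memcpy(&zb, &z, sizeof zb);
    xb ^= (yb ^ zb) & 0x8000000000000000ULL;  /* flip iff signs of y, z differ */
    memcpy(&x, &xb, sizeof x);
    return x;
}
```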
				275	//===---------------------------------------------------------------------===//
				276
				277	Optimize copysign(x, *y) to use an integer load from y.
				278
				279	//===---------------------------------------------------------------------===//
				280
				281	%X = weak global int 0
				282
				283	void %foo(int %N) {
				284	%N = cast int %N to uint
				285	%tmp.24 = setgt int %N, 0
				286	br bool %tmp.24, label %no_exit, label %return
				287
				288	no_exit:
				289	%indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
				290	%i.0.0 = cast uint %indvar to int
				291	volatile store int %i.0.0, int* %X
				292	%indvar.next = add uint %indvar, 1
				293	%exitcond = seteq uint %indvar.next, %N
				294	br bool %exitcond, label %return, label %no_exit
				295
				296	return:
				297	ret void
				298	}
				299
				300	compiles into:
				301
				302	.text
				303	.align 4
				304	.globl _foo
				305	_foo:
				306	movl 4(%esp), %eax
				307	cmpl $1, %eax
				308	jl LBB_foo_4 # return
				309	LBB_foo_1: # no_exit.preheader
				310	xorl %ecx, %ecx
				311	LBB_foo_2: # no_exit
				312	movl L_X$non_lazy_ptr, %edx
				313	movl %ecx, (%edx)
				314	incl %ecx
				315	cmpl %eax, %ecx
				316	jne LBB_foo_2 # no_exit
				317	LBB_foo_3: # return.loopexit
				318	LBB_foo_4: # return
				319	ret
				320
				321	We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
				322	rematerialization is implemented. This can be accomplished with 1) a target
				323	dependent LICM pass or 2) making SelectionDAG represent the whole function.
				324
				325	//===---------------------------------------------------------------------===//
				326
				327	The following tests perform worse with LSR:
				328
				329	lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesort.
				330
				331	//===---------------------------------------------------------------------===//
				332
				333	We are generating far worse code than gcc:
				334
				335	volatile short X, Y;
				336
				337	void foo(int N) {
				338	int i;
				339	for (i = 0; i < N; i++) { X = i; Y = i*4; }
				340	}
				341
Evan Cheng	27a820a	2007-10-26 01:56:11 +0000	[diff] [blame]	342	LBB1_1: # entry.bb_crit_edge
				343	xorl %ecx, %ecx
				344	xorw %dx, %dx
				345	LBB1_2: # bb
				346	movl L_X$non_lazy_ptr, %esi
				347	movw %cx, (%esi)
				348	movl L_Y$non_lazy_ptr, %esi
				349	movw %dx, (%esi)
				350	addw $4, %dx
				351	incl %ecx
				352	cmpl %eax, %ecx
				353	jne LBB1_2 # bb
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	354
				355	vs.
				356
				357	xorl %edx, %edx
				358	movl L_X$non_lazy_ptr-"L00000000001$pb"(%ebx), %esi
				359	movl L_Y$non_lazy_ptr-"L00000000001$pb"(%ebx), %ecx
				360	L4:
				361	movw %dx, (%esi)
				362	leal 0(,%edx,4), %eax
				363	movw %ax, (%ecx)
				364	addl $1, %edx
				365	cmpl %edx, %edi
				366	jne L4
				367
Evan Cheng	27a820a	2007-10-26 01:56:11 +0000	[diff] [blame]	368	This is due to the lack of post regalloc LICM.
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	369
				370	//===---------------------------------------------------------------------===//
				371
				372	Teach the coalescer to coalesce vregs of different register classes. e.g. FR32 /
				373	FR64 to VR128.
				374
				375	//===---------------------------------------------------------------------===//
				376
				377	mov $reg, 48(%esp)
				378	...
				379	leal 48(%esp), %eax
				380	mov %eax, (%esp)
				381	call _foo
				382
				383	Obviously it would have been better for the first mov (or any op) to store
				384	directly %esp[0] if there are no other uses.
				385
				386	//===---------------------------------------------------------------------===//
				387
				388	Adding to the list of cmp / test poor codegen issues:
				389
				390	int test(__m128 A, __m128 B) {
				391	if (_mm_comige_ss(A, B))
				392	return 3;
				393	else
				394	return 4;
				395	}
				396
				397	_test:
				398	movl 8(%esp), %eax
				399	movaps (%eax), %xmm0
				400	movl 4(%esp), %eax
				401	movaps (%eax), %xmm1
				402	comiss %xmm0, %xmm1
				403	setae %al
				404	movzbl %al, %ecx
				405	movl $3, %eax
				406	movl $4, %edx
				407	cmpl $0, %ecx
				408	cmove %edx, %eax
				409	ret
				410
				411	Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There
				412	are a number of issues. 1) We are introducing a setcc between the result of the
				413	intrinsic call and select. 2) The intrinsic is expected to produce an i32 value
				414	so an any_extend (which becomes a zero extend) is added.
				415
				416	We probably need some kind of target DAG combine hook to fix this.
				417
				418	//===---------------------------------------------------------------------===//
				419
				420	We generate significantly worse code for this than GCC:
				421	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
				422	http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701
				423
				424	There is also one case we do worse on PPC.
				425
				426	//===---------------------------------------------------------------------===//
				427
				428	If shorter, we should use things like:
				429	movzwl %ax, %eax
				430	instead of:
				431	andl $65535, %eax
				432
				433	The former can also be used when the two-addressy nature of the 'and' would
				434	require a copy to be inserted (in X86InstrInfo::convertToThreeAddress).
				435
				436	//===---------------------------------------------------------------------===//
				437
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	438	Consider this:
				439
				440	typedef struct pair { float A, B; } pair;
				441	void pairtest(pair P, float *FP) {
				442	*FP = P.A+P.B;
				443	}
				444
				445	We currently generate this code with llvmgcc4:
				446
				447	_pairtest:
				448	movl 8(%esp), %eax
				449	movl 4(%esp), %ecx
				450	movd %eax, %xmm0
				451	movd %ecx, %xmm1
				452	addss %xmm0, %xmm1
				453	movl 12(%esp), %eax
				454	movss %xmm1, (%eax)
				455	ret
				456
				457	we should be able to generate:
				458	_pairtest:
				459	movss 4(%esp), %xmm0
				460	movl 12(%esp), %eax
				461	addss 8(%esp), %xmm0
				462	movss %xmm0, (%eax)
				463	ret
				464
				465	The issue is that llvmgcc4 is forcing the struct to memory, then passing it as
				466	integer chunks. It does this so that structs like {short,short} are passed in
				467	a single 32-bit integer stack slot. We should handle the safe cases above much
				468	more nicely, while still handling the hard cases.
				469
				470	While true in general, in this specific case we could do better by promoting
				471	load int + bitcast to float -> load float. This basically needs alignment info;
				472	the code is already implemented (but disabled) in dag combine.
				473
				474	//===---------------------------------------------------------------------===//
				475
				476	Another instruction selector deficiency:
				477
				478	void %bar() {
				479	%tmp = load int (int)** %foo
				480	%tmp = tail call int %tmp( int 3 )
				481	ret void
				482	}
				483
				484	_bar:
				485	subl $12, %esp
				486	movl L_foo$non_lazy_ptr, %eax
				487	movl (%eax), %eax
				488	call *%eax
				489	addl $12, %esp
				490	ret
				491
				492	The current isel scheme will not allow the load to be folded in the call since
				493	the load's chain result is read by the callseq_start.
				494
				495	//===---------------------------------------------------------------------===//
				496
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	497	For this:
				498
				499	int test(int a)
				500	{
				501	return a * 3;
				502	}
				503
				504	We currently emit:
				505	imull $3, 4(%esp), %eax
				506
				507	Perhaps this is what we really should generate? Is imull three or four
				508	cycles? Note: ICC generates this:
				509	movl 4(%esp), %eax
				510	leal (%eax,%eax,2), %eax
				511
				512	The current instruction priority is based on pattern complexity. The former is
				513	more "complex" because it folds a load so the latter will not be emitted.
				514
				515	Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
				516	should always try to match LEA first since the LEA matching code does some
				517	estimate to determine whether the match is profitable.
				518
				519	However, if we care more about code size, then imull is better. It's two bytes
				520	shorter than movl + leal.
				521
				522	//===---------------------------------------------------------------------===//
				523
Chris Lattner	a86af9a	2007-08-11 18:19:07 +0000	[diff] [blame]	524	Implement CTTZ, CTLZ with bsf and bsr. GCC produces:
				525
				526	int ctz_(unsigned X) { return __builtin_ctz(X); }
				527	int clz_(unsigned X) { return __builtin_clz(X); }
				528	int ffs_(unsigned X) { return __builtin_ffs(X); }
				529
				530	_ctz_:
				531	bsfl 4(%esp), %eax
				532	ret
				533	_clz_:
				534	bsrl 4(%esp), %eax
				535	xorl $31, %eax
				536	ret
				537	_ffs_:
				538	movl $-1, %edx
				539	bsfl 4(%esp), %eax
				540	cmove %edx, %eax
				541	addl $1, %eax
				542	ret
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	543
				544	//===---------------------------------------------------------------------===//
				545
				546	It appears gcc places string data with linkonce linkage in
				547	.section __TEXT,__const_coal,coalesced instead of
				548	.section __DATA,__const_coal,coalesced.
				549	Take a look at darwin.h, there are other Darwin assembler directives that we
				550	do not make use of.
				551
				552	//===---------------------------------------------------------------------===//
				553
				554	int %foo(int* %a, int %t) {
				555	entry:
				556	br label %cond_true
				557
				558	cond_true: ; preds = %cond_true, %entry
				559	%x.0.0 = phi int [ 0, %entry ], [ %tmp9, %cond_true ]
				560	%t_addr.0.0 = phi int [ %t, %entry ], [ %tmp7, %cond_true ]
				561	%tmp2 = getelementptr int* %a, int %x.0.0
				562	%tmp3 = load int* %tmp2 ; <int> [#uses=1]
				563	%tmp5 = add int %t_addr.0.0, %x.0.0 ; <int> [#uses=1]
				564	%tmp7 = add int %tmp5, %tmp3 ; <int> [#uses=2]
				565	%tmp9 = add int %x.0.0, 1 ; <int> [#uses=2]
				566	%tmp = setgt int %tmp9, 39 ; <bool> [#uses=1]
				567	br bool %tmp, label %bb12, label %cond_true
				568
				569	bb12: ; preds = %cond_true
				570	ret int %tmp7
				571	}
				572
				573	is pessimized by -loop-reduce and -indvars
				574
				575	//===---------------------------------------------------------------------===//
				576
				577	u32 to float conversion improvement:
				578
				579	float uint32_2_float( unsigned u ) {
				580	float fl = (int) (u & 0xffff);
				581	float fh = (int) (u >> 16);
				582	fh *= 0x1.0p16f;
				583	return fh + fl;
				584	}
				585
				586	00000000 subl $0x04,%esp
				587	00000003 movl 0x08(%esp,1),%eax
				588	00000007 movl %eax,%ecx
				589	00000009 shrl $0x10,%ecx
				590	0000000c cvtsi2ss %ecx,%xmm0
				591	00000010 andl $0x0000ffff,%eax
				592	00000015 cvtsi2ss %eax,%xmm1
				593	00000019 mulss 0x00000078,%xmm0
				594	00000021 addss %xmm1,%xmm0
				595	00000025 movss %xmm0,(%esp,1)
				596	0000002a flds (%esp,1)
				597	0000002d addl $0x04,%esp
				598	00000030 ret
				599
				600	//===---------------------------------------------------------------------===//
				601
				602	When using the fastcc ABI, align stack slots of arguments of type double on an
				603	8-byte boundary to improve performance.
				604
				605	//===---------------------------------------------------------------------===//
				606
				607	Codegen:
				608
				609	int f(int a, int b) {
				610	if (a == 4 || a == 6)
				611	b++;
				612	return b;
				613	}
				614
				615
				616	as:
				617
				618	or eax, 2
				619	cmp eax, 6
				620	jz label
				621
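The or/cmp form works because 4 and 6 differ only in bit 1, so forcing that bit
on maps both values (and no others) to 6. A C sketch of the equivalence, with
hypothetical function names:

```c
/* The test as written in the source. */
int is_4_or_6_branchy(int a) { return a == 4 || a == 6; }

/* The same test as a single or + compare. */
int is_4_or_6_or(int a) { return (a | 2) == 6; }
```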
				622	//===---------------------------------------------------------------------===//
				623
				624	GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
				625	simplifications for integer "x cmp y ? a : b". For example, instead of:
				626
				627	int G;
				628	void f(int X, int Y) {
				629	G = X < 0 ? 14 : 13;
				630	}
				631
				632	compiling to:
				633
				634	_f:
				635	movl $14, %eax
				636	movl $13, %ecx
				637	movl 4(%esp), %edx
				638	testl %edx, %edx
				639	cmovl %eax, %ecx
				640	movl %ecx, _G
				641	ret
				642
				643	it could be:
				644	_f:
				645	movl 4(%esp), %eax
				646	sarl $31, %eax
				647	notl %eax
				648	addl $14, %eax
				649	movl %eax, _G
				650	ret
				651
				652	etc.
				653
Chris Lattner	e7037c2	2007-11-02 17:04:20 +0000	[diff] [blame]	654	Another is:
				655	int usesbb(unsigned int a, unsigned int b) {
				656	return (a < b ? -1 : 0);
				657	}
				658	to:
				659	_usesbb:
				660	movl 8(%esp), %eax
				661	cmpl %eax, 4(%esp)
				662	sbbl %eax, %eax
				663	ret
				664
				665	instead of:
				666	_usesbb:
				667	xorl %eax, %eax
				668	movl 8(%esp), %ecx
				669	cmpl %ecx, 4(%esp)
				670	movl $4294967295, %ecx
				671	cmovb %ecx, %eax
				672	ret
				673
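Both rewrites above turn the comparison into an all-ones or all-zeros mask and
then do plain arithmetic on it. A C sketch of the two identities (function
names are hypothetical; the sign mask is written as a conditional so the C does
not rely on arithmetic right shift of negative values):

```c
/* X < 0 ? 14 : 13, modelled on sarl $31 / notl / addl $14. */
int select_14_13(int x) {
    int mask = (x < 0) ? -1 : 0;   /* what sarl $31 leaves in %eax */
    return ~mask + 14;             /* notl gives 0 or -1; addl $14 gives 14 or 13 */
}

/* a < b ? -1 : 0, modelled on cmpl / sbbl %eax, %eax. */
int usesbb_mask(unsigned a, unsigned b) {
    return -(int)(a < b);          /* sbb of a register with itself is -CF */
}
```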
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	674	//===---------------------------------------------------------------------===//
				675
				676	Currently we don't have elimination of redundant stack manipulations. Consider
				677	the code:
				678
				679	int %main() {
				680	entry:
				681	call fastcc void %test1( )
				682	call fastcc void %test2( sbyte* cast (void ()* %test1 to sbyte*) )
				683	ret int 0
				684	}
				685
				686	declare fastcc void %test1()
				687
				688	declare fastcc void %test2(sbyte*)
				689
				690
				691	This currently compiles to:
				692
				693	subl $16, %esp
				694	call _test1
				695	addl $12, %esp
				696	subl $16, %esp
				697	movl $_test1, (%esp)
				698	call _test2
				699	addl $12, %esp
				700
				701	The add/sub pair is really unneeded here.
				702
				703	//===---------------------------------------------------------------------===//
				704
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	705	Consider the expansion of:
				706
				707	uint %test3(uint %X) {
				708	%tmp1 = rem uint %X, 255
				709	ret uint %tmp1
				710	}
				711
				712	Currently it compiles to:
				713
				714	...
				715	movl $2155905153, %ecx
				716	movl 8(%esp), %esi
				717	movl %esi, %eax
				718	mull %ecx
				719	...
				720
				721	This could be "reassociated" into:
				722
				723	movl $2155905153, %eax
				724	movl 8(%esp), %ecx
				725	mull %ecx
				726
				727	to avoid the copy. In fact, the existing two-address stuff would do this
				728	except that mul isn't a commutative 2-addr instruction. I guess this has
				729	to be done at isel time based on the #uses to mul?
				730
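For context, 2155905153 is 0x80808081, i.e. ceil(2^39 / 255), the usual
multiply-by-reciprocal constant for unsigned division by 255. The elided parts
of the listing are not reproduced here; the sketch below only shows the
arithmetic the mull participates in, under a hypothetical name:

```c
#include <stdint.h>

/* x % 255 via multiply-high: q = (x * 0x80808081) >> 39, r = x - q * 255. */
uint32_t urem255(uint32_t x) {
    uint32_t q = (uint32_t)(((uint64_t)x * 2155905153u) >> 39);
    return x - q * 255u;
}
```

Multiplication is commutative here, which is why either operand could be the
one placed in %eax before the mull.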
				731	//===---------------------------------------------------------------------===//
				732
				733	Make sure the instruction which starts a loop does not cross a cacheline
				734	boundary. This requires knowing the exact length of each machine instruction.
				735	That is somewhat complicated, but doable. Example 256.bzip2:
				736
				737	In the new trace, the hot loop has an instruction which crosses a cacheline
				738	boundary. In addition to potential cache misses, this can't help decoding as I
				739	imagine there has to be some kind of complicated decoder reset and realignment
				740	to grab the bytes from the next cacheline.
				741
				742	532 532 0x3cfc movb 1809(%esp, %esi), %bl <<<--- spans 2 64 byte lines
				743	942 942 0x3d03 movb %dh, 1809(%esp, %esi)
				744	937 937 0x3d0a incl %esi
				745	3 3 0x3d0b cmpb %bl, %dl
				746	27 27 0x3d0d jnz 0x000062db <main+11707>
				747
				748	//===---------------------------------------------------------------------===//
				749
				750	In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.
				751
				752	//===---------------------------------------------------------------------===//
				753
				754	This could be a single 16-bit load.
				755
				756	int f(char *p) {
				757	if ((p[0] == 1) & (p[1] == 2)) return 1;
				758	return 0;
				759	}
				760
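A C sketch of what the single 16-bit load would compute. It assumes a
little-endian target (true for X86), so p[0] is the low byte and the pair
(1, 2) reads as 0x0201; the function names are hypothetical and memcpy stands
in for an unaligned 16-bit load:

```c
#include <stdint.h>
#include <string.h>

/* The test as written in the source: two byte loads and two compares. */
int f_bytes(const char *p) {
    return (p[0] == 1) & (p[1] == 2);
}

/* The same test as one 16-bit load and one compare (little-endian only). */
int f_load16(const char *p) {
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v == 0x0201;
}
```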
				761	//===---------------------------------------------------------------------===//
				762
				763	We should inline lrintf and probably other libc functions.
				764
				765	//===---------------------------------------------------------------------===//
				766
				767	Start using the flags more. For example, compile:
				768
				769	int add_zf(int *x, int y, int a, int b) {
				770	if ((*x += y) == 0)
				771	return a;
				772	else
				773	return b;
				774	}
				775
				776	to:
				777	addl %esi, (%rdi)
				778	movl %edx, %eax
				779	cmovne %ecx, %eax
				780	ret
				781	instead of:
				782
				783	_add_zf:
				784	addl (%rdi), %esi
				785	movl %esi, (%rdi)
				786	testl %esi, %esi
				787	cmove %edx, %ecx
				788	movl %ecx, %eax
				789	ret
				790
				791	and:
				792
				793	int add_zf(int *x, int y, int a, int b) {
				794	if ((*x + y) < 0)
				795	return a;
				796	else
				797	return b;
				798	}
				799
				800	to:
				801
				802	add_zf:
				803	addl (%rdi), %esi
				804	movl %edx, %eax
				805	cmovns %ecx, %eax
				806	ret
				807
				808	instead of:
				809
				810	_add_zf:
				811	addl (%rdi), %esi
				812	testl %esi, %esi
				813	cmovs %edx, %ecx
				814	movl %ecx, %eax
				815	ret
				816
				817	//===---------------------------------------------------------------------===//
				818
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	819	These two functions have identical effects:
				820
				821	unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
				822	unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
				823
				824	We currently compile them to:
				825
				826	_f:
				827	movl 4(%esp), %eax
				828	movl %eax, %ecx
				829	incl %ecx
				830	movl 8(%esp), %edx
				831	cmpl %edx, %ecx
				832	jne LBB1_2 #UnifiedReturnBlock
				833	LBB1_1: #cond_true
				834	addl $2, %eax
				835	ret
				836	LBB1_2: #UnifiedReturnBlock
				837	movl %ecx, %eax
				838	ret
				839	_f2:
				840	movl 4(%esp), %eax
				841	movl %eax, %ecx
				842	incl %ecx
				843	cmpl 8(%esp), %ecx
				844	sete %cl
				845	movzbl %cl, %ecx
				846	leal 1(%ecx,%eax), %eax
				847	ret
				848
				849	both of which are inferior to GCC's:
				850
				851	_f:
				852	movl 4(%esp), %edx
				853	leal 1(%edx), %eax
				854	addl $2, %edx
				855	cmpl 8(%esp), %eax
				856	cmove %edx, %eax
				857	ret
				858	_f2:
				859	movl 4(%esp), %eax
				860	addl $1, %eax
				861	xorl %edx, %edx
				862	cmpl 8(%esp), %eax
				863	sete %dl
				864	addl %edx, %eax
				865	ret
				866
				867	//===---------------------------------------------------------------------===//
				868
				869	This code:
				870
				871	void test(int X) {
				872	if (X) abort();
				873	}
				874
				875	is currently compiled to:
				876
				877	_test:
				878	subl $12, %esp
				879	cmpl $0, 16(%esp)
				880	jne LBB1_1
				881	addl $12, %esp
				882	ret
				883	LBB1_1:
				884	call L_abort$stub
				885
				886	It would be better to produce:
				887
				888	_test:
				889	subl $12, %esp
				890	cmpl $0, 16(%esp)
				891	jne L_abort$stub
				892	addl $12, %esp
				893	ret
				894
				895	This can be applied to any no-return function call that takes no arguments etc.
				896	Alternatively, the stack save/restore logic could be shrink-wrapped, producing
				897	something like this:
				898
				899	_test:
				900	cmpl $0, 4(%esp)
				901	jne LBB1_1
				902	ret
				903	LBB1_1:
				904	subl $12, %esp
				905	call L_abort$stub
				906
				907	Both are useful in different situations. Finally, it could be shrink-wrapped
				908	and tail called, like this:
				909
				910	_test:
				911	cmpl $0, 4(%esp)
				912	jne LBB1_1
				913	ret
				914	LBB1_1:
				915	pop %eax # realign stack.
				916	call L_abort$stub
				917
				918	Though this probably isn't worth it.
				919
				920	//===---------------------------------------------------------------------===//
				921
				922	We need to teach the codegen to convert two-address INC instructions to LEA
Chris Lattner	0d64ec3	2007-08-11 18:16:46 +0000	[diff] [blame]	923	when the flags are dead (likewise dec). For example, on X86-64, compile:
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	924
				925	int foo(int A, int B) {
				926	return A+1;
				927	}
				928
				929	to:
				930
				931	_foo:
				932	leal 1(%edi), %eax
				933	ret
				934
				935	instead of:
				936
				937	_foo:
				938	incl %edi
				939	movl %edi, %eax
				940	ret
				941
				942	Another example is:
				943
				944	;; X's live range extends beyond the shift, so the register allocator
				945	;; cannot coalesce it with Y. Because of this, a copy needs to be
				946	;; emitted before the shift to save the register value before it is
				947	;; clobbered. However, this copy is not needed if the register
				948	;; allocator turns the shift into an LEA. This also occurs for ADD.
				949
				950	; Check that the shift gets turned into an LEA.
				951	; RUN: llvm-upgrade < %s | llvm-as | llc -march=x86 -x86-asm-syntax=intel | \
				952	; RUN: not grep {mov E.X, E.X}
				953
				954	%G = external global int
				955
				956	int %test1(int %X, int %Y) {
				957	%Z = add int %X, %Y
				958	volatile store int %Y, int* %G
				959	volatile store int %Z, int* %G
				960	ret int %X
				961	}
				962
				963	int %test2(int %X) {
				964	%Z = add int %X, 1 ;; inc
				965	volatile store int %Z, int* %G
				966	ret int %X
				967	}
				968
				969	//===---------------------------------------------------------------------===//
				970
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	971	Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
				972	a neg instead of a sub instruction. Consider:
				973
				974	int test(char X) { return 7-X; }
				975
				976	we currently produce:
				977	_test:
				978	movl $7, %eax
				979	movsbl 4(%esp), %ecx
				980	subl %ecx, %eax
				981	ret
				982
				983	We would use one fewer register if codegen'd as:
				984
				985	movsbl 4(%esp), %eax
				986	neg %eax
				987	add $7, %eax
				988	ret
				989
				990	Note that this isn't beneficial if the load can be folded into the sub. In
				991	this case, we want a sub:
				992
				993	int test(int X) { return 7-X; }
				994	_test:
				995	movl $7, %eax
				996	subl 4(%esp), %eax
				997	ret
				998
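The neg/add form relies on the identity 7 - x == (-x) + 7, which holds in
two's-complement (modular) arithmetic. A minimal C sketch of the two forms,
using unsigned arithmetic so that wraparound is well defined; the function
names are illustrative only and do not come from the backend:

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward form: materialize 7, then subtract (movl $7 / subl). */
static uint32_t sub_from_const(uint32_t x) {
    return 7u - x;
}

/* Rewritten form: negate in place, then add the constant (neg / add $7). */
static uint32_t neg_then_add(uint32_t x) {
    uint32_t r = 0u - x; /* neg */
    return r + 7u;       /* add $7 */
}

/* Spot-check that both forms agree, including at the wraparound edges. */
static void check_neg_add(void) {
    const uint32_t samples[] = {0u, 1u, 7u, 8u,
                                0x7fffffffu, 0x80000000u, 0xffffffffu};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(sub_from_const(samples[i]) == neg_then_add(samples[i]));
}
```

The rewritten form needs only the register that already holds x, which is
where the saved register comes from.
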
				999	//===---------------------------------------------------------------------===//
				1000
				1001	For code like:
				1002	phi (undef, x)
				1003
				1004	We get an implicit def on the undef side. If the phi is spilled, we then get:
				1005	implicitdef xmm1
				1006	store xmm1 -> stack
				1007
				1008	It should be possible to teach the x86 backend to "fold" the store into the
				1009	implicitdef, which just deletes the implicit def.
				1010
				1011	These instructions should go away:
				1012	#IMPLICIT_DEF %xmm1
				1013	movaps %xmm1, 192(%esp)
				1014	movaps %xmm1, 224(%esp)
				1015	movaps %xmm1, 176(%esp)
Chris Lattner	a3c76a4	2007-08-03 00:17:42 +0000	[diff] [blame]	1016
				1017	//===---------------------------------------------------------------------===//
				1018
				1019	This is a "commutable two-address" register coalescing deficiency:
				1020
				1021	define <4 x float> @test1(<4 x float> %V) {
				1022	entry:
Chris Lattner	a86af9a	2007-08-11 18:19:07 +0000	[diff] [blame]	1023	%tmp8 = shufflevector <4 x float> %V, <4 x float> undef,
				1024	<4 x i32> < i32 3, i32 2, i32 1, i32 0 >
				1025	%add = add <4 x float> %tmp8, %V
Chris Lattner	a3c76a4	2007-08-03 00:17:42 +0000	[diff] [blame]	1026	ret <4 x float> %add
				1027	}
				1028
				1029	this codegens to:
				1030
				1031	_test1:
				1032	pshufd $27, %xmm0, %xmm1
				1033	addps %xmm0, %xmm1
				1034	movaps %xmm1, %xmm0
				1035	ret
				1036
				1037	instead of:
				1038
				1039	_test1:
				1040	pshufd $27, %xmm0, %xmm1
				1041	addps %xmm1, %xmm0
				1042	ret
				1043
Chris Lattner	32f6587	2007-08-20 02:14:33 +0000	[diff] [blame]	1044	//===---------------------------------------------------------------------===//
				1045
				1046	Leaf functions that require one 4-byte spill slot have a prolog like this:
				1047
				1048	_foo:
				1049	pushl %esi
				1050	subl $4, %esp
				1051	...
				1052	and an epilog like this:
				1053	addl $4, %esp
				1054	popl %esi
				1055	ret
				1056
				1057	It would be smaller, and potentially faster, to push eax on entry and to
				1058	pop into a dummy register instead of using addl/subl of esp. Just don't pop
				1059	into any return registers :)
				1060
				1061	//===---------------------------------------------------------------------===//
Chris Lattner	44b03cb	2007-08-23 15:22:07 +0000	[diff] [blame]	1062
				1063	The X86 backend should fold (branch (or (setcc, setcc))) into multiple
				1064	branches. We generate really poor code for:
				1065
				1066	double testf(double a) {
				1067	return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
				1068	}
				1069
				1070	For example, the entry BB is:
				1071
				1072	_testf:
				1073	subl $20, %esp
				1074	pxor %xmm0, %xmm0
				1075	movsd 24(%esp), %xmm1
				1076	ucomisd %xmm0, %xmm1
				1077	setnp %al
				1078	sete %cl
				1079	testb %cl, %al
				1080	jne LBB1_5 # UnifiedReturnBlock
				1081	LBB1_1: # cond_true
				1082
				1083
				1084	it would be better to replace the last four instructions with:
				1085
				1086	jp LBB1_1
				1087	je LBB1_5
				1088	LBB1_1:
				1089
				1090	We also codegen the inner ?: into a diamond:
				1091
				1092	cvtss2sd LCPI1_0(%rip), %xmm2
				1093	cvtss2sd LCPI1_1(%rip), %xmm3
				1094	ucomisd %xmm1, %xmm0
				1095	ja LBB1_3 # cond_true
				1096	LBB1_2: # cond_true
				1097	movapd %xmm3, %xmm2
				1098	LBB1_3: # cond_true
				1099	movapd %xmm2, %xmm0
				1100	ret
				1101
				1102	We should sink the load into xmm3 into the LBB1_2 block. This should
				1103	be pretty easy, and will nuke all the copies.
				1104
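The jp/je pair works because ucomisd sets ZF for both "equal" and "unordered"
results but sets PF only for unordered ones, so "a == 0.0" means ZF set and
PF clear. The C sketch below models those flags with a hypothetical struct
(not an LLVM type) and contrasts the setnp/sete/testb sequence with the
two-branch form:

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* Hypothetical model of the flags ucomisd produces when comparing a with b. */
struct flags { bool zf, pf, cf; };

static struct flags ucomisd_model(double a, double b) {
    struct flags f;
    if (isunordered(a, b)) {        /* either operand is NaN */
        f.zf = f.pf = f.cf = true;
    } else {
        f.zf = (a == b);
        f.pf = false;
        f.cf = (a < b);
    }
    return f;
}

/* Current code: setnp + sete + testb + jne. */
static bool equal_via_setcc(double a, double b) {
    struct flags f = ucomisd_model(a, b);
    bool np = !f.pf;   /* setnp */
    bool e = f.zf;     /* sete  */
    return np && e;    /* testb / jne */
}

/* Proposed code: jp skips the equality branch, then je takes it. */
static bool equal_via_branches(double a, double b) {
    struct flags f = ucomisd_model(a, b);
    if (f.pf) return false;  /* jp LBB1_1 */
    if (f.zf) return true;   /* je LBB1_5 */
    return false;            /* fall through to LBB1_1 */
}

static void check_branches(void) {
    const double samples[] = {0.0, -0.0, 1.0, -1.0, NAN};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(equal_via_setcc(samples[i], 0.0) ==
               equal_via_branches(samples[i], 0.0));
}
```

Both forms give the same answer for every input, NaN included; the branch
form just avoids materializing the two flag bytes.
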
				1105	//===---------------------------------------------------------------------===//
Chris Lattner	4084d49	2007-09-10 21:43:18 +0000	[diff] [blame]	1106
				1107	This:
				1108	#include <utility>
				1109	inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
				1110	{ return std::make_pair(a + b, a + b < a); }
				1111	bool no_overflow(unsigned a, unsigned b)
				1112	{ return !full_add(a, b).second; }
				1113
				1114	Should compile to:
				1115
				1116
				1117	_Z11no_overflowjj:
				1118	addl %edi, %esi
				1119	setae %al
				1120	ret
				1121
				1122	on x86-64, not:
				1123
				1124	__Z11no_overflowjj:
				1125	addl %edi, %esi
				1126	cmpl %edi, %esi
				1127	setae %al
				1128	movzbl %al, %eax
				1129	ret
				1130
				1131
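The shorter sequence is possible because the unsigned comparison a + b < a is
exactly the carry out of the add, which addl already leaves in CF. A C
rendering of the same idea, as a sketch: full_add mirrors the C++ above, and
carry_out is a hypothetical reference helper that derives the carry from a
64-bit add instead of a compare.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sum_carry { uint32_t sum; bool carry; };

/* Same computation as the C++ full_add: wrapped sum plus "sum < a". */
static struct sum_carry full_add(uint32_t a, uint32_t b) {
    struct sum_carry r;
    r.sum = a + b;
    r.carry = r.sum < a;
    return r;
}

/* Hypothetical reference: bit 32 of the exact sum is the carry flag. */
static bool carry_out(uint32_t a, uint32_t b) {
    return (((uint64_t)a + (uint64_t)b) >> 32) != 0;
}

static bool no_overflow(uint32_t a, uint32_t b) {
    return !full_add(a, b).carry; /* ideally addl + setae, with no cmpl */
}

static void check_carry(void) {
    const uint32_t samples[] = {0u, 1u, 0x7fffffffu, 0x80000000u, 0xffffffffu};
    const unsigned n = sizeof samples / sizeof samples[0];
    for (unsigned i = 0; i < n; ++i)
        for (unsigned j = 0; j < n; ++j)
            assert(full_add(samples[i], samples[j]).carry ==
                   carry_out(samples[i], samples[j]));
}
```
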
				1132	//===---------------------------------------------------------------------===//
Evan Cheng	35127a6	2007-09-10 22:16:37 +0000	[diff] [blame]	1133
				1134	Re-materialize MOV32r0 etc. with xor instead of changing them to moves if the
				1135	condition register is dead. xor reg reg is shorter than mov reg, #0.
Chris Lattner	a487bf7	2007-09-26 06:29:31 +0000	[diff] [blame]	1136
				1137	//===---------------------------------------------------------------------===//
				1138
				1139	We aren't matching RMW instructions aggressively
				1140	enough. Here's a reduced testcase (more in PR1160):
				1141
				1142	define void @test(i32* %huge_ptr, i32* %target_ptr) {
				1143	%A = load i32* %huge_ptr ; <i32> [#uses=1]
				1144	%B = load i32* %target_ptr ; <i32> [#uses=1]
				1145	%C = or i32 %A, %B ; <i32> [#uses=1]
				1146	store i32 %C, i32* %target_ptr
				1147	ret void
				1148	}
				1149
				1150	$ llvm-as < t.ll | llc -march=x86-64
				1151
				1152	_test:
				1153	movl (%rdi), %eax
				1154	orl (%rsi), %eax
				1155	movl %eax, (%rsi)
				1156	ret
				1157
				1158	That should be something like:
				1159
				1160	_test:
				1161	movl (%rdi), %eax
				1162	orl %eax, (%rsi)
				1163	ret
				1164
				1165	//===---------------------------------------------------------------------===//
				1166
Bill Wendling	7f436dd	2007-10-02 20:42:59 +0000	[diff] [blame]	1167	The following code:
				1168
Bill Wendling	c2036e3	2007-10-02 20:54:32 +0000	[diff] [blame]	1169	bb114.preheader: ; preds = %cond_next94
				1170	%tmp231232 = sext i16 %tmp62 to i32 ; <i32> [#uses=1]
				1171	%tmp233 = sub i32 32, %tmp231232 ; <i32> [#uses=1]
				1172	%tmp245246 = sext i16 %tmp65 to i32 ; <i32> [#uses=1]
				1173	%tmp252253 = sext i16 %tmp68 to i32 ; <i32> [#uses=1]
				1174	%tmp254 = sub i32 32, %tmp252253 ; <i32> [#uses=1]
				1175	%tmp553554 = bitcast i16* %tmp37 to i8* ; <i8*> [#uses=2]
				1176	%tmp583584 = sext i16 %tmp98 to i32 ; <i32> [#uses=1]
				1177	%tmp585 = sub i32 32, %tmp583584 ; <i32> [#uses=1]
				1178	%tmp614615 = sext i16 %tmp101 to i32 ; <i32> [#uses=1]
				1179	%tmp621622 = sext i16 %tmp104 to i32 ; <i32> [#uses=1]
				1180	%tmp623 = sub i32 32, %tmp621622 ; <i32> [#uses=1]
				1181	br label %bb114
				1182
				1183	produces:
				1184
Bill Wendling	7f436dd	2007-10-02 20:42:59 +0000	[diff] [blame]	1185	LBB3_5: # bb114.preheader
				1186	movswl -68(%ebp), %eax
				1187	movl $32, %ecx
				1188	movl %ecx, -80(%ebp)
				1189	subl %eax, -80(%ebp)
				1190	movswl -52(%ebp), %eax
				1191	movl %ecx, -84(%ebp)
				1192	subl %eax, -84(%ebp)
				1193	movswl -70(%ebp), %eax
				1194	movl %ecx, -88(%ebp)
				1195	subl %eax, -88(%ebp)
				1196	movswl -50(%ebp), %eax
				1197	subl %eax, %ecx
				1198	movl %ecx, -76(%ebp)
				1199	movswl -42(%ebp), %eax
				1200	movl %eax, -92(%ebp)
				1201	movswl -66(%ebp), %eax
				1202	movl %eax, -96(%ebp)
				1203	movw $0, -98(%ebp)
				1204
Chris Lattner	792bae5	2007-10-03 03:40:24 +0000	[diff] [blame]	1205	This appears to be bad because the RA is not folding the store to the stack
				1206	slot into the movl. The above instructions could be:
				1207	movl $32, -80(%ebp)
				1208	...
				1209	movl $32, -84(%ebp)
				1210	...
				1211	This seems like a cross between remat and spill folding.
				1212
Bill Wendling	c2036e3	2007-10-02 20:54:32 +0000	[diff] [blame]	1213	This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't
Bill Wendling	7f436dd	2007-10-02 20:42:59 +0000	[diff] [blame]	1214	change, so we could simply subtract %eax from %ecx first and then use %ecx (or
				1215	vice-versa).
				1216
				1217	//===---------------------------------------------------------------------===//
				1218
Bill Wendling	c524bae	2007-10-02 21:43:06 +0000	[diff] [blame]	1219	For this code:
				1220
				1221	cond_next603: ; preds = %bb493, %cond_true336, %cond_next599
				1222	%v.21050.1 = phi i32 [ %v.21050.0, %cond_next599 ], [ %tmp344, %cond_true336 ], [ %v.2, %bb493 ] ; <i32> [#uses=1]
				1223	%maxz.21051.1 = phi i32 [ %maxz.21051.0, %cond_next599 ], [ 0, %cond_true336 ], [ %maxz.2, %bb493 ] ; <i32> [#uses=2]
				1224	%cnt.01055.1 = phi i32 [ %cnt.01055.0, %cond_next599 ], [ 0, %cond_true336 ], [ %cnt.0, %bb493 ] ; <i32> [#uses=2]
				1225	%byteptr.9 = phi i8* [ %byteptr.12, %cond_next599 ], [ %byteptr.0, %cond_true336 ], [ %byteptr.10, %bb493 ] ; <i8*> [#uses=9]
				1226	%bitptr.6 = phi i32 [ %tmp5571104.1, %cond_next599 ], [ %tmp4921049, %cond_true336 ], [ %bitptr.7, %bb493 ] ; <i32> [#uses=4]
				1227	%source.5 = phi i32 [ %tmp602, %cond_next599 ], [ %source.0, %cond_true336 ], [ %source.6, %bb493 ] ; <i32> [#uses=7]
				1228	%tmp606 = getelementptr %struct.const_tables* @tables, i32 0, i32 0, i32 %cnt.01055.1 ; <i8*> [#uses=1]
				1229	%tmp607 = load i8* %tmp606, align 1 ; <i8> [#uses=1]
				1230
				1231	We produce this:
				1232
				1233	LBB4_70: # cond_next603
				1234	movl -20(%ebp), %esi
				1235	movl L_tables$non_lazy_ptr-"L4$pb"(%esi), %esi
				1236
				1237	However, ICC caches this information before the loop and produces this:
				1238
				1239	movl 88(%esp), %eax #481.12
				1240
				1241	//===---------------------------------------------------------------------===//
Bill Wendling	54c4f83	2007-10-02 21:49:31 +0000	[diff] [blame]	1242
				1243	This code:
				1244
				1245	%tmp659 = icmp slt i16 %tmp654, 0 ; <i1> [#uses=1]
				1246	br i1 %tmp659, label %cond_true662, label %cond_next715
				1247
				1248	produces this:
				1249
				1250	testw %cx, %cx
				1251	movswl %cx, %esi
				1252	jns LBB4_109 # cond_next715
				1253
				1254	Shark tells us that using %cx in the testw instruction is sub-optimal. It
				1255	suggests using the 32-bit register (which is what ICC uses).
				1256
				1257	//===---------------------------------------------------------------------===//
Chris Lattner	802c62a	2007-10-03 17:10:03 +0000	[diff] [blame]	1258
				1259	rdar://5506677 - We compile this:
				1260
				1261	define i32 @foo(double %x) {
				1262	%x14 = bitcast double %x to i64 ; <i64> [#uses=1]
				1263	%tmp713 = trunc i64 %x14 to i32 ; <i32> [#uses=1]
				1264	%tmp8 = and i32 %tmp713, 2147483647 ; <i32> [#uses=1]
				1265	ret i32 %tmp8
				1266	}
				1267
				1268	to:
				1269
				1270	_foo:
				1271	subl $12, %esp
				1272	fldl 16(%esp)
				1273	fstpl (%esp)
				1274	movl $2147483647, %eax
				1275	andl (%esp), %eax
				1276	addl $12, %esp
				1277	#FP_REG_KILL
				1278	ret
				1279
				1280	It would be much better to eliminate the fldl/fstpl by folding the bitcast
				1281	into the load SDNode. That would give us:
				1282
				1283	_foo:
				1284	movl $2147483647, %eax
				1285	andl 4(%esp), %eax
				1286	ret
				1287
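In C terms, the fold replaces "reinterpret the double, then truncate" with a
plain 32-bit load from the argument's memory. A sketch, assuming a
little-endian target such as x86 and a 64-bit double; foo_bitcast and
foo_folded are illustrative names, not code from the backend:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* What the IR says: bitcast double to i64, truncate to i32, mask. */
static uint32_t foo_bitcast(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);        /* bitcast double -> i64 */
    return (uint32_t)bits & 0x7fffffffu;   /* trunc + and */
}

/* Folded form: load the low 32 bits straight from the argument's memory.
   Assumes little-endian layout, where the low word of the double sits at
   offset 0 (4(%esp) in the desired listing above). */
static uint32_t foo_folded(const double *slot) {
    uint32_t lo;
    memcpy(&lo, slot, sizeof lo);          /* movl/andl from memory */
    return lo & 0x7fffffffu;
}

static void check_foo(void) {
    const double samples[] = {0.0, 1.0, -2.5, 0.1, 3.141592653589793, 1e300};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(foo_bitcast(samples[i]) == foo_folded(&samples[i]));
}
```
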
				1288	//===---------------------------------------------------------------------===//
				1289
Chris Lattner	ae25999	2007-10-04 15:47:27 +0000	[diff] [blame]	1290	We compile this:
				1291
				1292	void compare (long long foo) {
				1293	if (foo < 4294967297LL)
				1294	abort();
				1295	}
				1296
				1297	to:
				1298
				1299	_compare:
				1300	subl $12, %esp
				1301	cmpl $0, 16(%esp)
				1302	setne %al
				1303	movzbw %al, %ax
				1304	cmpl $1, 20(%esp)
				1305	setg %cl
				1306	movzbw %cl, %cx
				1307	cmove %ax, %cx
				1308	movw %cx, %ax
				1309	testb $1, %al
				1310	je LBB1_2 # cond_true
				1311
				1312	(also really horrible code on ppc). This is due to the expand code for 64-bit
				1313	compares. GCC produces multiple branches, which is much nicer:
				1314
				1315	_compare:
				1316	pushl %ebp
				1317	movl %esp, %ebp
				1318	subl $8, %esp
				1319	movl 8(%ebp), %eax
				1320	movl 12(%ebp), %edx
				1321	subl $1, %edx
				1322	jg L5
				1323	L7:
				1324	jl L4
				1325	cmpl $0, %eax
				1326	jbe L4
				1327	L5:
				1328
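The multi-branch shape GCC uses can be written out in C: compare the high
words as signed values first, and fall back to an unsigned compare of the low
words only when the high words are equal. This is a sketch of the general
technique rather than of any particular expander; less_than_split is a
hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Signed 64-bit "x < c" using only 32-bit compares: signed compare on the
   high words, unsigned compare on the low words. */
static bool less_than_split(int64_t x, int64_t c) {
    /* Converting an out-of-range value to int32_t is implementation-defined
       in C; this sketch assumes the usual two's-complement wraparound. */
    int32_t xhi = (int32_t)(uint32_t)((uint64_t)x >> 32);
    int32_t chi = (int32_t)(uint32_t)((uint64_t)c >> 32);
    uint32_t xlo = (uint32_t)x;
    uint32_t clo = (uint32_t)c;

    if (xhi > chi) return false;   /* jg L5 */
    if (xhi < chi) return true;    /* jl L4 */
    return xlo < clo;              /* high words equal: unsigned low compare */
}

static void check_compare(void) {
    const int64_t c = 4294967297LL;   /* 2^32 + 1, as in compare() */
    const int64_t samples[] = {INT64_MIN, -1, 0, 4294967295LL, 4294967296LL,
                               4294967297LL, 4294967298LL, INT64_MAX};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(less_than_split(samples[i], c) == (samples[i] < c));
}
```
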
				1329	//===---------------------------------------------------------------------===//
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1330
Arnold Schwaighofer	373e865	2007-10-12 21:30:57 +0000	[diff] [blame]	1331	Tail call optimization improvements: Tail call optimization currently
				1332	pushes onto the top of the stack (their normal place for non-tail-call
Arnold Schwaighofer	449b01a	2008-01-11 16:49:42 +0000	[diff] [blame^]	1333	optimized calls) all arguments that are sourced from the caller's arguments
				1334	or from a virtual register (which may itself be sourced from the
				1335	caller's arguments).
				1336	This is done to prevent overwriting parameters (see the example
				1337	below) that might be used later.
Arnold Schwaighofer	373e865	2007-10-12 21:30:57 +0000	[diff] [blame]	1338
				1339	example:
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1340
				1341	int callee(int32, int64);
				1342	int caller(int32 arg1, int32 arg2) {
				1343	int64 local = arg2 * 2;
				1344	return callee(arg2, (int64)local);
				1345	}
				1346
				1347	[arg1]          [!arg2 no longer valid since we moved local onto it]
				1348	[arg2]      ->  [(int64)
				1349	[RETADDR]        local  ]
				1350
Arnold Schwaighofer	373e865	2007-10-12 21:30:57 +0000	[diff] [blame]	1351	Moving arg1 onto the stack slot of callee function would overwrite
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1352	arg2 of the caller.
				1353
				1354	Possible optimizations:
				1355
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1356
Arnold Schwaighofer	373e865	2007-10-12 21:30:57 +0000	[diff] [blame]	1357	- Analyse the actual parameters of the callee to see which would
				1358	overwrite a caller parameter which is used by the callee and only
				1359	push them onto the top of the stack.
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1360
				1361	int callee (int32 arg1, int32 arg2);
				1362	int caller (int32 arg1, int32 arg2) {
				1363	return callee(arg1,arg2);
				1364	}
				1365
Arnold Schwaighofer	373e865	2007-10-12 21:30:57 +0000	[diff] [blame]	1366	Here we don't need to write any variables to the top of the stack
				1367	since they don't overwrite each other.
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1368
				1369	int callee (int32 arg1, int32 arg2);
				1370	int caller (int32 arg1, int32 arg2) {
				1371	return callee(arg2,arg1);
				1372	}
				1373
Arnold Schwaighofer	373e865	2007-10-12 21:30:57 +0000	[diff] [blame]	1374	Here we need to push the arguments because they overwrite each
				1375	other.
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1376
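The proposed analysis can be sketched in C. This is a deliberately
conservative model under stated assumptions, not the backend's data
structures: all argument slots have the same size, outgoing argument i is
written to slot i, and src[i] names the caller slot it is read from (or -1 if
it does not come from a caller slot). An argument is marked as needing a push
only when its source slot is overwritten by a different value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A caller slot is clobbered if the outgoing argument written to it is not
   simply its own current contents. */
static bool slot_is_clobbered(const int *src, size_t n, int slot) {
    return slot >= 0 && (size_t)slot < n && src[slot] != slot;
}

/* Marks in needs_push[] the outgoing arguments whose source slot is
   overwritten by another outgoing argument; returns how many were marked. */
static size_t args_needing_push(const int *src, size_t n, bool *needs_push) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        needs_push[i] = src[i] != (int)i && slot_is_clobbered(src, n, src[i]);
        if (needs_push[i]) ++count;
    }
    return count;
}

static void check_tail_call_args(void) {
    bool push[2];
    const int same[2] = {0, 1};      /* callee(arg1, arg2) */
    const int swapped[2] = {1, 0};   /* callee(arg2, arg1) */
    assert(args_needing_push(same, 2, push) == 0);
    assert(args_needing_push(swapped, 2, push) == 2);
}
```

This reproduces the two cases above: passing the arguments through unchanged
marks nothing, while swapping them marks both.
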
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1377	//===---------------------------------------------------------------------===//
Evan Cheng	7f1ad6a	2007-10-28 04:01:09 +0000	[diff] [blame]	1378
				1379	main ()
				1380	{
				1381	int i = 0;
				1382	unsigned long int z = 0;
				1383
				1384	do {
				1385	z -= 0x00004000;
				1386	i++;
				1387	if (i > 0x00040000)
				1388	abort ();
				1389	} while (z > 0);
				1390	exit (0);
				1391	}
				1392
				1393	gcc compiles this to:
				1394
				1395	_main:
				1396	subl $28, %esp
				1397	xorl %eax, %eax
				1398	jmp L2
				1399	L3:
				1400	cmpl $262144, %eax
				1401	je L10
				1402	L2:
				1403	addl $1, %eax
				1404	cmpl $262145, %eax
				1405	jne L3
				1406	call L_abort$stub
				1407	L10:
				1408	movl $0, (%esp)
				1409	call L_exit$stub
				1410
				1411	llvm:
				1412
				1413	_main:
				1414	subl $12, %esp
				1415	movl $1, %eax
				1416	movl $16384, %ecx
				1417	LBB1_1: # bb
				1418	cmpl $262145, %eax
				1419	jge LBB1_4 # cond_true
				1420	LBB1_2: # cond_next
				1421	incl %eax
				1422	addl $4294950912, %ecx
				1423	cmpl $16384, %ecx
				1424	jne LBB1_1 # bb
				1425	LBB1_3: # bb11
				1426	xorl %eax, %eax
				1427	addl $12, %esp
				1428	ret
				1429	LBB1_4: # cond_true
				1430	call L_abort$stub
				1431
				1432	1. LSR should rewrite the first cmp with induction variable %ecx.
				1433	2. DAG combiner should fold
				1434	leal 1(%eax), %edx
				1435	cmpl $262145, %edx
				1436	=>
				1437	cmpl $262144, %eax
				1438
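For item 2, the fold is safe for equality tests because x + 1 == C and
x == C - 1 agree even when the add wraps. An ordered compare (such as the jge
in the LLVM output) would additionally need to know that the add cannot
overflow. A small C sketch of the equality case, with illustrative function
names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Before the fold: leal 1(%eax), %edx ; cmpl $262145, %edx */
static bool cmp_after_add(uint32_t x) {
    return x + 1u == 262145u;
}

/* After the fold: cmpl $262144, %eax */
static bool cmp_folded(uint32_t x) {
    return x == 262144u;
}

/* Both forms agree, including where x + 1 wraps around to 0. */
static void check_cmp_fold(void) {
    const uint32_t samples[] = {0u, 262143u, 262144u, 262145u, 0xffffffffu};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(cmp_after_add(samples[i]) == cmp_folded(samples[i]));
}
```
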
				1439	//===---------------------------------------------------------------------===//
Chris Lattner	358670b	2007-11-24 06:13:33 +0000	[diff] [blame]	1440
				1441	define i64 @test(double %X) {
				1442	%Y = fptosi double %X to i64
				1443	ret i64 %Y
				1444	}
				1445
				1446	compiles to:
				1447
				1448	_test:
				1449	subl $20, %esp
				1450	movsd 24(%esp), %xmm0
				1451	movsd %xmm0, 8(%esp)
				1452	fldl 8(%esp)
				1453	fisttpll (%esp)
				1454	movl 4(%esp), %edx
				1455	movl (%esp), %eax
				1456	addl $20, %esp
				1457	#FP_REG_KILL
				1458	ret
				1459
				1460	This should just fldl directly from the input stack slot.
Chris Lattner	10d54d1	2007-12-05 22:58:19 +0000	[diff] [blame]	1461
				1462	//===---------------------------------------------------------------------===//
				1463
				1464	This code:
				1465	int foo (int x) { return (x & 65535) | 255; }
				1466
				1467	Should compile into:
				1468
				1469	_foo:
				1470	movzwl 4(%esp), %eax
				1471	orb $-1, %al ;; 'orl 255' is also fine :)
				1472	ret
				1473
				1474	instead of:
				1475	_foo:
				1476	movl $255, %eax
				1477	orl 4(%esp), %eax
				1478	andl $65535, %eax
				1479	ret
				1480
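The three formulations compute the same value, which is what makes the
shorter sequence legal. A C sketch with illustrative names, using unsigned
types so the bit operations are unambiguous:

```c
#include <assert.h>
#include <stdint.h>

/* Source expression from foo(). */
static uint32_t and_then_or(uint32_t x) {
    return (x & 65535u) | 255u;
}

/* Desired code: zero-extending 16-bit load, then set the low byte. */
static uint32_t zext_then_or(uint32_t x) {
    uint32_t r = (uint16_t)x;  /* movzwl 4(%esp), %eax */
    r |= 0xffu;                /* orb $-1, %al (or orl $255, %eax) */
    return r;
}

/* Current code: or first, then mask. */
static uint32_t or_then_and(uint32_t x) {
    return (x | 255u) & 65535u;  /* movl $255 / orl / andl $65535 */
}

static void check_and_or(void) {
    const uint32_t samples[] = {0u, 0xffu, 0x1234u, 0x12345600u, 0xffffffffu};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        assert(and_then_or(samples[i]) == zext_then_or(samples[i]));
        assert(and_then_or(samples[i]) == or_then_and(samples[i]));
    }
}
```
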
Chris Lattner	d079b4e	2007-12-18 16:48:14 +0000	[diff] [blame]	1481	//===---------------------------------------------------------------------===//
				1482
				1483	We're missing an obvious fold of a load into imul:
				1484
				1485	int test(long a, long b) { return a * b; }
				1486
				1487	LLVM produces:
				1488	_test:
				1489	movl 4(%esp), %ecx
				1490	movl 8(%esp), %eax
				1491	imull %ecx, %eax
				1492	ret
				1493
				1494	vs:
				1495	_test:
				1496	movl 8(%esp), %eax
				1497	imull 4(%esp), %eax
				1498	ret
				1499
				1500	//===---------------------------------------------------------------------===//
				1501
Chris Lattner	2b55ebd	2007-12-24 19:27:46 +0000	[diff] [blame]	1502	We can fold a store into "zeroing a reg". Instead of:
				1503
				1504	xorl %eax, %eax
				1505	movl %eax, 124(%esp)
				1506
				1507	we should get:
				1508
				1509	movl $0, 124(%esp)
				1510
				1511	if the flags of the xor are dead.
				1512
				1513	//===---------------------------------------------------------------------===//
Chris Lattner	6440095	2007-12-28 21:50:40 +0000	[diff] [blame]	1514
				1515	This testcase misses a read/modify/write opportunity (from PR1425):
				1516
				1517	void vertical_decompose97iH1(int *b0, int *b1, int *b2, int width){
				1518	int i;
				1519	for(i=0; i<width; i++)
				1520	b1[i] += (1*(b0[i] + b2[i])+0)>>0;
				1521	}
				1522
				1523	We compile it down to:
				1524
				1525	LBB1_2: # bb
				1526	movl (%esi,%edi,4), %ebx
				1527	addl (%ecx,%edi,4), %ebx
				1528	addl (%edx,%edi,4), %ebx
				1529	movl %ebx, (%ecx,%edi,4)
				1530	incl %edi
				1531	cmpl %eax, %edi
				1532	jne LBB1_2 # bb
				1533
				1534	the inner loop should add to the memory location (%ecx,%edi,4), saving
				1535	a mov. Something like:
				1536
				1537	movl (%esi,%edi,4), %ebx
				1538	addl (%edx,%edi,4), %ebx
				1539	addl %ebx, (%ecx,%edi,4)
				1540
Chris Lattner	bde7310	2007-12-29 05:51:58 +0000	[diff] [blame]	1541	Here is another interesting example:
				1542
				1543	void vertical_compose97iH1(int *b0, int *b1, int *b2, int width){
				1544	int i;
				1545	for(i=0; i<width; i++)
				1546	b1[i] -= (1*(b0[i] + b2[i])+0)>>0;
				1547	}
				1548
				1549	We miss the r/m/w opportunity here by using 2 subs instead of an add+sub[mem]:
				1550
				1551	LBB9_2: # bb
				1552	movl (%ecx,%edi,4), %ebx
				1553	subl (%esi,%edi,4), %ebx
				1554	subl (%edx,%edi,4), %ebx
				1555	movl %ebx, (%ecx,%edi,4)
				1556	incl %edi
				1557	cmpl %eax, %edi
				1558	jne LBB9_2 # bb
				1559
				1560	Additionally, LSR should rewrite the exit condition of these loops to use
Chris Lattner	6440095	2007-12-28 21:50:40 +0000	[diff] [blame]	1561	a stride-4 IV, which would allow all the scales in the loop to go away.
				1562	This would result in smaller code and more efficient microops.
				1563
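The stride-4 rewrite can be shown at the source level. In the sketch below
(illustrative names, assuming sizeof(int) == 4 as on x86), the induction
variable is a byte offset, so no address needs a scale factor and the exit
test compares against a precomputed end offset:

```c
#include <assert.h>
#include <stddef.h>

/* Indexed form, as in vertical_decompose97iH1: every access scales i by 4. */
static void add_indexed(const int *b0, int *b1, const int *b2, int width) {
    for (int i = 0; i < width; i++)
        b1[i] += b0[i] + b2[i];
}

/* Strength-reduced form: the IV advances by sizeof(int) bytes per trip. */
static void add_strided(const int *b0, int *b1, const int *b2, int width) {
    if (width <= 0) return;
    const size_t end = (size_t)width * sizeof(int);
    for (size_t off = 0; off != end; off += sizeof(int)) {
        const int *p0 = (const int *)((const char *)b0 + off);
        const int *p2 = (const int *)((const char *)b2 + off);
        int *p1 = (int *)((char *)b1 + off);
        *p1 += *p0 + *p2;   /* candidate for an add-to-memory (RMW) */
    }
}

static void check_strided(void) {
    int b0[4] = {1, 2, 3, 4}, b2[4] = {10, 20, 30, 40};
    int x[4] = {5, 5, 5, 5}, y[4] = {5, 5, 5, 5};
    add_indexed(b0, x, b2, 4);
    add_strided(b0, y, b2, 4);
    for (int i = 0; i < 4; i++)
        assert(x[i] == y[i]);
}
```
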
				1564	//===---------------------------------------------------------------------===//
Chris Lattner	0362a36	2008-01-07 21:59:58 +0000	[diff] [blame]	1565
				1566	In SSE mode, we turn abs and neg into a load from the constant pool plus an
				1567	and (for abs) or xor (for neg) instruction, for example:
				1568
Chris Lattner	b4cbb68	2008-01-09 00:37:18 +0000	[diff] [blame]	1569	xorpd LCPI1_0, %xmm2
Chris Lattner	0362a36	2008-01-07 21:59:58 +0000	[diff] [blame]	1570
				1571	However, if xmm2 gets spilled, we end up with really ugly code like this:
				1572
Chris Lattner	b4cbb68	2008-01-09 00:37:18 +0000	[diff] [blame]	1573	movsd (%esp), %xmm0
				1574	xorpd LCPI1_0, %xmm0
				1575	movsd %xmm0, (%esp)
Chris Lattner	0362a36	2008-01-07 21:59:58 +0000	[diff] [blame]	1576
				1577	Since we 'know' that this is a 'neg', we can actually "fold" the spill into
				1578	the neg/abs instruction, turning it into an integer operation, like this:
				1579
				1580	xorl 2147483648, [mem+4] ## 2147483648 = (1 << 31)
				1581
				1582	You could also use xorb, but xorl is less likely to lead to a partial register
Chris Lattner	b4cbb68	2008-01-09 00:37:18 +0000	[diff] [blame]	1583	stall. Here is a contrived testcase:
				1584
				1585	double a, b, c;
				1586	void test(double *P) {
				1587	double X = *P;
				1588	a = X;
				1589	bar();
				1590	X = -X;
				1591	b = X;
				1592	bar();
				1593	c = X;
				1594	}
Chris Lattner	0362a36	2008-01-07 21:59:58 +0000	[diff] [blame]	1595
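The integer form of neg and abs on a spilled value can be sketched in C,
assuming an IEEE-754 64-bit double. The helpers are hypothetical names; they
operate on the full 64-bit image so the sketch does not depend on byte order.
On little-endian x86 the sign bit lives in the 32-bit word at offset 4, which
is why the xorl above targets [mem+4].

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Negate a double in memory by flipping its sign bit with an integer xor,
   without loading it into an FP/SSE register. */
static void neg_in_memory(double *slot) {
    uint64_t bits;
    memcpy(&bits, slot, sizeof bits);
    bits ^= UINT64_C(1) << 63;
    memcpy(slot, &bits, sizeof bits);
}

/* abs is the matching 'and' form: clear the sign bit. */
static void abs_in_memory(double *slot) {
    uint64_t bits;
    memcpy(&bits, slot, sizeof bits);
    bits &= ~(UINT64_C(1) << 63);
    memcpy(slot, &bits, sizeof bits);
}

static void check_neg_abs(void) {
    double d = 1.5;
    neg_in_memory(&d);
    assert(d == -1.5);
    abs_in_memory(&d);
    assert(d == 1.5);
}
```
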
				1596	//===---------------------------------------------------------------------===//