Blame - lib/Target/X86/README.txt - fp2-dev/platform/external/llvm

//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

//===---------------------------------------------------------------------===//

CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move. The X86
backend knows how to three-addressify this shift, but it appears the register
allocator isn't even asking it to do so in this case. We should investigate
why this isn't happening; it could have a significant impact on other important
cases for X86 as well.

//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
        return X/Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
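
The overflow concern can be made concrete. A 32-bit DIV divides EDX:EAX by its
operand and faults when the quotient does not fit in 32 bits, which happens
exactly when the high half of the dividend is not below the divisor. A minimal
sketch of that guard in C follows; fits_single_div and test_div are
illustrative names, not LLVM code.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 when X / Y can be computed by a single
   32-bit DIV (dividend in EDX:EAX) without faulting. The quotient fits in
   32 bits exactly when the high half of X is below Y. Y == 0 fails the
   same comparison. */
static int fits_single_div(uint64_t x, uint32_t y) {
    return (uint32_t)(x >> 32) < y;
}

/* What the source-level test() computes: the quotient truncated to 32 bits.
   The caller must not pass y == 0. */
static uint32_t test_div(uint64_t x, uint32_t y) {
    return (uint32_t)(x / y);
}
```

When the guard fails, the C semantics still ask for the truncated quotient, so
a bare DIV is only a valid lowering on the guarded path.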

//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.

One better solution for 1LL << x is:
        xorl %eax, %eax
        xorl %edx, %edx
        testb $32, %cl
        sete %al
        setne %dl
        sall %cl, %eax
        sall %cl, %edx

But that requires good 8-bit subreg support.

Also, this might be better. It's an extra shift, but it's one instruction
shorter, and doesn't stress 8-bit subreg support.
(From http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html,
but without the unnecessary and.)
        movl %ecx, %eax
        shrl $5, %eax
        movl %eax, %edx
        xorl $1, %edx
        sall %cl, %eax
        sall %cl, %edx

64-bit shifts (in general) expand to really bad code. Instead of using
cmovs, we should expand to a conditional branch like GCC produces.
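
To see why the first branchless sequence for 1LL << x works, here is a small C
model of it. It is only a sketch: shl_one is a made-up name, and the explicit
"& 31" stands in for the way a 32-bit x86 shift masks its count, which C does
not do on its own.

```c
#include <stdint.h>

/* C model of the sete/setne sequence: bit 5 of the count picks which half
   receives the 1, then both halves are shifted by the count modulo 32.
   Valid for 0 <= x <= 63. */
static uint64_t shl_one(unsigned x) {
    uint32_t lo = (x & 32) == 0;   /* testb $32, %cl ; sete %al  */
    uint32_t hi = (x & 32) != 0;   /*                  setne %dl */
    lo <<= (x & 31);               /* sall %cl, %eax */
    hi <<= (x & 31);               /* sall %cl, %edx */
    return ((uint64_t)hi << 32) | lo;
}
```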

//===---------------------------------------------------------------------===//

Compile this:
_Bool f(_Bool a) { return a!=1; }

into:
        movzbl %dil, %eax
        xorl $1, %eax
        ret

(Although note that this isn't a legal way to express the code that llvm-gcc
currently generates for that function.)

//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure. E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?

//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and hint to the register
allocator. Delay codegen until post register allocation.
Note: any_extend is now turned into an INSERT_SUBREG. We still need to teach
the coalescer how to deal with it though.

//===---------------------------------------------------------------------===//

It appears icc uses push for parameter passing. Need to investigate.

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 due to only updating some processor
flags.

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.

//===---------------------------------------------------------------------===//

How about intrinsics? An example is:
        res = _mm_mulhi_epu16(A, _mm_mul_epu32(B, C));

compiles to
        pmuludq (%eax), %xmm0
        movl 8(%esp), %eax
        movdqa (%eax), %xmm1
        pmulhuw %xmm0, %xmm1

The transformation probably requires an X86-specific pass or a DAG combiner
target-specific hook.

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setl %al
        movzbl %al, %eax
        ret

On some processors (which ones?), it is more efficient to do this:

_test:
        movl 8(%esp), %ebx
        xor %eax, %eax
        cmpl %ebx, 4(%esp)
        setl %al
        ret

Doing this correctly is tricky though, as the xor clobbers the flags.

//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:

void setbit(int *target, int bit) {
        *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
        *target &= ~(1 << bit);
}

//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

        movl $16843009, 4(%edx)
        movl $16843009, (%edx)
        movw $257, 8(%edx)

It might be better to generate

        movl $16843009, %eax
        movl %eax, 4(%edx)
        movl %eax, (%edx)
        movw %ax, 8(%edx)

when we can spare a register. It reduces code size.

//===---------------------------------------------------------------------===//

Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
get this:

define i32 @test1(i32 %X) {
        %Y = sdiv i32 %X, 8
        ret i32 %Y
}

_test1:
        movl 4(%esp), %eax
        movl %eax, %ecx
        sarl $31, %ecx
        shrl $29, %ecx
        addl %ecx, %eax
        sarl $3, %eax
        ret

GCC knows several different ways to codegen it, one of which is this:

_test1:
        movl 4(%esp), %eax
        cmpl $-1, %eax
        leal 7(%eax), %ecx
        cmovle %ecx, %eax
        sarl $3, %eax
        ret

which is probably slower, but it's interesting at least :)
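
Both sequences round toward zero by adding 7 to negative inputs before the
arithmetic shift. The C models below are a sketch of that, with hypothetical
names (sdiv8_shift, sdiv8_cmov). They assume that >> on a negative int32_t is
an arithmetic shift and that unsigned-to-signed conversion wraps, which C
leaves implementation-defined but the usual x86 compilers provide.

```c
#include <stdint.h>

/* Shift-based sequence: build the bias (7 for negative x, else 0)
   from the sign mask. */
static int32_t sdiv8_shift(int32_t x) {
    uint32_t bias = (uint32_t)(x >> 31) >> 29;    /* sarl $31 ; shrl $29 */
    return (int32_t)((uint32_t)x + bias) >> 3;    /* addl ; sarl $3 */
}

/* cmov-based sequence: select x + 7 when x <= -1. */
static int32_t sdiv8_cmov(int32_t x) {
    int32_t biased = (int32_t)((uint32_t)x + 7u); /* leal 7(%eax), %ecx */
    if (x <= -1)                                  /* cmpl $-1 ; cmovle */
        x = biased;
    return x >> 3;                                /* sarl $3 */
}
```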

//===---------------------------------------------------------------------===//

We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and rep/movsl.
We should leave these as libcalls for everything over a much lower threshold,
since libc is hand tuned for medium and large mem ops (avoiding RFO for large
stores, TLB preheating, etc).

//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
 x * copysign(1.0, y) * copysign(1.0, z)
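
One candidate for "something reasonable" is to XOR the sign bits of y and z
into x, since each multiply by +/-1.0 only flips the sign. The sketch below
assumes IEEE-754 doubles and ignores the sign of NaN results; apply_signs and
bits_of are illustrative names, not existing code.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a double as its 64-bit pattern. */
static uint64_t bits_of(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* x * copysign(1.0, y) * copysign(1.0, z), computed with integer ops:
   flip the sign of x when the signs of y and z differ. */
static double apply_signs(double x, double y, double z) {
    const uint64_t sign = 0x8000000000000000ULL;
    uint64_t r = bits_of(x) ^ ((bits_of(y) ^ bits_of(z)) & sign);
    double out;
    memcpy(&out, &r, sizeof out);
    return out;
}
```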

//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.
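
A rough C rendering of the idea, assuming IEEE-754 doubles on a little-endian
target such as x86, where the sign bit lives in the 32-bit word at byte offset
4 of the double. copysign_from_mem is a made-up name for illustration.

```c
#include <stdint.h>
#include <string.h>

static double copysign_from_mem(double x, const double *y) {
    uint32_t yhi;
    uint64_t xbits;
    /* 32-bit integer load of the high word of *y (little-endian layout). */
    memcpy(&yhi, (const unsigned char *)y + 4, sizeof yhi);
    memcpy(&xbits, &x, sizeof xbits);
    xbits &= 0x7FFFFFFFFFFFFFFFULL;                 /* clear the sign of x */
    xbits |= (uint64_t)(yhi & 0x80000000u) << 32;   /* take the sign of *y */
    memcpy(&x, &xbits, sizeof x);
    return x;
}
```

The point is that *y never has to travel through an FP or SSE register just to
extract one bit.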

//===---------------------------------------------------------------------===//

%X = weak global int 0

void %foo(int %N) {
        %N = cast int %N to uint
        %tmp.24 = setgt int %N, 0
        br bool %tmp.24, label %no_exit, label %return

no_exit:
        %indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
        %i.0.0 = cast uint %indvar to int
        volatile store int %i.0.0, int* %X
        %indvar.next = add uint %indvar, 1
        %exitcond = seteq uint %indvar.next, %N
        br bool %exitcond, label %return, label %no_exit

return:
        ret void
}

compiles into:

        .text
        .align 4
        .globl _foo
_foo:
        movl 4(%esp), %eax
        cmpl $1, %eax
        jl LBB_foo_4 # return
LBB_foo_1: # no_exit.preheader
        xorl %ecx, %ecx
LBB_foo_2: # no_exit
        movl L_X$non_lazy_ptr, %edx
        movl %ecx, (%edx)
        incl %ecx
        cmpl %eax, %ecx
        jne LBB_foo_2 # no_exit
LBB_foo_3: # return.loopexit
LBB_foo_4: # return
        ret

We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
rematerialization is implemented. This can be accomplished with 1) a target
dependent LICM pass or 2) making SelectionDAG represent the whole function.

//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesort.

//===---------------------------------------------------------------------===//

We are generating far worse code than gcc:

volatile short X, Y;

void foo(int N) {
        int i;
        for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

LBB1_1: # entry.bb_crit_edge
        xorl %ecx, %ecx
        xorw %dx, %dx
LBB1_2: # bb
        movl L_X$non_lazy_ptr, %esi
        movw %cx, (%esi)
        movl L_Y$non_lazy_ptr, %esi
        movw %dx, (%esi)
        addw $4, %dx
        incl %ecx
        cmpl %eax, %ecx
        jne LBB1_2 # bb

vs.

        xorl %edx, %edx
        movl L_X$non_lazy_ptr-"L00000000001$pb"(%ebx), %esi
        movl L_Y$non_lazy_ptr-"L00000000001$pb"(%ebx), %ecx
L4:
        movw %dx, (%esi)
        leal 0(,%edx,4), %eax
        movw %ax, (%ecx)
        addl $1, %edx
        cmpl %edx, %edi
        jne L4

This is due to the lack of post regalloc LICM.

//===---------------------------------------------------------------------===//

Teach the coalescer to coalesce vregs of different register classes. e.g. FR32 /
FR64 to VR128.

//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 A, __m128 B) {
        if (_mm_comige_ss(A, B))
                return 3;
        else
                return 4;
}

_test:
        movl 8(%esp), %eax
        movaps (%eax), %xmm0
        movl 4(%esp), %eax
        movaps (%eax), %xmm1
        comiss %xmm0, %xmm1
        setae %al
        movzbl %al, %ecx
        movl $3, %eax
        movl $4, %edx
        cmpl $0, %ecx
        cmove %edx, %eax
        ret

Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There
are a number of issues. 1) We are introducing a setcc between the result of the
intrinsic call and select. 2) The intrinsic is expected to produce an i32 value
so an any extend (which becomes a zero extend) is added.

We probably need some kind of target DAG combine hook to fix this.

//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case we do worse on PPC.

//===---------------------------------------------------------------------===//

If shorter, we should use things like:
        movzwl %ax, %eax
instead of:
        andl $65535, %eax

The former can also be used when the two-addressy nature of the 'and' would
require a copy to be inserted (in X86InstrInfo::convertToThreeAddress).

//===---------------------------------------------------------------------===//

For this:

int test(int a)
{
        return a * 3;
}

We currently emit
        imull $3, 4(%esp), %eax

Perhaps this is what we really should generate? Is imull three or four
cycles? Note: ICC generates this:
        movl 4(%esp), %eax
        leal (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity. The former is
more "complex" because it folds a load so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
should always try to match LEA first since the LEA matching code does some
estimation to determine whether the match is profitable.

However, if we care more about code size, then imull is better. It's two bytes
shorter than movl + leal.

//===---------------------------------------------------------------------===//

__builtin_ffs codegen is messy.

int ffs_(unsigned X) { return __builtin_ffs(X); }

llvm produces:
ffs_:
        movl 4(%esp), %ecx
        bsfl %ecx, %eax
        movl $32, %edx
        cmove %edx, %eax
        incl %eax
        xorl %edx, %edx
        testl %ecx, %ecx
        cmove %edx, %eax
        ret

vs gcc:

_ffs_:
        movl $-1, %edx
        bsfl 4(%esp), %eax
        cmove %edx, %eax
        addl $1, %eax
        ret

Another example of __builtin_ffs (use predsimplify to eliminate a select):

int foo (unsigned long j) {
        if (j)
                return __builtin_ffs (j) - 1;
        else
                return 0;
}
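
The gcc sequence for ffs_ works because selecting -1 on a zero input makes the
final add produce 0. A small C model of that sequence is sketched below.
bsf_model and ffs_gcc_style are hypothetical names, and the model simply hands
back the old destination value when the source is zero, since the sequence
never relies on what BSF leaves there.

```c
#include <stdint.h>

/* Model of BSF: index of the lowest set bit. *zf mirrors ZF, which BSF
   sets when the source is zero. */
static uint32_t bsf_model(uint32_t src, uint32_t dst, int *zf) {
    *zf = (src == 0);
    if (src == 0)
        return dst;
    uint32_t i = 0;
    while (((src >> i) & 1u) == 0)
        ++i;
    return i;
}

static int ffs_gcc_style(uint32_t x) {
    int zf;
    uint32_t edx = 0xFFFFFFFFu;             /* movl $-1, %edx */
    uint32_t eax = bsf_model(x, 0, &zf);    /* bsfl 4(%esp), %eax */
    if (zf)
        eax = edx;                          /* cmove %edx, %eax */
    return (int)(eax + 1u);                 /* addl $1, %eax */
}
```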

//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.

//===---------------------------------------------------------------------===//

define i32 @foo(i32* %a, i32 %t) {
entry:
        br label %cond_true

cond_true: ; preds = %cond_true, %entry
        %x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ] ; <i32> [#uses=3]
        %t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ] ; <i32> [#uses=1]
        %tmp2 = getelementptr i32* %a, i32 %x.0.0 ; <i32*> [#uses=1]
        %tmp3 = load i32* %tmp2 ; <i32> [#uses=1]
        %tmp5 = add i32 %t_addr.0.0, %x.0.0 ; <i32> [#uses=1]
        %tmp7 = add i32 %tmp5, %tmp3 ; <i32> [#uses=2]
        %tmp9 = add i32 %x.0.0, 1 ; <i32> [#uses=2]
        %tmp = icmp sgt i32 %tmp9, 39 ; <i1> [#uses=1]
        br i1 %tmp, label %bb12, label %cond_true

bb12: ; preds = %cond_true
        ret i32 %tmp7
}
is pessimized by -loop-reduce and -indvars

//===---------------------------------------------------------------------===//

u32 to float conversion improvement:

float uint32_2_float( unsigned u ) {
        float fl = (int) (u & 0xffff);
        float fh = (int) (u >> 16);
        fh *= 0x1.0p16f;
        return fh + fl;
}

00000000  subl $0x04,%esp
00000003  movl 0x08(%esp,1),%eax
00000007  movl %eax,%ecx
00000009  shrl $0x10,%ecx
0000000c  cvtsi2ss %ecx,%xmm0
00000010  andl $0x0000ffff,%eax
00000015  cvtsi2ss %eax,%xmm1
00000019  mulss 0x00000078,%xmm0
00000021  addss %xmm1,%xmm0
00000025  movss %xmm0,(%esp,1)
0000002a  flds (%esp,1)
0000002d  addl $0x04,%esp
00000030  ret

//===---------------------------------------------------------------------===//

When using fastcc abi, align the stack slot of an argument of type double on an
8-byte boundary to improve performance.

//===---------------------------------------------------------------------===//

Codegen:

int f(int a, int b) {
        if (a == 4 || a == 6)
                b++;
        return b;
}


as:

        or eax, 2
        cmp eax, 6
        jz label
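
The trick relies on 4 and 6 differing only in bit 1, so OR-ing that bit in
maps both onto 6 and nothing else onto 6. A quick C check of the equivalence,
with f_ref and f_or as illustrative names:

```c
/* Reference form, as written in the source. */
static int f_ref(int a, int b) {
    if (a == 4 || a == 6)
        b++;
    return b;
}

/* Single-compare form: or eax, 2 ; cmp eax, 6 */
static int f_or(int a, int b) {
    if ((a | 2) == 6)
        b++;
    return b;
}
```

The same rewrite applies to any pair of constants that differ in exactly one
bit.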

//===---------------------------------------------------------------------===//

GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
simplifications for integer "x cmp y ? a : b". For example, instead of:

int G;
void f(int X, int Y) {
        G = X < 0 ? 14 : 13;
}

compiling to:

_f:
        movl $14, %eax
        movl $13, %ecx
        movl 4(%esp), %edx
        testl %edx, %edx
        cmovl %eax, %ecx
        movl %ecx, _G
        ret

it could be:
_f:
        movl 4(%esp), %eax
        sarl $31, %eax
        notl %eax
        addl $14, %eax
        movl %eax, _G
        ret

etc.

Another is:
int usesbb(unsigned int a, unsigned int b) {
        return (a < b ? -1 : 0);
}
to:
_usesbb:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        sbbl %eax, %eax
        ret

instead of:
_usesbb:
        xorl %eax, %eax
        movl 8(%esp), %ecx
        cmpl %ecx, 4(%esp)
        movl $4294967295, %ecx
        cmovb %ecx, %eax
        ret
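
Both rewrites turn a comparison into arithmetic on a 0/-1 mask. The C models
below sketch the two sequences. The function names are illustrative, and
select_14_13 assumes >> on a negative int32_t is an arithmetic shift, which is
implementation-defined in C.

```c
#include <stdint.h>

/* X < 0 ? 14 : 13 without a cmov. */
static int select_14_13(int32_t x) {
    int32_t m = x >> 31;    /* sarl $31: -1 if x < 0, else 0 */
    m = ~m;                 /* notl:      0 if x < 0, else -1 */
    return m + 14;          /* addl $14: 14 or 13 */
}

/* a < b ? -1 : 0. After the cmp, "sbbl %eax, %eax" computes
   eax - eax - CF, that is, minus the carry flag. */
static int32_t usesbb_model(uint32_t a, uint32_t b) {
    int32_t cf = a < b;     /* carry (borrow) flag from the unsigned compare */
    return -cf;
}
```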

//===---------------------------------------------------------------------===//

Currently we don't have elimination of redundant stack manipulations. Consider
the code:

int %main() {
entry:
        call fastcc void %test1( )
        call fastcc void %test2( sbyte* cast (void ()* %test1 to sbyte*) )
        ret int 0
}

declare fastcc void %test1()

declare fastcc void %test2(sbyte*)


This currently compiles to:

        subl $16, %esp
        call _test5
        addl $12, %esp
        subl $16, %esp
        movl $_test5, (%esp)
        call _test6
        addl $12, %esp

The add/sub pair is really unneeded here.

//===---------------------------------------------------------------------===//

Consider the expansion of:

define i32 @test3(i32 %X) {
        %tmp1 = urem i32 %X, 255
        ret i32 %tmp1
}

Currently it compiles to:

...
        movl $2155905153, %ecx
        movl 8(%esp), %esi
        movl %esi, %eax
        mull %ecx
...

This could be "reassociated" into:

        movl $2155905153, %eax
        movl 8(%esp), %ecx
        mull %ecx

to avoid the copy. In fact, the existing two-address stuff would do this
except that mul isn't a commutative 2-addr instruction. I guess this has
to be done at isel time based on the #uses to mul?
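
For context, 2155905153 is 0x80808081, the rounded-up value of 2^39 / 255, so
the mull is the first step of a reciprocal multiplication. The sketch below
shows the whole computation in C. The shift by 7 and the final multiply and
subtract are not shown in the elided listing above; they are filled in here
from the standard identity, and urem255 is a made-up name.

```c
#include <stdint.h>

static uint32_t urem255(uint32_t x) {
    uint64_t prod = (uint64_t)x * 2155905153u;   /* mull: result in EDX:EAX */
    uint32_t q = (uint32_t)(prod >> 32) >> 7;    /* x / 255, from EDX */
    return x - q * 255u;                         /* x % 255 */
}
```

Since the multiply is commutative at the source level, either operand can sit
in %eax, which is what makes the copy-free form legal.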

//===---------------------------------------------------------------------===//

Make sure the instruction which starts a loop does not cross a cacheline
boundary. This requires knowing the exact length of each machine instruction.
That is somewhat complicated, but doable. Example 256.bzip2:

In the new trace, the hot loop has an instruction which crosses a cacheline
boundary. In addition to potential cache misses, this can't help decoding as I
imagine there has to be some kind of complicated decoder reset and realignment
to grab the bytes from the next cacheline.

532 532 0x3cfc movb (1809(%esp, %esi), %bl <<<--- spans 2 64 byte lines
942 942 0x3d03 movl %dh, (1809(%esp, %esi)
937 937 0x3d0a incl %esi
3 3 0x3d0b cmpb %bl, %dl
27 27 0x3d0d jnz 0x000062db <main+11707>

//===---------------------------------------------------------------------===//

In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.

//===---------------------------------------------------------------------===//

This could be a single 16-bit load.

int f(char *p) {
        if ((p[0] == 1) & (p[1] == 2)) return 1;
        return 0;
}
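
Spelled out in C, the single-load form might look like the sketch below. It
assumes a little-endian target such as x86, where p[0] is the low byte of the
16-bit value, and f_wide is an illustrative name.

```c
#include <stdint.h>
#include <string.h>

static int f_wide(const char *p) {
    uint16_t v;
    memcpy(&v, p, sizeof v);    /* one 16-bit load; memcpy avoids alignment
                                   and aliasing problems */
    return v == 0x0201;         /* p[0] == 1 in the low byte, p[1] == 2 */
}
```

The non-short-circuit & in the original is what makes this legal: both bytes
are read unconditionally, so widening the load cannot touch memory the source
did not already read.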

//===---------------------------------------------------------------------===//

We should inline lrintf and probably other libc functions.

//===---------------------------------------------------------------------===//

Start using the flags more. For example, compile:

int add_zf(int *x, int y, int a, int b) {
        if ((*x += y) == 0)
                return a;
        else
                return b;
}

to:
        addl %esi, (%rdi)
        movl %edx, %eax
        cmovne %ecx, %eax
        ret
instead of:

_add_zf:
        addl (%rdi), %esi
        movl %esi, (%rdi)
        testl %esi, %esi
        cmove %edx, %ecx
        movl %ecx, %eax
        ret

and:

int add_zf(int *x, int y, int a, int b) {
        if ((*x + y) < 0)
                return a;
        else
                return b;
}

to:

add_zf:
        addl (%rdi), %esi
        movl %edx, %eax
        cmovns %ecx, %eax
        ret

instead of:

_add_zf:
        addl (%rdi), %esi
        testl %esi, %esi
        cmovs %edx, %ecx
        movl %ecx, %eax
        ret

//===---------------------------------------------------------------------===//

These two functions have identical effects:

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}

We currently compile them to:

_f:
        movl 4(%esp), %eax
        movl %eax, %ecx
        incl %ecx
        movl 8(%esp), %edx
        cmpl %edx, %ecx
        jne LBB1_2 #UnifiedReturnBlock
LBB1_1: #cond_true
        addl $2, %eax
        ret
LBB1_2: #UnifiedReturnBlock
        movl %ecx, %eax
        ret
_f2:
        movl 4(%esp), %eax
        movl %eax, %ecx
        incl %ecx
        cmpl 8(%esp), %ecx
        sete %cl
        movzbl %cl, %ecx
        leal 1(%ecx,%eax), %eax
        ret

both of which are inferior to GCC's:

_f:
        movl 4(%esp), %edx
        leal 1(%edx), %eax
        addl $2, %edx
        cmpl 8(%esp), %eax
        cmove %edx, %eax
        ret
_f2:
        movl 4(%esp), %eax
        addl $1, %eax
        xorl %edx, %edx
        cmpl 8(%esp), %eax
        sete %dl
        addl %edx, %eax
        ret

//===---------------------------------------------------------------------===//

This code:

void test(int X) {
        if (X) abort();
}

is currently compiled to:

_test:
        subl $12, %esp
        cmpl $0, 16(%esp)
        jne LBB1_1
        addl $12, %esp
        ret
LBB1_1:
        call L_abort$stub

It would be better to produce:

_test:
        subl $12, %esp
        cmpl $0, 16(%esp)
        jne L_abort$stub
        addl $12, %esp
        ret

This can be applied to any no-return function call that takes no arguments etc.
Alternatively, the stack save/restore logic could be shrink-wrapped, producing
something like this:

_test:
        cmpl $0, 4(%esp)
        jne LBB1_1
        ret
LBB1_1:
        subl $12, %esp
        call L_abort$stub

Both are useful in different situations. Finally, it could be shrink-wrapped
and tail called, like this:

_test:
        cmpl $0, 4(%esp)
        jne LBB1_1
        ret
LBB1_1:
        pop %eax # realign stack.
        call L_abort$stub

Though this probably isn't worth it.

//===---------------------------------------------------------------------===//

We need to teach the codegen to convert two-address INC instructions to LEA
when the flags are dead (likewise dec). For example, on X86-64, compile:

int foo(int A, int B) {
        return A+1;
}

to:

_foo:
        leal 1(%edi), %eax
        ret

instead of:

_foo:
        incl %edi
        movl %edi, %eax
        ret

Another example is:

;; X's live range extends beyond the shift, so the register allocator
;; cannot coalesce it with Y. Because of this, a copy needs to be
;; emitted before the shift to save the register value before it is
;; clobbered. However, this copy is not needed if the register
;; allocator turns the shift into an LEA. This also occurs for ADD.

; Check that the shift gets turned into an LEA.
; RUN: llvm-as < %s | llc -march=x86 -x86-asm-syntax=intel | \
; RUN: not grep {mov E.X, E.X}

@G = external global i32 ; <i32*> [#uses=3]

define i32 @test1(i32 %X, i32 %Y) {
        %Z = add i32 %X, %Y ; <i32> [#uses=1]
        volatile store i32 %Y, i32* @G
        volatile store i32 %Z, i32* @G
        ret i32 %X
}

define i32 @test2(i32 %X) {
        %Z = add i32 %X, 1 ; <i32> [#uses=1]
        volatile store i32 %Z, i32* @G
        ret i32 %X
}

//===---------------------------------------------------------------------===//

Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
a neg instead of a sub instruction. Consider:

int test(char X) { return 7-X; }

We currently produce:
_test:
        movl $7, %eax
        movsbl 4(%esp), %ecx
        subl %ecx, %eax
        ret

We would use one fewer register if codegen'd as:

        movsbl 4(%esp), %eax
        neg %eax
        add $7, %eax
        ret

Note that this isn't beneficial if the load can be folded into the sub. In
this case, we want a sub:

int test(int X) { return 7-X; }
_test:
        movl $7, %eax
        subl 4(%esp), %eax
        ret

//===---------------------------------------------------------------------===//

Leaf functions that require one 4-byte spill slot have a prolog like this:

_foo:
        pushl %esi
        subl $4, %esp
...
and an epilog like this:
        addl $4, %esp
        popl %esi
        ret

It would be smaller, and potentially faster, to push eax on entry and to
pop into a dummy register instead of using addl/subl of esp. Just don't pop
into any return registers :)

//===---------------------------------------------------------------------===//

The X86 backend should fold (branch (or (setcc, setcc))) into multiple
branches. We generate really poor code for:

double testf(double a) {
        return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

For example, the entry BB is:

_testf:
        subl $20, %esp
        pxor %xmm0, %xmm0
        movsd 24(%esp), %xmm1
        ucomisd %xmm0, %xmm1
        setnp %al
        sete %cl
        testb %cl, %al
        jne LBB1_5 # UnifiedReturnBlock
LBB1_1: # cond_true


It would be better to replace the last four instructions with:

        jp LBB1_1
        je LBB1_5
LBB1_1:

We also codegen the inner ?: into a diamond:

        cvtss2sd LCPI1_0(%rip), %xmm2
        cvtss2sd LCPI1_1(%rip), %xmm3
        ucomisd %xmm1, %xmm0
        ja LBB1_3 # cond_true
LBB1_2: # cond_true
        movapd %xmm3, %xmm2
LBB1_3: # cond_true
        movapd %xmm2, %xmm0
        ret

We should sink the load into xmm3 into the LBB1_2 block. This should
be pretty easy, and will nuke all the copies.
				957
				958	//===---------------------------------------------------------------------===//
Chris Lattner	4084d49	2007-09-10 21:43:18 +0000	[diff] [blame]	959
				960	This:
				961	#include <algorithm>
				962	inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
				963	{ return std::make_pair(a + b, a + b < a); }
				964	bool no_overflow(unsigned a, unsigned b)
				965	{ return !full_add(a, b).second; }
				966
				967	Should compile to:
				968
				969
				970	_Z11no_overflowjj:
				971	addl %edi, %esi
				972	setae %al
				973	ret
				974
Eli Friedman	577c749	2008-02-21 21:16:49 +0000	[diff] [blame]	975	FIXME: That code looks wrong; bool return is normally defined as zext.
				976
Chris Lattner	4084d49	2007-09-10 21:43:18 +0000	[diff] [blame]	977	on x86-64, not:
				978
				979	__Z11no_overflowjj:
				980	addl %edi, %esi
				981	cmpl %edi, %esi
				982	setae %al
				983	movzbl %al, %eax
				984	ret
				985
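For reference, a plain C sketch of the same check (not the C++ from the
testcase). It shows why the extra cmpl is redundant: the comparison
"sum < a" is exactly the carry out of the add.

```c
#include <stdbool.h>

/* Unsigned addition wraps modulo 2^N, so the sum is smaller than an
   operand exactly when the add produced a carry. */
bool no_overflow(unsigned a, unsigned b) {
    unsigned sum = a + b;   /* addl %edi, %esi                  */
    return !(sum < a);      /* carry clear, i.e. setae after add */
}
```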
				986
				987	//===---------------------------------------------------------------------===//
Evan Cheng	35127a6	2007-09-10 22:16:37 +0000	[diff] [blame]	988
				989	Re-materialize MOV32r0 etc. with xor instead of changing them to moves if the
				990	condition register is dead. xor reg reg is shorter than mov reg, #0.
Chris Lattner	a487bf7	2007-09-26 06:29:31 +0000	[diff] [blame]	991
				992	//===---------------------------------------------------------------------===//
				993
				994	We aren't matching RMW instructions aggressively
				995	enough. Here's a reduced testcase (more in PR1160):
				996
				997	define void @test(i32* %huge_ptr, i32* %target_ptr) {
				998	%A = load i32* %huge_ptr ; <i32> [#uses=1]
				999	%B = load i32* %target_ptr ; <i32> [#uses=1]
				1000	%C = or i32 %A, %B ; <i32> [#uses=1]
				1001	store i32 %C, i32* %target_ptr
				1002	ret void
				1003	}
				1004
				1005	$ llvm-as < t.ll | llc -march=x86-64
				1006
				1007	_test:
				1008	movl (%rdi), %eax
				1009	orl (%rsi), %eax
				1010	movl %eax, (%rsi)
				1011	ret
				1012
				1013	That should be something like:
				1014
				1015	_test:
				1016	movl (%rdi), %eax
				1017	orl %eax, (%rsi)
				1018	ret
				1019
				1020	//===---------------------------------------------------------------------===//
				1021
Bill Wendling	7f436dd	2007-10-02 20:42:59 +0000	[diff] [blame]	1022	The following code:
				1023
Bill Wendling	c2036e3	2007-10-02 20:54:32 +0000	[diff] [blame]	1024	bb114.preheader: ; preds = %cond_next94
				1025	%tmp231232 = sext i16 %tmp62 to i32 ; <i32> [#uses=1]
				1026	%tmp233 = sub i32 32, %tmp231232 ; <i32> [#uses=1]
				1027	%tmp245246 = sext i16 %tmp65 to i32 ; <i32> [#uses=1]
				1028	%tmp252253 = sext i16 %tmp68 to i32 ; <i32> [#uses=1]
				1029	%tmp254 = sub i32 32, %tmp252253 ; <i32> [#uses=1]
				1030	%tmp553554 = bitcast i16* %tmp37 to i8* ; <i8*> [#uses=2]
				1031	%tmp583584 = sext i16 %tmp98 to i32 ; <i32> [#uses=1]
				1032	%tmp585 = sub i32 32, %tmp583584 ; <i32> [#uses=1]
				1033	%tmp614615 = sext i16 %tmp101 to i32 ; <i32> [#uses=1]
				1034	%tmp621622 = sext i16 %tmp104 to i32 ; <i32> [#uses=1]
				1035	%tmp623 = sub i32 32, %tmp621622 ; <i32> [#uses=1]
				1036	br label %bb114
				1037
				1038	produces:
				1039
Bill Wendling	7f436dd	2007-10-02 20:42:59 +0000	[diff] [blame]	1040	LBB3_5: # bb114.preheader
				1041	movswl -68(%ebp), %eax
				1042	movl $32, %ecx
				1043	movl %ecx, -80(%ebp)
				1044	subl %eax, -80(%ebp)
				1045	movswl -52(%ebp), %eax
				1046	movl %ecx, -84(%ebp)
				1047	subl %eax, -84(%ebp)
				1048	movswl -70(%ebp), %eax
				1049	movl %ecx, -88(%ebp)
				1050	subl %eax, -88(%ebp)
				1051	movswl -50(%ebp), %eax
				1052	subl %eax, %ecx
				1053	movl %ecx, -76(%ebp)
				1054	movswl -42(%ebp), %eax
				1055	movl %eax, -92(%ebp)
				1056	movswl -66(%ebp), %eax
				1057	movl %eax, -96(%ebp)
				1058	movw $0, -98(%ebp)
				1059
Chris Lattner	792bae5	2007-10-03 03:40:24 +0000	[diff] [blame]	1060	This appears to be bad because the RA is not folding the store to the stack
				1061	slot into the movl. The above instructions could be:
				1062	movl $32, -80(%ebp)
				1063	...
				1064	movl $32, -84(%ebp)
				1065	...
				1066	This seems like a cross between remat and spill folding.
				1067
Bill Wendling	c2036e3	2007-10-02 20:54:32 +0000	[diff] [blame]	1068	This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't
Bill Wendling	7f436dd	2007-10-02 20:42:59 +0000	[diff] [blame]	1069	change, so we could simply subtract %eax from %ecx first and then use %ecx (or
				1070	vice-versa).
				1071
				1072	//===---------------------------------------------------------------------===//
				1073
Bill Wendling	54c4f83	2007-10-02 21:49:31 +0000	[diff] [blame]	1074	This code:
				1075
				1076	%tmp659 = icmp slt i16 %tmp654, 0 ; <i1> [#uses=1]
				1077	br i1 %tmp659, label %cond_true662, label %cond_next715
				1078
				1079	produces this:
				1080
				1081	testw %cx, %cx
				1082	movswl %cx, %esi
				1083	jns LBB4_109 # cond_next715
				1084
				1085	Shark tells us that using %cx in the testw instruction is sub-optimal. It
				1086	suggests using the 32-bit register (which is what ICC uses).
				1087
				1088	//===---------------------------------------------------------------------===//
Chris Lattner	802c62a	2007-10-03 17:10:03 +0000	[diff] [blame]	1089
Chris Lattner	ae25999	2007-10-04 15:47:27 +0000	[diff] [blame]	1090	We compile this:
				1091
				1092	void compare (long long foo) {
				1093	if (foo < 4294967297LL)
				1094	abort();
				1095	}
				1096
				1097	to:
				1098
Eli Friedman	577c749	2008-02-21 21:16:49 +0000	[diff] [blame]	1099	compare:
				1100	subl $4, %esp
				1101	cmpl $0, 8(%esp)
Chris Lattner	ae25999	2007-10-04 15:47:27 +0000	[diff] [blame]	1102	setne %al
				1103	movzbw %al, %ax
Eli Friedman	577c749	2008-02-21 21:16:49 +0000	[diff] [blame]	1104	cmpl $1, 12(%esp)
Chris Lattner	ae25999	2007-10-04 15:47:27 +0000	[diff] [blame]	1105	setg %cl
				1106	movzbw %cl, %cx
				1107	cmove %ax, %cx
Eli Friedman	577c749	2008-02-21 21:16:49 +0000	[diff] [blame]	1108	testb $1, %cl
				1109	jne .LBB1_2 # UnifiedReturnBlock
				1110	.LBB1_1: # ifthen
				1111	call abort
				1112	.LBB1_2: # UnifiedReturnBlock
				1113	addl $4, %esp
				1114	ret
Chris Lattner	ae25999	2007-10-04 15:47:27 +0000	[diff] [blame]	1115
				1116	(also really horrible code on ppc). This is due to the expand code for 64-bit
				1117	compares. GCC produces multiple branches, which is much nicer:
				1118
Eli Friedman	577c749	2008-02-21 21:16:49 +0000	[diff] [blame]	1119	compare:
				1120	subl $12, %esp
				1121	movl 20(%esp), %edx
				1122	movl 16(%esp), %eax
				1123	decl %edx
				1124	jle .L7
				1125	.L5:
				1126	addl $12, %esp
				1127	ret
				1128	.p2align 4,,7
				1129	.L7:
				1130	jl .L4
Chris Lattner	ae25999	2007-10-04 15:47:27 +0000	[diff] [blame]	1131	cmpl $0, %eax
Eli Friedman	577c749	2008-02-21 21:16:49 +0000	[diff] [blame]	1132	.p2align 4,,8
				1133	ja .L5
				1134	.L4:
				1135	.p2align 4,,9
				1136	call abort
Chris Lattner	ae25999	2007-10-04 15:47:27 +0000	[diff] [blame]	1137
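The multiple-branch shape can be sketched in C as below. This is only an
illustration of the decomposition, not the legalizer's expansion;
less_than_const is a hypothetical name. The high words are compared as signed
values first, and the low words are compared unsigned only when the high
words are equal.

```c
#include <stdbool.h>
#include <stdint.h>

/* foo < 4294967297 (0x00000001_00000001) decided from 32-bit halves. */
bool less_than_const(int64_t foo) {
    uint64_t bits = (uint64_t)foo;
    uint32_t lo = (uint32_t)bits;
    uint32_t hi = (uint32_t)(bits >> 32);

    /* Flipping the sign bit maps signed order onto unsigned order, which
       keeps this sketch free of implementation-defined conversions. */
    uint32_t hi_key = hi ^ 0x80000000u;
    const uint32_t const_hi_key = 1u ^ 0x80000000u;  /* high word of constant is 1 */
    const uint32_t const_lo = 1u;                    /* low word of constant is 1  */

    if (hi_key != const_hi_key)
        return hi_key < const_hi_key;  /* first branch: high words differ */
    return lo < const_lo;              /* second branch: unsigned low compare */
}
```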
				1138	//===---------------------------------------------------------------------===//
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1139
Arnold Schwaighofer	373e865	2007-10-12 21:30:57 +0000	[diff] [blame]	1140	Tail call optimization improvements: Tail call optimization currently
				1141	pushes all arguments on the top of the stack (their normal place for
Arnold Schwaighofer	449b01a	2008-01-11 16:49:42 +0000	[diff] [blame]	1142	non-tail call optimized calls) that source from the caller's arguments
				1143	or that source from a virtual register (also possibly sourcing from
				1144	caller's arguments).
				1145	This is done to prevent overwriting of parameters (see example
				1146	below) that might be used later.
Arnold Schwaighofer	373e865	2007-10-12 21:30:57 +0000	[diff] [blame]	1147
				1148	example:
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1149
				1150	int callee(int32, int64);
				1151	int caller(int32 arg1, int32 arg2) {
				1152	int64 local = arg2 * 2;
				1153	return callee(arg2, (int64)local);
				1154	}
				1155
				1156	[arg1] [!arg2 no longer valid since we moved local onto it]
				1157	[arg2] -> [(int64)
				1158	[RETADDR] local ]
				1159
Arnold Schwaighofer	373e865	2007-10-12 21:30:57 +0000	[diff] [blame]	1160	Moving arg1 onto the stack slot of callee function would overwrite
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1161	arg2 of the caller.
				1162
				1163	Possible optimizations:
				1164
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1165
Arnold Schwaighofer	373e865	2007-10-12 21:30:57 +0000	[diff] [blame]	1166	- Analyse the actual parameters of the callee to see which would
				1167	overwrite a caller parameter which is used by the callee and only
				1168	push them onto the top of the stack.
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1169
				1170	int callee (int32 arg1, int32 arg2);
				1171	int caller (int32 arg1, int32 arg2) {
				1172	return callee(arg1,arg2);
				1173	}
				1174
Arnold Schwaighofer	373e865	2007-10-12 21:30:57 +0000	[diff] [blame]	1175	Here we don't need to write any variables to the top of the stack
				1176	since they don't overwrite each other.
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1177
				1178	int callee (int32 arg1, int32 arg2);
				1179	int caller (int32 arg1, int32 arg2) {
				1180	return callee(arg2,arg1);
				1181	}
				1182
Arnold Schwaighofer	373e865	2007-10-12 21:30:57 +0000	[diff] [blame]	1183	Here we need to push the arguments because they overwrite each
				1184	other.
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1185
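A conservative sketch of such an analysis, under simplifying assumptions
that are not from the backend: all argument slots have the same size,
outgoing argument i is written to slot i, and src[i] names the incoming slot
it is read from (or -1 if it does not come from a caller argument slot). The
name needs_staging and this encoding are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if outgoing argument j must first be copied to the top of
   the stack because the slot it is read from gets overwritten. */
bool needs_staging(const int *src, size_t n, size_t j) {
    int s = src[j];
    if (s < 0 || (size_t)s == j)
        return false;   /* not read from a slot, or already in place */
    if ((size_t)s >= n)
        return false;   /* no outgoing argument is written to slot s */
    return src[s] != s; /* slot s receives a different value: conflict */
}
```

With src = {0, 1} (callee(arg1,arg2)) nothing needs staging; with
src = {1, 0} (callee(arg2,arg1)) both arguments do. The check ignores
ordering, so it may stage more than strictly necessary.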
Arnold Schwaighofer	e2d6bbb	2007-10-11 19:40:01 +0000	[diff] [blame]	1186	//===---------------------------------------------------------------------===//
Evan Cheng	7f1ad6a	2007-10-28 04:01:09 +0000	[diff] [blame]	1187
				1188	main ()
				1189	{
				1190	int i = 0;
				1191	unsigned long int z = 0;
				1192
				1193	do {
				1194	z -= 0x00004000;
				1195	i++;
				1196	if (i > 0x00040000)
				1197	abort ();
				1198	} while (z > 0);
				1199	exit (0);
				1200	}
				1201
				1202	gcc compiles this to:
				1203
				1204	_main:
				1205	subl $28, %esp
				1206	xorl %eax, %eax
				1207	jmp L2
				1208	L3:
				1209	cmpl $262144, %eax
				1210	je L10
				1211	L2:
				1212	addl $1, %eax
				1213	cmpl $262145, %eax
				1214	jne L3
				1215	call L_abort$stub
				1216	L10:
				1217	movl $0, (%esp)
				1218	call L_exit$stub
				1219
				1220	llvm:
				1221
				1222	_main:
				1223	subl $12, %esp
				1224	movl $1, %eax
				1225	movl $16384, %ecx
				1226	LBB1_1: # bb
				1227	cmpl $262145, %eax
				1228	jge LBB1_4 # cond_true
				1229	LBB1_2: # cond_next
				1230	incl %eax
				1231	addl $4294950912, %ecx
				1232	cmpl $16384, %ecx
				1233	jne LBB1_1 # bb
				1234	LBB1_3: # bb11
				1235	xorl %eax, %eax
				1236	addl $12, %esp
				1237	ret
				1238	LBB1_4: # cond_true
				1239	call L_abort$stub
				1240
				1241	1. LSR should rewrite the first cmp with induction variable %ecx.
				1242	2. DAG combiner should fold
				1243	leal 1(%eax), %edx
				1244	cmpl $262145, %edx
				1245	=>
				1246	cmpl $262144, %eax
				1247
				1248	//===---------------------------------------------------------------------===//
Chris Lattner	358670b	2007-11-24 06:13:33 +0000	[diff] [blame]	1249
				1250	define i64 @test(double %X) {
				1251	%Y = fptosi double %X to i64
				1252	ret i64 %Y
				1253	}
				1254
				1255	compiles to:
				1256
				1257	_test:
				1258	subl $20, %esp
				1259	movsd 24(%esp), %xmm0
				1260	movsd %xmm0, 8(%esp)
				1261	fldl 8(%esp)
				1262	fisttpll (%esp)
				1263	movl 4(%esp), %edx
				1264	movl (%esp), %eax
				1265	addl $20, %esp
				1266	#FP_REG_KILL
				1267	ret
				1268
				1269	This should just fldl directly from the input stack slot.
Chris Lattner	10d54d1	2007-12-05 22:58:19 +0000	[diff] [blame]	1270
				1271	//===---------------------------------------------------------------------===//
				1272
				1273	This code:
				1274	int foo (int x) { return (x & 65535) | 255; }
				1275
				1276	Should compile into:
				1277
				1278	_foo:
				1279	movzwl 4(%esp), %eax
Eli Friedman	577c749	2008-02-21 21:16:49 +0000	[diff] [blame]	1280	orl $255, %eax
Chris Lattner	10d54d1	2007-12-05 22:58:19 +0000	[diff] [blame]	1281	ret
				1282
				1283	instead of:
				1284	_foo:
				1285	movl $255, %eax
				1286	orl 4(%esp), %eax
				1287	andl $65535, %eax
				1288	ret
				1289
Chris Lattner	d079b4e	2007-12-18 16:48:14 +0000	[diff] [blame]	1290	//===---------------------------------------------------------------------===//
				1291
Chris Lattner	eec7ac0	2008-02-21 06:51:29 +0000	[diff] [blame]	1292	We're codegen'ing multiply of long longs inefficiently:
Chris Lattner	d079b4e	2007-12-18 16:48:14 +0000	[diff] [blame]	1293
Chris Lattner	eec7ac0	2008-02-21 06:51:29 +0000	[diff] [blame]	1294	unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) {
				1295	return arg1 * arg2;
				1296	}
Chris Lattner	d079b4e	2007-12-18 16:48:14 +0000	[diff] [blame]	1297
Chris Lattner	eec7ac0	2008-02-21 06:51:29 +0000	[diff] [blame]	1298	We compile to (-fomit-frame-pointer):
				1299
				1300	_LLM:
				1301	pushl %esi
				1302	movl 8(%esp), %ecx
				1303	movl 16(%esp), %esi
				1304	movl %esi, %eax
				1305	mull %ecx
				1306	imull 12(%esp), %esi
				1307	addl %edx, %esi
				1308	imull 20(%esp), %ecx
				1309	movl %esi, %edx
				1310	addl %ecx, %edx
				1311	popl %esi
				1312	ret
				1313
				1314	This looks like a scheduling deficiency and lack of remat of the load from
				1315	the argument area. ICC apparently produces:
				1316
				1317	movl 8(%esp), %ecx
				1318	imull 12(%esp), %ecx
				1319	movl 16(%esp), %eax
				1320	imull 4(%esp), %eax
				1321	addl %eax, %ecx
				1322	movl 4(%esp), %eax
				1323	mull 12(%esp)
				1324	addl %ecx, %edx
Chris Lattner	d079b4e	2007-12-18 16:48:14 +0000	[diff] [blame]	1325	ret
				1326
Chris Lattner	eec7ac0	2008-02-21 06:51:29 +0000	[diff] [blame]	1327	Note that it remat'd loads from 4(esp) and 12(esp). See this GCC PR:
				1328	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
Chris Lattner	d079b4e	2007-12-18 16:48:14 +0000	[diff] [blame]	1329
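For reference, a C sketch of the arithmetic both sequences implement: one
widening 32x32 multiply for the low words plus two truncating multiplies for
the cross terms. The function name is hypothetical, and the sketch assumes
unsigned int is 32 bits wide.

```c
#include <stdint.h>

/* 64-bit multiply built from 32-bit pieces, as on 32-bit x86. */
uint64_t mul64_from_halves(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t low_product = (uint64_t)a_lo * b_lo;  /* mull: 32x32 -> 64 in edx:eax */
    uint32_t cross = a_lo * b_hi + a_hi * b_lo;    /* two imull + addl, mod 2^32   */
    uint32_t hi = (uint32_t)(low_product >> 32) + cross;  /* addl into %edx */

    return ((uint64_t)hi << 32) | (uint32_t)low_product;
}
```

The a_hi * b_hi term is dropped because it only affects bits 64 and above.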
				1330	//===---------------------------------------------------------------------===//
				1331
Chris Lattner	2b55ebd	2007-12-24 19:27:46 +0000	[diff] [blame]	1332	We can fold a store into "zeroing a reg". Instead of:
				1333
				1334	xorl %eax, %eax
				1335	movl %eax, 124(%esp)
				1336
				1337	we should get:
				1338
				1339	movl $0, 124(%esp)
				1340
				1341	if the flags of the xor are dead.
				1342
Chris Lattner	459ff99	2008-01-11 18:00:13 +0000	[diff] [blame]	1343	Likewise, we isel "x<<1" into "add reg,reg". If reg is spilled, this should
				1344	be folded into: shl [mem], 1
				1345
Chris Lattner	2b55ebd	2007-12-24 19:27:46 +0000	[diff] [blame]	1346	//===---------------------------------------------------------------------===//
Chris Lattner	6440095	2007-12-28 21:50:40 +0000	[diff] [blame]	1347
				1348	This testcase misses a read/modify/write opportunity (from PR1425):
				1349
				1350	void vertical_decompose97iH1(int *b0, int *b1, int *b2, int width){
				1351	int i;
				1352	for(i=0; i<width; i++)
				1353	b1[i] += (1*(b0[i] + b2[i])+0)>>0;
				1354	}
				1355
				1356	We compile it down to:
				1357
				1358	LBB1_2: # bb
				1359	movl (%esi,%edi,4), %ebx
				1360	addl (%ecx,%edi,4), %ebx
				1361	addl (%edx,%edi,4), %ebx
				1362	movl %ebx, (%ecx,%edi,4)
				1363	incl %edi
				1364	cmpl %eax, %edi
				1365	jne LBB1_2 # bb
				1366
				1367	the inner loop should add to the memory location (%ecx,%edi,4), saving
				1368	a mov. Something like:
				1369
				1370	movl (%esi,%edi,4), %ebx
				1371	addl (%edx,%edi,4), %ebx
				1372	addl %ebx, (%ecx,%edi,4)
				1373
Chris Lattner	bde7310	2007-12-29 05:51:58 +0000	[diff] [blame]	1374	Here is another interesting example:
				1375
				1376	void vertical_compose97iH1(int *b0, int *b1, int *b2, int width){
				1377	int i;
				1378	for(i=0; i<width; i++)
				1379	b1[i] -= (1*(b0[i] + b2[i])+0)>>0;
				1380	}
				1381
				1382	We miss the r/m/w opportunity here by using 2 subs instead of an add+sub[mem]:
				1383
				1384	LBB9_2: # bb
				1385	movl (%ecx,%edi,4), %ebx
				1386	subl (%esi,%edi,4), %ebx
				1387	subl (%edx,%edi,4), %ebx
				1388	movl %ebx, (%ecx,%edi,4)
				1389	incl %edi
				1390	cmpl %eax, %edi
				1391	jne LBB9_2 # bb
				1392
				1393	Additionally, LSR should rewrite the exit condition of these loops to use
Chris Lattner	6440095	2007-12-28 21:50:40 +0000	[diff] [blame]	1394	a stride-4 IV, which would allow all the scales in the loop to go away.
				1395	This would result in smaller code and more efficient microops.
				1396
				1397	//===---------------------------------------------------------------------===//
Chris Lattner	0362a36	2008-01-07 21:59:58 +0000	[diff] [blame]	1398
				1399	In SSE mode, we turn abs and neg into a load from the constant pool plus an
				1400	"xor" or "and" instruction, for example:
				1401
Chris Lattner	b4cbb68	2008-01-09 00:37:18 +0000	[diff] [blame]	1402	xorpd LCPI1_0, %xmm2
Chris Lattner	0362a36	2008-01-07 21:59:58 +0000	[diff] [blame]	1403
				1404	However, if xmm2 gets spilled, we end up with really ugly code like this:
				1405
Chris Lattner	b4cbb68	2008-01-09 00:37:18 +0000	[diff] [blame]	1406	movsd (%esp), %xmm0
				1407	xorpd LCPI1_0, %xmm0
				1408	movsd %xmm0, (%esp)
Chris Lattner	0362a36	2008-01-07 21:59:58 +0000	[diff] [blame]	1409
				1410	Since we 'know' that this is a 'neg', we can actually "fold" the spill into
				1411	the neg/abs instruction, turning it into an integer operation, like this:
				1412
				1413	xorl 2147483648, [mem+4] ## 2147483648 = (1 << 31)
				1414
				1415	you could also use xorb, but xorl is less likely to lead to a partial register
Chris Lattner	b4cbb68	2008-01-09 00:37:18 +0000	[diff] [blame]	1416	stall. Here is a contrived testcase:
				1417
				1418	double a, b, c;
				1419	void test(double *P) {
				1420	double X = *P;
				1421	a = X;
				1422	bar();
				1423	X = -X;
				1424	b = X;
				1425	bar();
				1426	c = X;
				1427	}
Chris Lattner	0362a36	2008-01-07 21:59:58 +0000	[diff] [blame]	1428
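A C sketch of the folded form, assuming IEEE-754 doubles. The name
negate_in_memory is hypothetical. Flipping bit 63 of the 64-bit image is the
same bit that an xorl on [mem+4] would flip on little-endian x86.

```c
#include <stdint.h>
#include <string.h>

/* Negate a double that lives in a stack slot with an integer xor,
   without loading it into an SSE register. */
void negate_in_memory(double *slot) {
    uint64_t bits;
    memcpy(&bits, slot, sizeof bits);  /* view the spill slot as an integer */
    bits ^= UINT64_C(1) << 63;         /* flip only the sign bit            */
    memcpy(slot, &bits, sizeof bits);
}
```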
				1429	//===---------------------------------------------------------------------===//
Andrew Lenharth	785610d	2008-02-16 01:24:58 +0000	[diff] [blame]	1430
				1431	handling llvm.memory.barrier on pre SSE2 cpus
				1432
				1433	should generate:
				1434	lock ; mov %esp, %esp
				1435
				1436	//===---------------------------------------------------------------------===//
Chris Lattner	7644ff3	2008-02-17 19:43:57 +0000	[diff] [blame]	1437
				1438	The generated code on x86 for checking for signed overflow on a multiply the
				1439	obvious way is much longer than it needs to be.
				1440
				1441	int x(int a, int b) {
				1442	long long prod = (long long)a*b;
				1443	return prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1);
				1444	}
				1445
				1446	See PR2053 for more details.
				1447
				1448	//===---------------------------------------------------------------------===//
Chris Lattner	83f2236	2008-02-18 18:30:13 +0000	[diff] [blame]	1449
Eli Friedman	577c749	2008-02-21 21:16:49 +0000	[diff] [blame]	1450	We should investigate using cdq/cltd (effect: edx = sar eax, 31)
				1451	more aggressively; it should cost the same as a move+shift on any modern
				1452	processor, but it's a lot shorter. Downside is that it puts more
				1453	pressure on register allocation because it has fixed operands.
				1454
				1455	Example:
				1456	int abs(int x) {return x < 0 ? -x : x;}
				1457
				1458	gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.:
				1459	abs:
				1460	movl 4(%esp), %eax
				1461	cltd
				1462	xorl %edx, %eax
				1463	subl %edx, %eax
				1464	ret
				1465
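The cltd/xorl/subl sequence corresponds to the following C sketch. The
function name is hypothetical; the mask is built with a compare rather than
a signed shift to avoid relying on implementation-defined behavior. The
result is not meaningful for INT32_MIN.

```c
#include <stdint.h>

/* Branch-free abs: mask is all ones for negative x and zero otherwise,
   which is what cltd leaves in %edx. */
int32_t abs_via_sign_mask(int32_t x) {
    uint32_t ux = (uint32_t)x;
    uint32_t mask = (x < 0) ? 0xFFFFFFFFu : 0u;  /* cltd            */
    uint32_t r = (ux ^ mask) - mask;             /* xorl, then subl */
    return (int32_t)r;
}
```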
				1466	//===---------------------------------------------------------------------===//
				1467
				1468	Consider:
Chris Lattner	83f2236	2008-02-18 18:30:13 +0000	[diff] [blame]	1469	int test(unsigned long a, unsigned long b) { return -(a < b); }
				1470
				1471	We currently compile this to:
				1472
				1473	define i32 @test(i32 %a, i32 %b) nounwind {
				1474	%tmp3 = icmp ult i32 %a, %b ; <i1> [#uses=1]
				1475	%tmp34 = zext i1 %tmp3 to i32 ; <i32> [#uses=1]
				1476	%tmp5 = sub i32 0, %tmp34 ; <i32> [#uses=1]
				1477	ret i32 %tmp5
				1478	}
				1479
				1480	and
				1481
				1482	_test:
				1483	movl 8(%esp), %eax
				1484	cmpl %eax, 4(%esp)
				1485	setb %al
				1486	movzbl %al, %eax
				1487	negl %eax
				1488	ret
				1489
				1490	Several deficiencies here. First, we should instcombine zext+neg into sext:
				1491
				1492	define i32 @test2(i32 %a, i32 %b) nounwind {
				1493	%tmp3 = icmp ult i32 %a, %b ; <i1> [#uses=1]
				1494	%tmp34 = sext i1 %tmp3 to i32 ; <i32> [#uses=1]
				1495	ret i32 %tmp34
				1496	}
				1497
				1498	However, before we can do that, we have to fix the bad codegen that we get for
				1499	sext from bool:
				1500
				1501	_test2:
				1502	movl 8(%esp), %eax
				1503	cmpl %eax, 4(%esp)
				1504	setb %al
				1505	movzbl %al, %eax
				1506	shll $31, %eax
				1507	sarl $31, %eax
				1508	ret
				1509
				1510	This code should be at least as good as the code above. Once this is fixed, we
				1511	can optimize this specific case even more to:
				1512
				1513	movl 8(%esp), %eax
				1514	xorl %ecx, %ecx
				1515	cmpl %eax, 4(%esp)
				1516	sbbl %ecx, %ecx
				1517
				1518	//===---------------------------------------------------------------------===//
Eli Friedman	1aa1f2c	2008-02-28 00:21:43 +0000	[diff] [blame]	1519
				1520	Take the following code (from
				1521	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541):
				1522
				1523	extern unsigned char first_one[65536];
				1524	int FirstOnet(unsigned long long arg1)
				1525	{
				1526	if (arg1 >> 48)
				1527	return (first_one[arg1 >> 48]);
				1528	return 0;
				1529	}
				1530
				1531
				1532	The following code is currently generated:
				1533	FirstOnet:
				1534	movl 8(%esp), %eax
				1535	cmpl $65536, %eax
				1536	movl 4(%esp), %ecx
				1537	jb .LBB1_2 # UnifiedReturnBlock
				1538	.LBB1_1: # ifthen
				1539	shrl $16, %eax
				1540	movzbl first_one(%eax), %eax
				1541	ret
				1542	.LBB1_2: # UnifiedReturnBlock
				1543	xorl %eax, %eax
				1544	ret
				1545
				1546	There are a few possible improvements here:
				1547	1. We should be able to eliminate the dead load into %ecx
				1548	2. We could change the "movl 8(%esp), %eax" into
				1549	"movzwl 10(%esp), %eax"; this lets us change the cmpl
				1550	into a testl, which is shorter, and eliminate the shift.
				1551
				1552	We could also in theory eliminate the branch by using a conditional
				1553	for the address of the load, but that seems unlikely to be worthwhile
				1554	in general.
				1555
				1556	//===---------------------------------------------------------------------===//
				1557
Chris Lattner	44a98ac	2008-02-28 04:52:59 +0000	[diff] [blame]	1558	We compile this function:
				1559
				1560	define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext %d) nounwind {
				1561	entry:
				1562	%tmp2 = icmp eq i8 %d, 0 ; <i1> [#uses=1]
				1563	br i1 %tmp2, label %bb7, label %bb
				1564
				1565	bb: ; preds = %entry
				1566	%tmp6 = add i32 %b, %a ; <i32> [#uses=1]
				1567	ret i32 %tmp6
				1568
				1569	bb7: ; preds = %entry
				1570	%tmp10 = sub i32 %a, %c ; <i32> [#uses=1]
				1571	ret i32 %tmp10
				1572	}
				1573
				1574	to:
				1575
				1576	_foo:
				1577	cmpb $0, 16(%esp)
				1578	movl 12(%esp), %ecx
				1579	movl 8(%esp), %eax
				1580	movl 4(%esp), %edx
				1581	je LBB1_2 # bb7
				1582	LBB1_1: # bb
				1583	addl %edx, %eax
				1584	ret
				1585	LBB1_2: # bb7
				1586	movl %edx, %eax
				1587	subl %ecx, %eax
				1588	ret
				1589
Gabor Greif	0266159	2008-03-06 10:51:21 +0000	[diff] [blame]	1590	The coalescer could coalesce "edx" with "eax" to avoid the movl in LBB1_2
Chris Lattner	44a98ac	2008-02-28 04:52:59 +0000	[diff] [blame]	1591	if it commuted the addl in LBB1_1.
				1592
				1593	//===---------------------------------------------------------------------===//
Evan Cheng	921dcba	2008-03-28 07:07:06 +0000	[diff] [blame]	1594
				1595	See rdar://4653682.
				1596
				1597	From flops:
				1598
				1599	LBB1_15: # bb310
				1600	cvtss2sd LCPI1_0, %xmm1
				1601	addsd %xmm1, %xmm0
				1602	movsd 176(%esp), %xmm2
				1603	mulsd %xmm0, %xmm2
				1604	movapd %xmm2, %xmm3
				1605	mulsd %xmm3, %xmm3
				1606	movapd %xmm3, %xmm4
				1607	mulsd LCPI1_23, %xmm4
				1608	addsd LCPI1_24, %xmm4
				1609	mulsd %xmm3, %xmm4
				1610	addsd LCPI1_25, %xmm4
				1611	mulsd %xmm3, %xmm4
				1612	addsd LCPI1_26, %xmm4
				1613	mulsd %xmm3, %xmm4
				1614	addsd LCPI1_27, %xmm4
				1615	mulsd %xmm3, %xmm4
				1616	addsd LCPI1_28, %xmm4
				1617	mulsd %xmm3, %xmm4
				1618	addsd %xmm1, %xmm4
				1619	mulsd %xmm2, %xmm4
				1620	movsd 152(%esp), %xmm1
				1621	addsd %xmm4, %xmm1
				1622	movsd %xmm1, 152(%esp)
				1623	incl %eax
				1624	cmpl %eax, %esi
				1625	jge LBB1_15 # bb310
				1626	LBB1_16: # bb358.loopexit
				1627	movsd 152(%esp), %xmm0
				1628	addsd %xmm0, %xmm0
				1629	addsd LCPI1_22, %xmm0
				1630	movsd %xmm0, 152(%esp)
				1631
				1632	Rather than spilling the result of the last addsd in the loop, we should
				1633	insert a copy to split the interval (one for the duration of the loop, one
				1634	extending to the fall through). The register pressure in the loop isn't high
				1635	enough to warrant the spill.
				1636
				1637	Also check why xmm7 is not used at all in the function.
Chris Lattner	16e5c78	2008-04-21 04:46:30 +0000	[diff] [blame]	1638
				1639	//===---------------------------------------------------------------------===//
				1640
				1641	Legalize loses track of the fact that bools are always zero extended when in
				1642	memory. This causes us to compile abort_gzip (from 164.gzip) from:
				1643
				1644	target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
				1645	target triple = "i386-apple-darwin8"
				1646	@in_exit.4870.b = internal global i1 false ; <i1*> [#uses=2]
				1647	define fastcc void @abort_gzip() noreturn nounwind {
				1648	entry:
				1649	%tmp.b.i = load i1* @in_exit.4870.b ; <i1> [#uses=1]
				1650	br i1 %tmp.b.i, label %bb.i, label %bb4.i
				1651	bb.i: ; preds = %entry
				1652	tail call void @exit( i32 1 ) noreturn nounwind
				1653	unreachable
				1654	bb4.i: ; preds = %entry
				1655	store i1 true, i1* @in_exit.4870.b
				1656	tail call void @exit( i32 1 ) noreturn nounwind
				1657	unreachable
				1658	}
				1659	declare void @exit(i32) noreturn nounwind
				1660
				1661	into:
				1662
				1663	_abort_gzip:
				1664	subl $12, %esp
				1665	movb _in_exit.4870.b, %al
				1666	notb %al
				1667	testb $1, %al
				1668	jne LBB1_2 ## bb4.i
				1669	LBB1_1: ## bb.i
				1670	...
				1671
				1672	//===---------------------------------------------------------------------===//
Chris Lattner	7cb1d33	2008-05-05 23:19:45 +0000	[diff] [blame]	1673
				1674	We compile:
				1675
				1676	int test(int x, int y) {
				1677	return x-y-1;
				1678	}
				1679
				1680	into (-m64):
				1681
				1682	_test:
				1683	decl %edi
				1684	movl %edi, %eax
				1685	subl %esi, %eax
				1686	ret
				1687
				1688	it would be better to codegen as: x+~y (notl+addl)
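
The identity behind this is ~y == -y - 1 in two's complement, so
x - y - 1 == x + ~y. A small C sketch using unsigned arithmetic (which wraps
the same way the 32-bit instructions do); the function names are
hypothetical:

```c
#include <stdint.h>

/* x - y - 1 as currently generated: decl, then subl. */
uint32_t sub_minus_one(uint32_t x, uint32_t y) {
    return x - y - 1u;
}

/* The suggested form: notl on y, then addl. */
uint32_t add_not(uint32_t x, uint32_t y) {
    return x + ~y;
}
```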