//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

We should add support for the "movbe" instruction, which does a byte-swapping
copy (3-addr bswap + memory support?). This is available on Atom processors.
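
As a rough sketch of what this buys (hypothetical function, not current
compiler output); note that movbe only has load and store forms, no reg-reg
form:

unsigned load_be32(unsigned *p) { return __builtin_bswap32(*p); }

currently something like:
	movl	(%eax), %eax
	bswapl	%eax

with movbe:
	movbe	(%eax), %eax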

//===---------------------------------------------------------------------===//

CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move. The X86
backend knows how to three-addressify this shift, but it appears the register
allocator isn't even asking it to do so in this case. We should investigate
why this isn't happening; it could have a significant impact on other important
cases for X86 as well.
15
16//===---------------------------------------------------------------------===//
17
This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
  return X/Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
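
For reference, divl already implements exactly this operation: it divides the
64-bit value in EDX:EAX by a 32-bit operand. A hand-written sketch of the
desired output (cdecl argument layout assumed); the #DE fault when the
quotient does not fit in 32 bits is the overflow question mentioned above:

_test:
	movl	4(%esp), %eax        # low half of X
	movl	8(%esp), %edx        # high half of X
	divl	12(%esp)             # EDX:EAX / Y, quotient -> EAX, remainder -> EDX
	ret                          # faults (#DE) if the quotient doesn't fit in 32 bits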
26
27//===---------------------------------------------------------------------===//
28
29Improvements to the multiply -> shift/add algorithm:
30http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
31
32//===---------------------------------------------------------------------===//
33
34Improve code like this (occurs fairly frequently, e.g. in LLVM):
35long long foo(int x) { return 1LL << x; }
36
37http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
38http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
39http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html
40
41Another useful one would be ~0ULL >> X and ~0ULL << X.
42
43One better solution for 1LL << x is:
44 xorl %eax, %eax
45 xorl %edx, %edx
46 testb $32, %cl
47 sete %al
48 setne %dl
49 sall %cl, %eax
50 sall %cl, %edx
51
52But that requires good 8-bit subreg support.
53
Eli Friedman577c7492008-02-21 21:16:49 +000054Also, this might be better. It's an extra shift, but it's one instruction
55shorter, and doesn't stress 8-bit subreg support.
56(From http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html,
57but without the unnecessary and.)
58 movl %ecx, %eax
59 shrl $5, %eax
60 movl %eax, %edx
61 xorl $1, %edx
62 sall %cl, %eax
	sall %cl, %edx
64
Dan Gohmanf17a25c2007-07-18 16:29:46 +00006564-bit shifts (in general) expand to really bad code. Instead of using
66cmovs, we should expand to a conditional branch like GCC produces.
67
68//===---------------------------------------------------------------------===//
69
70Compile this:
71_Bool f(_Bool a) { return a!=1; }
72
73into:
74 movzbl %dil, %eax
75 xorl $1, %eax
76 ret
77
Eli Friedman577c7492008-02-21 21:16:49 +000078(Although note that this isn't a legal way to express the code that llvm-gcc
79currently generates for that function.)
80
Dan Gohmanf17a25c2007-07-18 16:29:46 +000081//===---------------------------------------------------------------------===//
82
83Some isel ideas:
84
1. Dynamic programming based approach when compile time is not an
   issue.
872. Code duplication (addressing mode) during isel.
883. Other ideas from "Register-Sensitive Selection, Duplication, and
89 Sequencing of Instructions".
904. Scheduling for reduced register pressure. E.g. "Minimum Register
91 Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
92 and other related papers.
93 http://citeseer.ist.psu.edu/govindarajan01minimum.html
94
95//===---------------------------------------------------------------------===//
96
97Should we promote i16 to i32 to avoid partial register update stalls?
98
99//===---------------------------------------------------------------------===//
100
101Leave any_extend as pseudo instruction and hint to register
102allocator. Delay codegen until post register allocation.
Evan Chengfdbb6672007-10-12 18:22:55 +0000103Note. any_extend is now turned into an INSERT_SUBREG. We still need to teach
104the coalescer how to deal with it though.
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000105
106//===---------------------------------------------------------------------===//
107
It appears icc uses push for parameter passing. Need to investigate.
109
110//===---------------------------------------------------------------------===//
111
112Only use inc/neg/not instructions on processors where they are faster than
113add/sub/xor. They are slower on the P4 due to only updating some processor
114flags.
115
116//===---------------------------------------------------------------------===//
117
The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
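
A small illustration of the case meant here (hypothetical function names); the
two functions compute the same thing, but only in the second one does the load
appear as the second operand of the compare:

int lhs_load(int x, int *p) { return *p < x; }   // load is the first operand of the setcc
int rhs_load(int x, int *p) { return x > *p; }   // same result, load is the second operand

rhs_load matches (cmp reg, (load p)) as written, so the load can be folded into
a cmpl with a memory operand; lhs_load only gets the same code once the
combiner swaps the operands and the condition (setlt <-> setgt), which is the
canonicalization being asked for.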
123
124//===---------------------------------------------------------------------===//
125
126How about intrinsics? An example is:
127 *res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));
128
129compiles to
130 pmuludq (%eax), %xmm0
131 movl 8(%esp), %eax
132 movdqa (%eax), %xmm1
133 pmulhuw %xmm0, %xmm1
134
The transformation probably requires an X86-specific pass or a DAG combiner
target-specific hook.
137
138//===---------------------------------------------------------------------===//
139
140In many cases, LLVM generates code like this:
141
142_test:
143 movl 8(%esp), %eax
144 cmpl %eax, 4(%esp)
145 setl %al
146 movzbl %al, %eax
147 ret
148
149on some processors (which ones?), it is more efficient to do this:
150
151_test:
152 movl 8(%esp), %ebx
153 xor %eax, %eax
154 cmpl %ebx, 4(%esp)
155 setl %al
156 ret
157
158Doing this correctly is tricky though, as the xor clobbers the flags.
159
160//===---------------------------------------------------------------------===//
161
We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:

void setbit(int *target, int bit) {
  *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
  *target &= ~(1 << bit);
}
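
A sketch of the kind of output meant here (assuming the bit index is in 0..31,
which the C shift requires anyway):

_setbit:
	movl	4(%esp), %eax
	movl	8(%esp), %ecx
	btsl	%ecx, (%eax)         # set bit 'bit' of *target
	ret
_clearbit:
	movl	4(%esp), %eax
	movl	8(%esp), %ecx
	btrl	%ecx, (%eax)         # clear bit 'bit' of *target
	ret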
171
172//===---------------------------------------------------------------------===//
173
174Instead of the following for memset char*, 1, 10:
175
176 movl $16843009, 4(%edx)
177 movl $16843009, (%edx)
178 movw $257, 8(%edx)
179
180It might be better to generate
181
	movl $16843009, %eax
	movl %eax, 4(%edx)
	movl %eax, (%edx)
	movw %ax, 8(%edx)
186
187when we can spare a register. It reduces code size.
188
189//===---------------------------------------------------------------------===//
190
191Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
192get this:
193
Eli Friedman1aa1f2c2008-02-28 00:21:43 +0000194define i32 @test1(i32 %X) {
195 %Y = sdiv i32 %X, 8
196 ret i32 %Y
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000197}
198
199_test1:
200 movl 4(%esp), %eax
201 movl %eax, %ecx
202 sarl $31, %ecx
203 shrl $29, %ecx
204 addl %ecx, %eax
205 sarl $3, %eax
206 ret
207
208GCC knows several different ways to codegen it, one of which is this:
209
210_test1:
211 movl 4(%esp), %eax
212 cmpl $-1, %eax
213 leal 7(%eax), %ecx
214 cmovle %ecx, %eax
215 sarl $3, %eax
216 ret
217
218which is probably slower, but it's interesting at least :)
219
220//===---------------------------------------------------------------------===//
221
We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and rep/movsl.
We should leave these as libcalls for everything over a much lower threshold,
since libc is hand tuned for medium and large mem ops (avoiding RFO for large
stores, TLB preheating, etc.)
226
227//===---------------------------------------------------------------------===//
228
Optimize this into something reasonable:
 x * copysign(1.0, y) * copysign(1.0, z)
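
One plausible "reasonable" form, written out in C (hypothetical name, and
ignoring sNaN quieting / FP exception flags): the two copysign factors only
decide whether the sign of x flips, so the product reduces to xoring the sign
bits of y and z into x:

#include <string.h>

double fold_signs(double x, double y, double z) {
  unsigned long long xb, yb, zb;
  memcpy(&xb, &x, 8); memcpy(&yb, &y, 8); memcpy(&zb, &z, 8);
  xb ^= (yb ^ zb) & 0x8000000000000000ULL;  // flip x's sign iff signbit(y) != signbit(z)
  memcpy(&x, &xb, 8);
  return x;
}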
231
232//===---------------------------------------------------------------------===//
233
234Optimize copysign(x, *y) to use an integer load from y.
235
236//===---------------------------------------------------------------------===//
237
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000238The following tests perform worse with LSR:
239
240lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.
241
242//===---------------------------------------------------------------------===//
243
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000244Teach the coalescer to coalesce vregs of different register classes. e.g. FR32 /
245FR64 to VR128.
246
247//===---------------------------------------------------------------------===//
248
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000249Adding to the list of cmp / test poor codegen issues:
250
251int test(__m128 *A, __m128 *B) {
252 if (_mm_comige_ss(*A, *B))
253 return 3;
254 else
255 return 4;
256}
257
258_test:
259 movl 8(%esp), %eax
260 movaps (%eax), %xmm0
261 movl 4(%esp), %eax
262 movaps (%eax), %xmm1
263 comiss %xmm0, %xmm1
264 setae %al
265 movzbl %al, %ecx
266 movl $3, %eax
267 movl $4, %edx
268 cmpl $0, %ecx
269 cmove %edx, %eax
270 ret
271
Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There
are a number of issues. 1) We are introducing a setcc between the result of the
intrinsic call and select. 2) The intrinsic is expected to produce an i32 value
so an any_extend (which becomes a zero extend) is added.
276
277We probably need some kind of target DAG combine hook to fix this.
278
279//===---------------------------------------------------------------------===//
280
281We generate significantly worse code for this than GCC:
282http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
283http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701
284
285There is also one case we do worse on PPC.
286
287//===---------------------------------------------------------------------===//
288
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000289For this:
290
291int test(int a)
292{
293 return a * 3;
294}
295
We currently emit:
	imull $3, 4(%esp), %eax

Is this really what we should generate? Is imull three or four
cycles? Note: ICC generates this:
301 movl 4(%esp), %eax
302 leal (%eax,%eax,2), %eax
303
304The current instruction priority is based on pattern complexity. The former is
305more "complex" because it folds a load so the latter will not be emitted.
306
307Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
308should always try to match LEA first since the LEA matching code does some
309estimate to determine whether the match is profitable.
310
311However, if we care more about code size, then imull is better. It's two bytes
312shorter than movl + leal.
313
Eli Friedman9ab1db02008-11-30 07:52:27 +0000314On a Pentium M, both variants have the same characteristics with regard
315to throughput; however, the multiplication has a latency of four cycles, as
316opposed to two cycles for the movl+lea variant.
317
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000318//===---------------------------------------------------------------------===//
319
Eli Friedman577c7492008-02-21 21:16:49 +0000320__builtin_ffs codegen is messy.
Chris Lattnera86af9a2007-08-11 18:19:07 +0000321
Chris Lattnera86af9a2007-08-11 18:19:07 +0000322int ffs_(unsigned X) { return __builtin_ffs(X); }
323
Eli Friedman577c7492008-02-21 21:16:49 +0000324llvm produces:
325ffs_:
326 movl 4(%esp), %ecx
327 bsfl %ecx, %eax
328 movl $32, %edx
329 cmove %edx, %eax
330 incl %eax
331 xorl %edx, %edx
332 testl %ecx, %ecx
333 cmove %edx, %eax
Chris Lattnera86af9a2007-08-11 18:19:07 +0000334 ret
Eli Friedman577c7492008-02-21 21:16:49 +0000335
336vs gcc:
337
Chris Lattnera86af9a2007-08-11 18:19:07 +0000338_ffs_:
339 movl $-1, %edx
340 bsfl 4(%esp), %eax
341 cmove %edx, %eax
342 addl $1, %eax
343 ret
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000344
Eli Friedman577c7492008-02-21 21:16:49 +0000345Another example of __builtin_ffs (use predsimplify to eliminate a select):
346
347int foo (unsigned long j) {
348 if (j)
349 return __builtin_ffs (j) - 1;
350 else
351 return 0;
352}
353
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000354//===---------------------------------------------------------------------===//
355
It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.
361
362//===---------------------------------------------------------------------===//
363
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000364define i32 @foo(i32* %a, i32 %t) {
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000365entry:
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000366 br label %cond_true
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000367
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000368cond_true: ; preds = %cond_true, %entry
369 %x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ] ; <i32> [#uses=3]
370 %t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ] ; <i32> [#uses=1]
371 %tmp2 = getelementptr i32* %a, i32 %x.0.0 ; <i32*> [#uses=1]
372 %tmp3 = load i32* %tmp2 ; <i32> [#uses=1]
373 %tmp5 = add i32 %t_addr.0.0, %x.0.0 ; <i32> [#uses=1]
374 %tmp7 = add i32 %tmp5, %tmp3 ; <i32> [#uses=2]
375 %tmp9 = add i32 %x.0.0, 1 ; <i32> [#uses=2]
376 %tmp = icmp sgt i32 %tmp9, 39 ; <i1> [#uses=1]
377 br i1 %tmp, label %bb12, label %cond_true
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000378
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000379bb12: ; preds = %cond_true
380 ret i32 %tmp7
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000381}
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000382is pessimized by -loop-reduce and -indvars
383
384//===---------------------------------------------------------------------===//
385
386u32 to float conversion improvement:
387
388float uint32_2_float( unsigned u ) {
389 float fl = (int) (u & 0xffff);
390 float fh = (int) (u >> 16);
391 fh *= 0x1.0p16f;
392 return fh + fl;
393}
394
39500000000 subl $0x04,%esp
39600000003 movl 0x08(%esp,1),%eax
39700000007 movl %eax,%ecx
39800000009 shrl $0x10,%ecx
3990000000c cvtsi2ss %ecx,%xmm0
40000000010 andl $0x0000ffff,%eax
40100000015 cvtsi2ss %eax,%xmm1
40200000019 mulss 0x00000078,%xmm0
40300000021 addss %xmm1,%xmm0
40400000025 movss %xmm0,(%esp,1)
4050000002a flds (%esp,1)
4060000002d addl $0x04,%esp
40700000030 ret
408
409//===---------------------------------------------------------------------===//
410
When using the fastcc ABI, align the stack slot of a double argument on an
8-byte boundary to improve performance.
413
414//===---------------------------------------------------------------------===//
415
416Codegen:
417
418int f(int a, int b) {
419 if (a == 4 || a == 6)
420 b++;
421 return b;
422}
423
424
as:

or eax, 2
cmp eax, 6
jz label
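
This works because 4 and 6 differ only in bit 1: forcing that bit on maps
exactly {4, 6} to 6 and nothing else, i.e. the equivalent source-level test is:

int f(int a, int b) {
  if ((a | 2) == 6)      // true exactly when a == 4 || a == 6
    b++;
  return b;
}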
430
431//===---------------------------------------------------------------------===//
432
433GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
434simplifications for integer "x cmp y ? a : b". For example, instead of:
435
436int G;
437void f(int X, int Y) {
438 G = X < 0 ? 14 : 13;
439}
440
441compiling to:
442
443_f:
444 movl $14, %eax
445 movl $13, %ecx
446 movl 4(%esp), %edx
447 testl %edx, %edx
448 cmovl %eax, %ecx
449 movl %ecx, _G
450 ret
451
452it could be:
453_f:
454 movl 4(%esp), %eax
455 sarl $31, %eax
456 notl %eax
457 addl $14, %eax
458 movl %eax, _G
459 ret
460
461etc.
462
Chris Lattnere7037c22007-11-02 17:04:20 +0000463Another is:
464int usesbb(unsigned int a, unsigned int b) {
465 return (a < b ? -1 : 0);
466}
467to:
468_usesbb:
469 movl 8(%esp), %eax
470 cmpl %eax, 4(%esp)
471 sbbl %eax, %eax
472 ret
473
474instead of:
475_usesbb:
476 xorl %eax, %eax
477 movl 8(%esp), %ecx
478 cmpl %ecx, 4(%esp)
479 movl $4294967295, %ecx
480 cmovb %ecx, %eax
481 ret
482
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000483//===---------------------------------------------------------------------===//
484
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000485Consider the expansion of:
486
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000487define i32 @test3(i32 %X) {
488 %tmp1 = urem i32 %X, 255
489 ret i32 %tmp1
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000490}
491
492Currently it compiles to:
493
494...
495 movl $2155905153, %ecx
496 movl 8(%esp), %esi
497 movl %esi, %eax
498 mull %ecx
499...
500
501This could be "reassociated" into:
502
503 movl $2155905153, %eax
504 movl 8(%esp), %ecx
505 mull %ecx
506
507to avoid the copy. In fact, the existing two-address stuff would do this
508except that mul isn't a commutative 2-addr instruction. I guess this has
509to be done at isel time based on the #uses to mul?
510
511//===---------------------------------------------------------------------===//
512
Make sure the instruction which starts a loop does not cross a cacheline
boundary. This requires knowing the exact length of each machine instruction.
That is somewhat complicated, but doable. Example 256.bzip2:
516
517In the new trace, the hot loop has an instruction which crosses a cacheline
518boundary. In addition to potential cache misses, this can't help decoding as I
519imagine there has to be some kind of complicated decoder reset and realignment
520to grab the bytes from the next cacheline.
521
522532 532 0x3cfc movb (1809(%esp, %esi), %bl <<<--- spans 2 64 byte lines
Eli Friedman9ab1db02008-11-30 07:52:27 +0000523942 942 0x3d03 movl %dh, (1809(%esp, %esi)
524937 937 0x3d0a incl %esi
5253 3 0x3d0b cmpb %bl, %dl
Dan Gohmanf17a25c2007-07-18 16:29:46 +000052627 27 0x3d0d jnz 0x000062db <main+11707>
527
528//===---------------------------------------------------------------------===//
529
530In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.
531
532//===---------------------------------------------------------------------===//
533
This could be a single 16-bit load.

int f(char *p) {
  if ((p[0] == 1) & (p[1] == 2)) return 1;
  return 0;
}
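
Both bytes are loaded unconditionally (the & does not short-circuit), so the
loads can be merged; a sketch of the desired output on a little-endian target:

_f:
	movl	4(%esp), %eax
	cmpw	$0x0201, (%eax)      # p[1]:p[0] == 0x02:0x01
	sete	%al
	movzbl	%al, %eax
	ret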
540
541//===---------------------------------------------------------------------===//
542
543We should inline lrintf and probably other libc functions.
544
545//===---------------------------------------------------------------------===//
546
547Start using the flags more. For example, compile:
548
549int add_zf(int *x, int y, int a, int b) {
550 if ((*x += y) == 0)
551 return a;
552 else
553 return b;
554}
555
556to:
557 addl %esi, (%rdi)
558 movl %edx, %eax
559 cmovne %ecx, %eax
560 ret
561instead of:
562
563_add_zf:
564 addl (%rdi), %esi
565 movl %esi, (%rdi)
566 testl %esi, %esi
567 cmove %edx, %ecx
568 movl %ecx, %eax
569 ret
570
571and:
572
573int add_zf(int *x, int y, int a, int b) {
574 if ((*x + y) < 0)
575 return a;
576 else
577 return b;
578}
579
580to:
581
582add_zf:
583 addl (%rdi), %esi
584 movl %edx, %eax
585 cmovns %ecx, %eax
586 ret
587
588instead of:
589
590_add_zf:
591 addl (%rdi), %esi
592 testl %esi, %esi
593 cmovs %edx, %ecx
594 movl %ecx, %eax
595 ret
596
597//===---------------------------------------------------------------------===//
598
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000599These two functions have identical effects:
600
601unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
602unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
603
604We currently compile them to:
605
606_f:
607 movl 4(%esp), %eax
608 movl %eax, %ecx
609 incl %ecx
610 movl 8(%esp), %edx
611 cmpl %edx, %ecx
612 jne LBB1_2 #UnifiedReturnBlock
613LBB1_1: #cond_true
614 addl $2, %eax
615 ret
616LBB1_2: #UnifiedReturnBlock
617 movl %ecx, %eax
618 ret
619_f2:
620 movl 4(%esp), %eax
621 movl %eax, %ecx
622 incl %ecx
623 cmpl 8(%esp), %ecx
624 sete %cl
625 movzbl %cl, %ecx
626 leal 1(%ecx,%eax), %eax
627 ret
628
629both of which are inferior to GCC's:
630
631_f:
632 movl 4(%esp), %edx
633 leal 1(%edx), %eax
634 addl $2, %edx
635 cmpl 8(%esp), %eax
636 cmove %edx, %eax
637 ret
638_f2:
639 movl 4(%esp), %eax
640 addl $1, %eax
641 xorl %edx, %edx
642 cmpl 8(%esp), %eax
643 sete %dl
644 addl %edx, %eax
645 ret
646
647//===---------------------------------------------------------------------===//
648
649This code:
650
651void test(int X) {
652 if (X) abort();
653}
654
655is currently compiled to:
656
657_test:
658 subl $12, %esp
659 cmpl $0, 16(%esp)
660 jne LBB1_1
661 addl $12, %esp
662 ret
663LBB1_1:
664 call L_abort$stub
665
666It would be better to produce:
667
668_test:
669 subl $12, %esp
670 cmpl $0, 16(%esp)
671 jne L_abort$stub
672 addl $12, %esp
673 ret
674
675This can be applied to any no-return function call that takes no arguments etc.
676Alternatively, the stack save/restore logic could be shrink-wrapped, producing
677something like this:
678
679_test:
680 cmpl $0, 4(%esp)
681 jne LBB1_1
682 ret
683LBB1_1:
684 subl $12, %esp
685 call L_abort$stub
686
687Both are useful in different situations. Finally, it could be shrink-wrapped
688and tail called, like this:
689
690_test:
691 cmpl $0, 4(%esp)
692 jne LBB1_1
693 ret
694LBB1_1:
695 pop %eax # realign stack.
696 call L_abort$stub
697
698Though this probably isn't worth it.
699
700//===---------------------------------------------------------------------===//
701
702We need to teach the codegen to convert two-address INC instructions to LEA
Chris Lattner0d64ec32007-08-11 18:16:46 +0000703when the flags are dead (likewise dec). For example, on X86-64, compile:
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000704
705int foo(int A, int B) {
706 return A+1;
707}
708
709to:
710
711_foo:
712 leal 1(%edi), %eax
713 ret
714
715instead of:
716
717_foo:
718 incl %edi
719 movl %edi, %eax
720 ret
721
722Another example is:
723
724;; X's live range extends beyond the shift, so the register allocator
725;; cannot coalesce it with Y. Because of this, a copy needs to be
726;; emitted before the shift to save the register value before it is
727;; clobbered. However, this copy is not needed if the register
728;; allocator turns the shift into an LEA. This also occurs for ADD.
729
730; Check that the shift gets turned into an LEA.
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000731; RUN: llvm-as < %s | llc -march=x86 -x86-asm-syntax=intel | \
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000732; RUN: not grep {mov E.X, E.X}
733
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000734@G = external global i32 ; <i32*> [#uses=3]
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000735
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000736define i32 @test1(i32 %X, i32 %Y) {
737 %Z = add i32 %X, %Y ; <i32> [#uses=1]
738 volatile store i32 %Y, i32* @G
739 volatile store i32 %Z, i32* @G
740 ret i32 %X
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000741}
742
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000743define i32 @test2(i32 %X) {
744 %Z = add i32 %X, 1 ; <i32> [#uses=1]
745 volatile store i32 %Z, i32* @G
746 ret i32 %X
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000747}
748
749//===---------------------------------------------------------------------===//
750
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000751Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
752a neg instead of a sub instruction. Consider:
753
754int test(char X) { return 7-X; }
755
756we currently produce:
757_test:
758 movl $7, %eax
759 movsbl 4(%esp), %ecx
760 subl %ecx, %eax
761 ret
762
763We would use one fewer register if codegen'd as:
764
765 movsbl 4(%esp), %eax
766 neg %eax
767 add $7, %eax
768 ret
769
770Note that this isn't beneficial if the load can be folded into the sub. In
771this case, we want a sub:
772
773int test(int X) { return 7-X; }
774_test:
775 movl $7, %eax
776 subl 4(%esp), %eax
777 ret
778
779//===---------------------------------------------------------------------===//
780
Chris Lattner32f65872007-08-20 02:14:33 +0000781Leaf functions that require one 4-byte spill slot have a prolog like this:
782
783_foo:
784 pushl %esi
785 subl $4, %esp
786...
787and an epilog like this:
788 addl $4, %esp
789 popl %esi
790 ret
791
792It would be smaller, and potentially faster, to push eax on entry and to
793pop into a dummy register instead of using addl/subl of esp. Just don't pop
794into any return registers :)
795
796//===---------------------------------------------------------------------===//
Chris Lattner44b03cb2007-08-23 15:22:07 +0000797
798The X86 backend should fold (branch (or (setcc, setcc))) into multiple
799branches. We generate really poor code for:
800
801double testf(double a) {
802 return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
803}
804
805For example, the entry BB is:
806
807_testf:
808 subl $20, %esp
809 pxor %xmm0, %xmm0
810 movsd 24(%esp), %xmm1
811 ucomisd %xmm0, %xmm1
812 setnp %al
813 sete %cl
814 testb %cl, %al
815 jne LBB1_5 # UnifiedReturnBlock
816LBB1_1: # cond_true
817
818
819it would be better to replace the last four instructions with:
820
821 jp LBB1_1
822 je LBB1_5
823LBB1_1:
824
825We also codegen the inner ?: into a diamond:
826
827 cvtss2sd LCPI1_0(%rip), %xmm2
828 cvtss2sd LCPI1_1(%rip), %xmm3
829 ucomisd %xmm1, %xmm0
830 ja LBB1_3 # cond_true
831LBB1_2: # cond_true
832 movapd %xmm3, %xmm2
833LBB1_3: # cond_true
834 movapd %xmm2, %xmm0
835 ret
836
837We should sink the load into xmm3 into the LBB1_2 block. This should
838be pretty easy, and will nuke all the copies.
839
840//===---------------------------------------------------------------------===//
Chris Lattner4084d492007-09-10 21:43:18 +0000841
842This:
843 #include <algorithm>
844 inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
845 { return std::make_pair(a + b, a + b < a); }
846 bool no_overflow(unsigned a, unsigned b)
847 { return !full_add(a, b).second; }
848
849Should compile to:
850
851
852 _Z11no_overflowjj:
853 addl %edi, %esi
854 setae %al
855 ret
856
Eli Friedman577c7492008-02-21 21:16:49 +0000857FIXME: That code looks wrong; bool return is normally defined as zext.
858
Chris Lattner4084d492007-09-10 21:43:18 +0000859on x86-64, not:
860
861__Z11no_overflowjj:
862 addl %edi, %esi
863 cmpl %edi, %esi
864 setae %al
865 movzbl %al, %eax
866 ret
867
868
869//===---------------------------------------------------------------------===//
Evan Cheng35127a62007-09-10 22:16:37 +0000870
871Re-materialize MOV32r0 etc. with xor instead of changing them to moves if the
872condition register is dead. xor reg reg is shorter than mov reg, #0.
Chris Lattnera487bf72007-09-26 06:29:31 +0000873
874//===---------------------------------------------------------------------===//
875
Bill Wendling7f436dd2007-10-02 20:42:59 +0000876The following code:
877
Bill Wendlingc2036e32007-10-02 20:54:32 +0000878bb114.preheader: ; preds = %cond_next94
879 %tmp231232 = sext i16 %tmp62 to i32 ; <i32> [#uses=1]
880 %tmp233 = sub i32 32, %tmp231232 ; <i32> [#uses=1]
881 %tmp245246 = sext i16 %tmp65 to i32 ; <i32> [#uses=1]
882 %tmp252253 = sext i16 %tmp68 to i32 ; <i32> [#uses=1]
883 %tmp254 = sub i32 32, %tmp252253 ; <i32> [#uses=1]
884 %tmp553554 = bitcast i16* %tmp37 to i8* ; <i8*> [#uses=2]
885 %tmp583584 = sext i16 %tmp98 to i32 ; <i32> [#uses=1]
886 %tmp585 = sub i32 32, %tmp583584 ; <i32> [#uses=1]
887 %tmp614615 = sext i16 %tmp101 to i32 ; <i32> [#uses=1]
888 %tmp621622 = sext i16 %tmp104 to i32 ; <i32> [#uses=1]
889 %tmp623 = sub i32 32, %tmp621622 ; <i32> [#uses=1]
890 br label %bb114
891
892produces:
893
Bill Wendling7f436dd2007-10-02 20:42:59 +0000894LBB3_5: # bb114.preheader
895 movswl -68(%ebp), %eax
896 movl $32, %ecx
897 movl %ecx, -80(%ebp)
898 subl %eax, -80(%ebp)
899 movswl -52(%ebp), %eax
900 movl %ecx, -84(%ebp)
901 subl %eax, -84(%ebp)
902 movswl -70(%ebp), %eax
903 movl %ecx, -88(%ebp)
904 subl %eax, -88(%ebp)
905 movswl -50(%ebp), %eax
906 subl %eax, %ecx
907 movl %ecx, -76(%ebp)
908 movswl -42(%ebp), %eax
909 movl %eax, -92(%ebp)
910 movswl -66(%ebp), %eax
911 movl %eax, -96(%ebp)
912 movw $0, -98(%ebp)
913
Chris Lattner792bae52007-10-03 03:40:24 +0000914This appears to be bad because the RA is not folding the store to the stack
915slot into the movl. The above instructions could be:
916 movl $32, -80(%ebp)
917...
918 movl $32, -84(%ebp)
919...
920This seems like a cross between remat and spill folding.
921
Bill Wendlingc2036e32007-10-02 20:54:32 +0000922This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't
Bill Wendling7f436dd2007-10-02 20:42:59 +0000923change, so we could simply subtract %eax from %ecx first and then use %ecx (or
924vice-versa).
925
926//===---------------------------------------------------------------------===//
927
Bill Wendling54c4f832007-10-02 21:49:31 +0000928This code:
929
930 %tmp659 = icmp slt i16 %tmp654, 0 ; <i1> [#uses=1]
931 br i1 %tmp659, label %cond_true662, label %cond_next715
932
933produces this:
934
935 testw %cx, %cx
936 movswl %cx, %esi
937 jns LBB4_109 # cond_next715
938
939Shark tells us that using %cx in the testw instruction is sub-optimal. It
940suggests using the 32-bit register (which is what ICC uses).
941
942//===---------------------------------------------------------------------===//
Chris Lattner802c62a2007-10-03 17:10:03 +0000943
Chris Lattnerae259992007-10-04 15:47:27 +0000944We compile this:
945
946void compare (long long foo) {
947 if (foo < 4294967297LL)
948 abort();
949}
950
951to:
952
Eli Friedman577c7492008-02-21 21:16:49 +0000953compare:
954 subl $4, %esp
955 cmpl $0, 8(%esp)
Chris Lattnerae259992007-10-04 15:47:27 +0000956 setne %al
957 movzbw %al, %ax
Eli Friedman577c7492008-02-21 21:16:49 +0000958 cmpl $1, 12(%esp)
Chris Lattnerae259992007-10-04 15:47:27 +0000959 setg %cl
960 movzbw %cl, %cx
961 cmove %ax, %cx
Eli Friedman577c7492008-02-21 21:16:49 +0000962 testb $1, %cl
963 jne .LBB1_2 # UnifiedReturnBlock
964.LBB1_1: # ifthen
965 call abort
966.LBB1_2: # UnifiedReturnBlock
967 addl $4, %esp
968 ret
Chris Lattnerae259992007-10-04 15:47:27 +0000969
970(also really horrible code on ppc). This is due to the expand code for 64-bit
971compares. GCC produces multiple branches, which is much nicer:
972
Eli Friedman577c7492008-02-21 21:16:49 +0000973compare:
974 subl $12, %esp
975 movl 20(%esp), %edx
976 movl 16(%esp), %eax
977 decl %edx
978 jle .L7
979.L5:
980 addl $12, %esp
981 ret
982 .p2align 4,,7
983.L7:
984 jl .L4
Chris Lattnerae259992007-10-04 15:47:27 +0000985 cmpl $0, %eax
Eli Friedman577c7492008-02-21 21:16:49 +0000986 .p2align 4,,8
987 ja .L5
988.L4:
989 .p2align 4,,9
990 call abort
Chris Lattnerae259992007-10-04 15:47:27 +0000991
992//===---------------------------------------------------------------------===//
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000993
Tail call optimization improvements: Tail call optimization currently
pushes all arguments on the top of the stack (their normal place for
non-tail call optimized calls) that source from the caller's arguments
or that source from a virtual register (also possibly sourcing from
the caller's arguments).
This is done to prevent overwriting of parameters (see example
below) that might be used later.
Arnold Schwaighofer373e8652007-10-12 21:30:57 +00001001
1002example:
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +00001003
1004int callee(int32, int64);
1005int caller(int32 arg1, int32 arg2) {
1006 int64 local = arg2 * 2;
1007 return callee(arg2, (int64)local);
1008}
1009
1010[arg1] [!arg2 no longer valid since we moved local onto it]
1011[arg2] -> [(int64)
1012[RETADDR] local ]
1013
Arnold Schwaighofer373e8652007-10-12 21:30:57 +00001014Moving arg1 onto the stack slot of callee function would overwrite
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +00001015arg2 of the caller.
1016
1017Possible optimizations:
1018
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +00001019
Arnold Schwaighofer373e8652007-10-12 21:30:57 +00001020 - Analyse the actual parameters of the callee to see which would
1021 overwrite a caller parameter which is used by the callee and only
1022 push them onto the top of the stack.
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +00001023
1024 int callee (int32 arg1, int32 arg2);
1025 int caller (int32 arg1, int32 arg2) {
1026 return callee(arg1,arg2);
1027 }
1028
Arnold Schwaighofer373e8652007-10-12 21:30:57 +00001029 Here we don't need to write any variables to the top of the stack
1030 since they don't overwrite each other.
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +00001031
1032 int callee (int32 arg1, int32 arg2);
1033 int caller (int32 arg1, int32 arg2) {
1034 return callee(arg2,arg1);
1035 }
1036
Arnold Schwaighofer373e8652007-10-12 21:30:57 +00001037 Here we need to push the arguments because they overwrite each
1038 other.
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +00001039
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +00001040//===---------------------------------------------------------------------===//
Evan Cheng7f1ad6a2007-10-28 04:01:09 +00001041
1042main ()
1043{
1044 int i = 0;
1045 unsigned long int z = 0;
1046
1047 do {
1048 z -= 0x00004000;
1049 i++;
1050 if (i > 0x00040000)
1051 abort ();
1052 } while (z > 0);
1053 exit (0);
1054}
1055
1056gcc compiles this to:
1057
1058_main:
1059 subl $28, %esp
1060 xorl %eax, %eax
1061 jmp L2
1062L3:
1063 cmpl $262144, %eax
1064 je L10
1065L2:
1066 addl $1, %eax
1067 cmpl $262145, %eax
1068 jne L3
1069 call L_abort$stub
1070L10:
1071 movl $0, (%esp)
1072 call L_exit$stub
1073
1074llvm:
1075
1076_main:
1077 subl $12, %esp
1078 movl $1, %eax
1079 movl $16384, %ecx
1080LBB1_1: # bb
1081 cmpl $262145, %eax
1082 jge LBB1_4 # cond_true
1083LBB1_2: # cond_next
1084 incl %eax
1085 addl $4294950912, %ecx
1086 cmpl $16384, %ecx
1087 jne LBB1_1 # bb
1088LBB1_3: # bb11
1089 xorl %eax, %eax
1090 addl $12, %esp
1091 ret
1092LBB1_4: # cond_true
1093 call L_abort$stub
1094
10951. LSR should rewrite the first cmp with induction variable %ecx.
10962. DAG combiner should fold
1097 leal 1(%eax), %edx
1098 cmpl $262145, %edx
1099 =>
1100 cmpl $262144, %eax
1101
1102//===---------------------------------------------------------------------===//
Chris Lattner358670b2007-11-24 06:13:33 +00001103
1104define i64 @test(double %X) {
1105 %Y = fptosi double %X to i64
1106 ret i64 %Y
1107}
1108
1109compiles to:
1110
1111_test:
1112 subl $20, %esp
1113 movsd 24(%esp), %xmm0
1114 movsd %xmm0, 8(%esp)
1115 fldl 8(%esp)
1116 fisttpll (%esp)
1117 movl 4(%esp), %edx
1118 movl (%esp), %eax
1119 addl $20, %esp
1120 #FP_REG_KILL
1121 ret
1122
1123This should just fldl directly from the input stack slot.
Chris Lattner10d54d12007-12-05 22:58:19 +00001124
1125//===---------------------------------------------------------------------===//
1126
1127This code:
1128int foo (int x) { return (x & 65535) | 255; }
1129
1130Should compile into:
1131
1132_foo:
1133 movzwl 4(%esp), %eax
Eli Friedman577c7492008-02-21 21:16:49 +00001134 orl $255, %eax
Chris Lattner10d54d12007-12-05 22:58:19 +00001135 ret
1136
1137instead of:
1138_foo:
1139 movl $255, %eax
1140 orl 4(%esp), %eax
1141 andl $65535, %eax
1142 ret
1143
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001144//===---------------------------------------------------------------------===//
1145
Chris Lattnereec7ac02008-02-21 06:51:29 +00001146We're codegen'ing multiply of long longs inefficiently:
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001147
Chris Lattnereec7ac02008-02-21 06:51:29 +00001148unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) {
1149 return arg1 * arg2;
1150}
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001151
Chris Lattnereec7ac02008-02-21 06:51:29 +00001152We compile to (fomit-frame-pointer):
1153
1154_LLM:
1155 pushl %esi
1156 movl 8(%esp), %ecx
1157 movl 16(%esp), %esi
1158 movl %esi, %eax
1159 mull %ecx
1160 imull 12(%esp), %esi
1161 addl %edx, %esi
1162 imull 20(%esp), %ecx
1163 movl %esi, %edx
1164 addl %ecx, %edx
1165 popl %esi
1166 ret
1167
1168This looks like a scheduling deficiency and lack of remat of the load from
1169the argument area. ICC apparently produces:
1170
1171 movl 8(%esp), %ecx
1172 imull 12(%esp), %ecx
1173 movl 16(%esp), %eax
1174 imull 4(%esp), %eax
1175 addl %eax, %ecx
1176 movl 4(%esp), %eax
1177 mull 12(%esp)
1178 addl %ecx, %edx
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001179 ret
1180
Chris Lattnereec7ac02008-02-21 06:51:29 +00001181Note that it remat'd loads from 4(esp) and 12(esp). See this GCC PR:
1182http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001183
1184//===---------------------------------------------------------------------===//
1185
Chris Lattner2b55ebd2007-12-24 19:27:46 +00001186We can fold a store into "zeroing a reg". Instead of:
1187
1188xorl %eax, %eax
1189movl %eax, 124(%esp)
1190
1191we should get:
1192
1193movl $0, 124(%esp)
1194
1195if the flags of the xor are dead.
1196
Chris Lattner459ff992008-01-11 18:00:13 +00001197Likewise, we isel "x<<1" into "add reg,reg". If reg is spilled, this should
1198be folded into: shl [mem], 1
1199
Chris Lattner2b55ebd2007-12-24 19:27:46 +00001200//===---------------------------------------------------------------------===//
Chris Lattner64400952007-12-28 21:50:40 +00001201
1202This testcase misses a read/modify/write opportunity (from PR1425):
1203
1204void vertical_decompose97iH1(int *b0, int *b1, int *b2, int width){
1205 int i;
1206 for(i=0; i<width; i++)
1207 b1[i] += (1*(b0[i] + b2[i])+0)>>0;
1208}
1209
1210We compile it down to:
1211
1212LBB1_2: # bb
1213 movl (%esi,%edi,4), %ebx
1214 addl (%ecx,%edi,4), %ebx
1215 addl (%edx,%edi,4), %ebx
1216 movl %ebx, (%ecx,%edi,4)
1217 incl %edi
1218 cmpl %eax, %edi
1219 jne LBB1_2 # bb
1220
1221the inner loop should add to the memory location (%ecx,%edi,4), saving
1222a mov. Something like:
1223
1224 movl (%esi,%edi,4), %ebx
1225 addl (%edx,%edi,4), %ebx
1226 addl %ebx, (%ecx,%edi,4)
1227
Chris Lattnerbde73102007-12-29 05:51:58 +00001228Here is another interesting example:
1229
1230void vertical_compose97iH1(int *b0, int *b1, int *b2, int width){
1231 int i;
1232 for(i=0; i<width; i++)
1233 b1[i] -= (1*(b0[i] + b2[i])+0)>>0;
1234}
1235
1236We miss the r/m/w opportunity here by using 2 subs instead of an add+sub[mem]:
1237
1238LBB9_2: # bb
1239 movl (%ecx,%edi,4), %ebx
1240 subl (%esi,%edi,4), %ebx
1241 subl (%edx,%edi,4), %ebx
1242 movl %ebx, (%ecx,%edi,4)
1243 incl %edi
1244 cmpl %eax, %edi
1245 jne LBB9_2 # bb
1246
Additionally, LSR should rewrite the exit condition of these loops to use
a stride-4 IV, which would allow all the scales in the loop to go away.
This would result in smaller code and more efficient microops.
1250
1251//===---------------------------------------------------------------------===//
Chris Lattner0362a362008-01-07 21:59:58 +00001252
1253In SSE mode, we turn abs and neg into a load from the constant pool plus a xor
1254or and instruction, for example:
1255
Chris Lattnerb4cbb682008-01-09 00:37:18 +00001256 xorpd LCPI1_0, %xmm2
Chris Lattner0362a362008-01-07 21:59:58 +00001257
1258However, if xmm2 gets spilled, we end up with really ugly code like this:
1259
Chris Lattnerb4cbb682008-01-09 00:37:18 +00001260 movsd (%esp), %xmm0
1261 xorpd LCPI1_0, %xmm0
1262 movsd %xmm0, (%esp)
Chris Lattner0362a362008-01-07 21:59:58 +00001263
1264Since we 'know' that this is a 'neg', we can actually "fold" the spill into
1265the neg/abs instruction, turning it into an *integer* operation, like this:
1266
1267 xorl 2147483648, [mem+4] ## 2147483648 = (1 << 31)
1268
1269you could also use xorb, but xorl is less likely to lead to a partial register
Chris Lattnerb4cbb682008-01-09 00:37:18 +00001270stall. Here is a contrived testcase:
1271
1272double a, b, c;
1273void test(double *P) {
1274 double X = *P;
1275 a = X;
1276 bar();
1277 X = -X;
1278 b = X;
1279 bar();
1280 c = X;
1281}
Chris Lattner0362a362008-01-07 21:59:58 +00001282
1283//===---------------------------------------------------------------------===//
Andrew Lenharth785610d2008-02-16 01:24:58 +00001284
1285handling llvm.memory.barrier on pre SSE2 cpus
1286
1287should generate:
1288lock ; mov %esp, %esp
1289
1290//===---------------------------------------------------------------------===//
Chris Lattner7644ff32008-02-17 19:43:57 +00001291
The code generated on x86 for checking for signed overflow on a multiply, done
the obvious way, is much longer than it needs to be.

int x(int a, int b) {
  long long prod = (long long)a*b;
  return prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1);
}
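
The range check is exactly "the 32x32 multiply does not fit in 32 bits", which
is what the OF flag of a one-operand imull reports, so a sketch of the desired
output (not current output) is something like:

_x:
	movl	4(%esp), %eax
	imull	8(%esp)              # EDX:EAX = a*b; OF set iff the result doesn't fit in 32 bits
	seto	%al
	movzbl	%al, %eax
	ret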
1299
1300See PR2053 for more details.
1301
1302//===---------------------------------------------------------------------===//
Chris Lattner83f22362008-02-18 18:30:13 +00001303
We should investigate using cdq/cltd (effect: edx = sar eax, 31)
more aggressively; it should cost the same as a move+shift on any modern
processor, but it's a lot shorter. Downside is that it puts more
pressure on register allocation because it has fixed operands.
1308
1309Example:
1310int abs(int x) {return x < 0 ? -x : x;}
1311
1312gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.:
1313abs:
1314 movl 4(%esp), %eax
1315 cltd
1316 xorl %edx, %eax
1317 subl %edx, %eax
1318 ret
1319
1320//===---------------------------------------------------------------------===//
1321
1322Consider:
Chris Lattner83f22362008-02-18 18:30:13 +00001323int test(unsigned long a, unsigned long b) { return -(a < b); }
1324
1325We currently compile this to:
1326
1327define i32 @test(i32 %a, i32 %b) nounwind {
1328 %tmp3 = icmp ult i32 %a, %b ; <i1> [#uses=1]
1329 %tmp34 = zext i1 %tmp3 to i32 ; <i32> [#uses=1]
1330 %tmp5 = sub i32 0, %tmp34 ; <i32> [#uses=1]
1331 ret i32 %tmp5
1332}
1333
1334and
1335
1336_test:
1337 movl 8(%esp), %eax
1338 cmpl %eax, 4(%esp)
1339 setb %al
1340 movzbl %al, %eax
1341 negl %eax
1342 ret
1343
1344Several deficiencies here. First, we should instcombine zext+neg into sext:
1345
1346define i32 @test2(i32 %a, i32 %b) nounwind {
1347 %tmp3 = icmp ult i32 %a, %b ; <i1> [#uses=1]
1348 %tmp34 = sext i1 %tmp3 to i32 ; <i32> [#uses=1]
1349 ret i32 %tmp34
1350}
1351
1352However, before we can do that, we have to fix the bad codegen that we get for
1353sext from bool:
1354
1355_test2:
1356 movl 8(%esp), %eax
1357 cmpl %eax, 4(%esp)
1358 setb %al
1359 movzbl %al, %eax
1360 shll $31, %eax
1361 sarl $31, %eax
1362 ret
1363
1364This code should be at least as good as the code above. Once this is fixed, we
1365can optimize this specific case even more to:
1366
1367 movl 8(%esp), %eax
1368 xorl %ecx, %ecx
1369 cmpl %eax, 4(%esp)
1370 sbbl %ecx, %ecx
1371
1372//===---------------------------------------------------------------------===//
Eli Friedman1aa1f2c2008-02-28 00:21:43 +00001373
1374Take the following code (from
1375http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541):
1376
1377extern unsigned char first_one[65536];
1378int FirstOnet(unsigned long long arg1)
1379{
1380 if (arg1 >> 48)
1381 return (first_one[arg1 >> 48]);
1382 return 0;
1383}
1384
1385
1386The following code is currently generated:
1387FirstOnet:
1388 movl 8(%esp), %eax
1389 cmpl $65536, %eax
1390 movl 4(%esp), %ecx
1391 jb .LBB1_2 # UnifiedReturnBlock
1392.LBB1_1: # ifthen
1393 shrl $16, %eax
1394 movzbl first_one(%eax), %eax
1395 ret
1396.LBB1_2: # UnifiedReturnBlock
1397 xorl %eax, %eax
1398 ret
1399
1400There are a few possible improvements here:
14011. We should be able to eliminate the dead load into %ecx
14022. We could change the "movl 8(%esp), %eax" into
1403 "movzwl 10(%esp), %eax"; this lets us change the cmpl
1404 into a testl, which is shorter, and eliminate the shift.
1405
1406We could also in theory eliminate the branch by using a conditional
1407for the address of the load, but that seems unlikely to be worthwhile
1408in general.
1409
1410//===---------------------------------------------------------------------===//
1411
Chris Lattner44a98ac2008-02-28 04:52:59 +00001412We compile this function:
1413
1414define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext %d) nounwind {
1415entry:
1416 %tmp2 = icmp eq i8 %d, 0 ; <i1> [#uses=1]
1417 br i1 %tmp2, label %bb7, label %bb
1418
1419bb: ; preds = %entry
1420 %tmp6 = add i32 %b, %a ; <i32> [#uses=1]
1421 ret i32 %tmp6
1422
1423bb7: ; preds = %entry
1424 %tmp10 = sub i32 %a, %c ; <i32> [#uses=1]
1425 ret i32 %tmp10
1426}
1427
1428to:
1429
1430_foo:
1431 cmpb $0, 16(%esp)
1432 movl 12(%esp), %ecx
1433 movl 8(%esp), %eax
1434 movl 4(%esp), %edx
1435 je LBB1_2 # bb7
1436LBB1_1: # bb
1437 addl %edx, %eax
1438 ret
1439LBB1_2: # bb7
1440 movl %edx, %eax
1441 subl %ecx, %eax
1442 ret
1443
Gabor Greif02661592008-03-06 10:51:21 +00001444The coalescer could coalesce "edx" with "eax" to avoid the movl in LBB1_2
Chris Lattner44a98ac2008-02-28 04:52:59 +00001445if it commuted the addl in LBB1_1.
1446
1447//===---------------------------------------------------------------------===//
Evan Cheng921dcba2008-03-28 07:07:06 +00001448
1449See rdar://4653682.
1450
1451From flops:
1452
1453LBB1_15: # bb310
1454 cvtss2sd LCPI1_0, %xmm1
1455 addsd %xmm1, %xmm0
1456 movsd 176(%esp), %xmm2
1457 mulsd %xmm0, %xmm2
1458 movapd %xmm2, %xmm3
1459 mulsd %xmm3, %xmm3
1460 movapd %xmm3, %xmm4
1461 mulsd LCPI1_23, %xmm4
1462 addsd LCPI1_24, %xmm4
1463 mulsd %xmm3, %xmm4
1464 addsd LCPI1_25, %xmm4
1465 mulsd %xmm3, %xmm4
1466 addsd LCPI1_26, %xmm4
1467 mulsd %xmm3, %xmm4
1468 addsd LCPI1_27, %xmm4
1469 mulsd %xmm3, %xmm4
1470 addsd LCPI1_28, %xmm4
1471 mulsd %xmm3, %xmm4
1472 addsd %xmm1, %xmm4
1473 mulsd %xmm2, %xmm4
1474 movsd 152(%esp), %xmm1
1475 addsd %xmm4, %xmm1
1476 movsd %xmm1, 152(%esp)
1477 incl %eax
1478 cmpl %eax, %esi
1479 jge LBB1_15 # bb310
1480LBB1_16: # bb358.loopexit
1481 movsd 152(%esp), %xmm0
1482 addsd %xmm0, %xmm0
1483 addsd LCPI1_22, %xmm0
1484 movsd %xmm0, 152(%esp)
1485
Rather than spilling the result of the last addsd in the loop, we should
insert a copy to split the interval (one for the duration of the loop, one
extending to the fall through). The register pressure in the loop isn't high
enough to warrant the spill.
1490
1491Also check why xmm7 is not used at all in the function.
Chris Lattner16e5c782008-04-21 04:46:30 +00001492
1493//===---------------------------------------------------------------------===//
1494
1495Legalize loses track of the fact that bools are always zero extended when in
1496memory. This causes us to compile abort_gzip (from 164.gzip) from:
1497
1498target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
1499target triple = "i386-apple-darwin8"
1500@in_exit.4870.b = internal global i1 false ; <i1*> [#uses=2]
1501define fastcc void @abort_gzip() noreturn nounwind {
1502entry:
1503 %tmp.b.i = load i1* @in_exit.4870.b ; <i1> [#uses=1]
1504 br i1 %tmp.b.i, label %bb.i, label %bb4.i
1505bb.i: ; preds = %entry
1506 tail call void @exit( i32 1 ) noreturn nounwind
1507 unreachable
1508bb4.i: ; preds = %entry
1509 store i1 true, i1* @in_exit.4870.b
1510 tail call void @exit( i32 1 ) noreturn nounwind
1511 unreachable
1512}
1513declare void @exit(i32) noreturn nounwind
1514
1515into:
1516
1517_abort_gzip:
1518 subl $12, %esp
1519 movb _in_exit.4870.b, %al
1520 notb %al
1521 testb $1, %al
1522 jne LBB1_2 ## bb4.i
1523LBB1_1: ## bb.i
1524 ...
1525
1526//===---------------------------------------------------------------------===//
Chris Lattner7cb1d332008-05-05 23:19:45 +00001527
1528We compile:
1529
1530int test(int x, int y) {
1531 return x-y-1;
1532}
1533
1534into (-m64):
1535
1536_test:
1537 decl %edi
1538 movl %edi, %eax
1539 subl %esi, %eax
1540 ret
1541
it would be better to codegen as: x+~y (notl+addl)
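
(In two's complement ~y == -y - 1, so x - y - 1 == x + ~y.) A sketch of the
sequence meant, using lea to fold the move and the add:

_test:
	notl	%esi
	leal	(%rsi,%rdi), %eax    # x + ~y == x - y - 1
	ret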
Edwin Törökfa9d5e22008-10-24 19:23:07 +00001543
1544//===---------------------------------------------------------------------===//
1545
1546This code:
1547
1548int foo(const char *str,...)
1549{
1550 __builtin_va_list a; int x;
1551 __builtin_va_start(a,str); x = __builtin_va_arg(a,int); __builtin_va_end(a);
1552 return x;
1553}
1554
1555gets compiled into this on x86-64:
1556 subq $200, %rsp
1557 movaps %xmm7, 160(%rsp)
1558 movaps %xmm6, 144(%rsp)
1559 movaps %xmm5, 128(%rsp)
1560 movaps %xmm4, 112(%rsp)
1561 movaps %xmm3, 96(%rsp)
1562 movaps %xmm2, 80(%rsp)
1563 movaps %xmm1, 64(%rsp)
1564 movaps %xmm0, 48(%rsp)
1565 movq %r9, 40(%rsp)
1566 movq %r8, 32(%rsp)
1567 movq %rcx, 24(%rsp)
1568 movq %rdx, 16(%rsp)
1569 movq %rsi, 8(%rsp)
1570 leaq (%rsp), %rax
1571 movq %rax, 192(%rsp)
1572 leaq 208(%rsp), %rax
1573 movq %rax, 184(%rsp)
1574 movl $48, 180(%rsp)
1575 movl $8, 176(%rsp)
1576 movl 176(%rsp), %eax
1577 cmpl $47, %eax
1578 jbe .LBB1_3 # bb
1579.LBB1_1: # bb3
1580 movq 184(%rsp), %rcx
1581 leaq 8(%rcx), %rax
1582 movq %rax, 184(%rsp)
1583.LBB1_2: # bb4
1584 movl (%rcx), %eax
1585 addq $200, %rsp
1586 ret
1587.LBB1_3: # bb
1588 movl %eax, %ecx
1589 addl $8, %eax
1590 addq 192(%rsp), %rcx
1591 movl %eax, 176(%rsp)
1592 jmp .LBB1_2 # bb4
1593
1594gcc 4.3 generates:
1595 subq $96, %rsp
1596.LCFI0:
1597 leaq 104(%rsp), %rax
1598 movq %rsi, -80(%rsp)
1599 movl $8, -120(%rsp)
1600 movq %rax, -112(%rsp)
1601 leaq -88(%rsp), %rax
1602 movq %rax, -104(%rsp)
1603 movl $8, %eax
1604 cmpl $48, %eax
1605 jb .L6
1606 movq -112(%rsp), %rdx
1607 movl (%rdx), %eax
1608 addq $96, %rsp
1609 ret
1610 .p2align 4,,10
1611 .p2align 3
1612.L6:
1613 mov %eax, %edx
1614 addq -104(%rsp), %rdx
1615 addl $8, %eax
1616 movl %eax, -120(%rsp)
1617 movl (%rdx), %eax
1618 addq $96, %rsp
1619 ret
1620
1621and it gets compiled into this on x86:
1622 pushl %ebp
1623 movl %esp, %ebp
1624 subl $4, %esp
1625 leal 12(%ebp), %eax
1626 movl %eax, -4(%ebp)
1627 leal 16(%ebp), %eax
1628 movl %eax, -4(%ebp)
1629 movl 12(%ebp), %eax
1630 addl $4, %esp
1631 popl %ebp
1632 ret
1633
1634gcc 4.3 generates:
1635 pushl %ebp
1636 movl %esp, %ebp
1637 movl 12(%ebp), %eax
1638 popl %ebp
1639 ret
Evan Chengbf97bec2008-11-11 17:35:52 +00001640
1641//===---------------------------------------------------------------------===//
1642
1643Teach tblgen not to check bitconvert source type in some cases. This allows us
1644to consolidate the following patterns in X86InstrMMX.td:
1645
1646def : Pat<(v2i32 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1647 (iPTR 0))))),
1648 (v2i32 (MMX_MOVDQ2Qrr VR128:$src))>;
1649def : Pat<(v4i16 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1650 (iPTR 0))))),
1651 (v4i16 (MMX_MOVDQ2Qrr VR128:$src))>;
1652def : Pat<(v8i8 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1653 (iPTR 0))))),
1654 (v8i8 (MMX_MOVDQ2Qrr VR128:$src))>;
1655
1656There are other cases in various td files.
Eli Friedman9ab1db02008-11-30 07:52:27 +00001657
1658//===---------------------------------------------------------------------===//
1659
1660Take something like the following on x86-32:
1661unsigned a(unsigned long long x, unsigned y) {return x % y;}
1662
1663We currently generate a libcall, but we really shouldn't: the expansion is
1664shorter and likely faster than the libcall. The expected code is something
1665like the following:
1666
1667 movl 12(%ebp), %eax
1668 movl 16(%ebp), %ecx
1669 xorl %edx, %edx
1670 divl %ecx
1671 movl 8(%ebp), %eax
1672 divl %ecx
1673 movl %edx, %eax
1674 ret
1675
1676A similar code sequence works for division.
1677
1678//===---------------------------------------------------------------------===//
Chris Lattnerbfccda62008-12-06 22:49:05 +00001679
1680These should compile to the same code, but the later codegen's to useless
1681instructions on X86. This may be a trivial dag combine (GCC PR7061):
1682
1683struct s1 { unsigned char a, b; };
1684unsigned long f1(struct s1 x) {
1685 return x.a + x.b;
1686}
1687struct s2 { unsigned a: 8, b: 8; };
1688unsigned long f2(struct s2 x) {
1689 return x.a + x.b;
1690}
1691
1692//===---------------------------------------------------------------------===//
1693
Chris Lattner9f34dc62009-02-08 20:44:19 +00001694We currently compile this:
1695
1696define i32 @func1(i32 %v1, i32 %v2) nounwind {
1697entry:
1698 %t = call {i32, i1} @llvm.sadd.with.overflow.i32(i32 %v1, i32 %v2)
1699 %sum = extractvalue {i32, i1} %t, 0
1700 %obit = extractvalue {i32, i1} %t, 1
1701 br i1 %obit, label %overflow, label %normal
1702normal:
1703 ret i32 %sum
1704overflow:
1705 call void @llvm.trap()
1706 unreachable
1707}
1708declare {i32, i1} @llvm.sadd.with.overflow.i32(i32, i32)
1709declare void @llvm.trap()
1710
1711to:
1712
1713_func1:
1714 movl 4(%esp), %eax
1715 addl 8(%esp), %eax
1716 jo LBB1_2 ## overflow
1717LBB1_1: ## normal
1718 ret
1719LBB1_2: ## overflow
1720 ud2
1721
1722it would be nice to produce "into" someday.
1723
1724//===---------------------------------------------------------------------===//
Chris Lattner09c650b2009-02-17 01:16:14 +00001725
1726This code:
1727
1728void vec_mpys1(int y[], const int x[], int scaler) {
1729int i;
1730for (i = 0; i < 150; i++)
1731 y[i] += (((long long)scaler * (long long)x[i]) >> 31);
1732}
1733
1734Compiles to this loop with GCC 3.x:
1735
1736.L5:
1737 movl %ebx, %eax
1738 imull (%edi,%ecx,4)
1739 shrdl $31, %edx, %eax
1740 addl %eax, (%esi,%ecx,4)
1741 incl %ecx
1742 cmpl $149, %ecx
1743 jle .L5
1744
1745llvm-gcc compiles it to the much uglier:
1746
1747LBB1_1: ## bb1
1748 movl 24(%esp), %eax
1749 movl (%eax,%edi,4), %ebx
1750 movl %ebx, %ebp
1751 imull %esi, %ebp
1752 movl %ebx, %eax
1753 mull %ecx
1754 addl %ebp, %edx
1755 sarl $31, %ebx
1756 imull %ecx, %ebx
1757 addl %edx, %ebx
1758 shldl $1, %eax, %ebx
1759 movl 20(%esp), %eax
1760 addl %ebx, (%eax,%edi,4)
1761 incl %edi
1762 cmpl $150, %edi
1763 jne LBB1_1 ## bb1
1764
1765//===---------------------------------------------------------------------===//
Chris Lattner8eca7c72009-03-08 01:54:43 +00001766
Test instructions can be eliminated by using EFLAGS values from arithmetic
instructions. This is currently not done for mul, and, or, xor, neg, shl,
sra, srl, shld, shrd, atomic ops, and others. It is also currently not done
for read-modify-write instructions. It is also currently not done if the
OF or CF flags are needed.

The shift operators have the complication that when the shift count is
zero, EFLAGS is not set, so they can only subsume a test instruction if
the shift count is known to be non-zero. Also, using the EFLAGS value
from a shift is apparently very slow on some x86 implementations.

In read-modify-write instructions, the root node in the isel match is
the store, and isel has no way for the use of the EFLAGS result of the
arithmetic to be remapped to the new node.

Add and subtract instructions set OF on signed overflow and CF on unsigned
overflow, while test instructions always clear OF and CF. In order to
replace a test with an add or subtract in a situation where OF or CF is
needed, codegen must be able to prove that the operation cannot see
signed or unsigned overflow, respectively.
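
A minimal example of the 'and' case (hypothetical source; x86-64 argument
registers assumed). The second instruction is the kind of test that could be
eliminated:

int f(int x, int y) { return (x & y) ? 1 : 0; }

	andl	%esi, %edi           # already sets ZF according to x & y
	testl	%edi, %edi           # redundant: the ZF from the andl could be reused
	setne	%al
	movzbl	%al, %eax
	ret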
1787
Dan Gohman5edb7382009-03-09 23:47:02 +00001788//===---------------------------------------------------------------------===//
1789
Chris Lattnerfe8b5592009-03-08 03:04:26 +00001790memcpy/memmove do not lower to SSE copies when possible. A silly example is:
1791define <16 x float> @foo(<16 x float> %A) nounwind {
1792 %tmp = alloca <16 x float>, align 16
1793 %tmp2 = alloca <16 x float>, align 16
1794 store <16 x float> %A, <16 x float>* %tmp
1795 %s = bitcast <16 x float>* %tmp to i8*
1796 %s2 = bitcast <16 x float>* %tmp2 to i8*
1797 call void @llvm.memcpy.i64(i8* %s, i8* %s2, i64 64, i32 16)
1798 %R = load <16 x float>* %tmp2
1799 ret <16 x float> %R
1800}
1801
1802declare void @llvm.memcpy.i64(i8* nocapture, i8* nocapture, i64, i32) nounwind
1803
1804which compiles to:
1805
1806_foo:
1807 subl $140, %esp
1808 movaps %xmm3, 112(%esp)
1809 movaps %xmm2, 96(%esp)
1810 movaps %xmm1, 80(%esp)
1811 movaps %xmm0, 64(%esp)
1812 movl 60(%esp), %eax
1813 movl %eax, 124(%esp)
1814 movl 56(%esp), %eax
1815 movl %eax, 120(%esp)
1816 movl 52(%esp), %eax
1817 <many many more 32-bit copies>
1818 movaps (%esp), %xmm0
1819 movaps 16(%esp), %xmm1
1820 movaps 32(%esp), %xmm2
1821 movaps 48(%esp), %xmm3
1822 addl $140, %esp
1823 ret
1824
1825On Nehalem, it may even be cheaper to just use movups when unaligned than to
1826fall back to lower-granularity chunks.
1827
1828//===---------------------------------------------------------------------===//
Chris Lattnere00a4e02009-05-25 20:28:19 +00001829
1830Implement processor-specific optimizations for parity with GCC on these
1831processors. GCC does two optimizations:
1832
1. ix86_pad_returns inserts a noop before ret instructions if immediately
   preceded by a conditional branch or is the target of a jump.
18352. ix86_avoid_jump_misspredicts inserts noops in cases where a 16-byte block of
1836 code contains more than 3 branches.
1837
1838The first one is done for all AMDs, Core2, and "Generic"
1839The second one is done for: Atom, Pentium Pro, all AMDs, Pentium 4, Nocona,
1840 Core 2, and "Generic"
1841
1842//===---------------------------------------------------------------------===//
Eli Friedmandd959242009-06-11 23:07:04 +00001843
1844Testcase:
1845int a(int x) { return (x & 127) > 31; }
1846
1847Current output:
1848 movl 4(%esp), %eax
1849 andl $127, %eax
1850 cmpl $31, %eax
1851 seta %al
1852 movzbl %al, %eax
1853 ret
1854
1855Ideal output:
1856 xorl %eax, %eax
1857 testl $96, 4(%esp)
1858 setne %al
1859 ret
1860
We could do this transformation in instcombine, but it's only clearly
beneficial on platforms with a test instruction.
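
Spelled out in C (a2 is a hypothetical name): (x & 127) > 31 holds exactly when
bit 5 or bit 6 survives the mask, so the suggested transform is:

int a(int x)  { return (x & 127) > 31; }
int a2(int x) { return (x & 96) != 0; }   // 96 == 32 | 64; equivalent for all x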
1863
1864//===---------------------------------------------------------------------===//

Testcase:
1866int x(int a) { return (a&0xf0)>>4; }
1867
1868Current output:
1869 movl 4(%esp), %eax
1870 shrl $4, %eax
1871 andl $15, %eax
1872 ret
1873
1874Ideal output:
1875 movzbl 4(%esp), %eax
1876 shrl $4, %eax
1877 ret
1878
1879//===---------------------------------------------------------------------===//
1880
1881Testcase:
1882int x(int a) { return (a & 0x80) ? 0x100 : 0; }
1883
1884Current output:
1885 testl $128, 4(%esp)
1886 setne %al
1887 movzbl %al, %eax
1888 shll $8, %eax
1889 ret
1890
1891Ideal output:
1892 movl 4(%esp), %eax
1893 addl %eax, %eax
1894 andl $256, %eax
1895 ret
1896
1897We generally want to fold shifted tests of a single bit into a shift+and on x86.
1898
1899//===---------------------------------------------------------------------===//