Dan Gohmanf17a25c2007-07-18 16:29:46 +00001//===---------------------------------------------------------------------===//
2// Random ideas for the X86 backend.
3//===---------------------------------------------------------------------===//
4
We should add support for the "movbe" instruction, which does a byte-swapping
copy (3-addr bswap + memory support?). This is available on Atom processors.
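
For example (a sketch; the function name is illustrative and assumes
__builtin_bswap32 as the source form):

unsigned load_be32(unsigned *p) {
  return __builtin_bswap32(*p);   /* could be a single: movbe (%ecx), %eax */
}                                 /* instead of a movl followed by bswapl  */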
Dan Gohmanf17a25c2007-07-18 16:29:46 +00007
8//===---------------------------------------------------------------------===//
9
Dan Gohmanf17a25c2007-07-18 16:29:46 +000010CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move. The X86
11backend knows how to three-addressify this shift, but it appears the register
12allocator isn't even asking it to do so in this case. We should investigate
why this isn't happening; it could have significant impact on other important
14cases for X86 as well.
15
16//===---------------------------------------------------------------------===//
17
18This should be one DIV/IDIV instruction, not a libcall:
19
20unsigned test(unsigned long long X, unsigned Y) {
21 return X/Y;
22}
23
24This can be done trivially with a custom legalizer. What about overflow
25though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
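
A sketch of the desired code, assuming the quotient fits in 32 bits (if it
does not, divl faults, which is exactly the overflow question above):

_test:
	movl 4(%esp), %eax      # low 32 bits of X
	movl 8(%esp), %edx      # high 32 bits of X
	divl 12(%esp)           # edx:eax / Y, quotient -> eax
	ret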
26
27//===---------------------------------------------------------------------===//
28
29Improvements to the multiply -> shift/add algorithm:
30http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
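
Illustrative examples of the kind of decompositions involved (using LEA's
scaled addressing; the exact sequences chosen should be cost driven):

int mul9(int x)  { return x * 9;  }   /* leal (%eax,%eax,8), %eax */
int mul45(int x) { return x * 45; }   /* leal (%eax,%eax,8), %eax */
                                      /* leal (%eax,%eax,4), %eax */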
31
32//===---------------------------------------------------------------------===//
33
34Improve code like this (occurs fairly frequently, e.g. in LLVM):
35long long foo(int x) { return 1LL << x; }
36
37http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
38http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
39http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html
40
41Another useful one would be ~0ULL >> X and ~0ULL << X.
42
43One better solution for 1LL << x is:
44 xorl %eax, %eax
45 xorl %edx, %edx
46 testb $32, %cl
47 sete %al
48 setne %dl
49 sall %cl, %eax
50 sall %cl, %edx
51
52But that requires good 8-bit subreg support.
53
Eli Friedman577c7492008-02-21 21:16:49 +000054Also, this might be better. It's an extra shift, but it's one instruction
55shorter, and doesn't stress 8-bit subreg support.
56(From http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html,
57but without the unnecessary and.)
58 movl %ecx, %eax
59 shrl $5, %eax
60 movl %eax, %edx
61 xorl $1, %edx
62 sall %cl, %eax
 sall %cl, %edx
64
Dan Gohmanf17a25c2007-07-18 16:29:46 +00006564-bit shifts (in general) expand to really bad code. Instead of using
66cmovs, we should expand to a conditional branch like GCC produces.
67
68//===---------------------------------------------------------------------===//
69
70Compile this:
71_Bool f(_Bool a) { return a!=1; }
72
73into:
74 movzbl %dil, %eax
75 xorl $1, %eax
76 ret
77
Eli Friedman577c7492008-02-21 21:16:49 +000078(Although note that this isn't a legal way to express the code that llvm-gcc
79currently generates for that function.)
80
Dan Gohmanf17a25c2007-07-18 16:29:46 +000081//===---------------------------------------------------------------------===//
82
83Some isel ideas:
84
1. Dynamic programming based approach when compile time is not an
   issue.
872. Code duplication (addressing mode) during isel.
883. Other ideas from "Register-Sensitive Selection, Duplication, and
89 Sequencing of Instructions".
904. Scheduling for reduced register pressure. E.g. "Minimum Register
91 Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
92 and other related papers.
93 http://citeseer.ist.psu.edu/govindarajan01minimum.html
94
95//===---------------------------------------------------------------------===//
96
97Should we promote i16 to i32 to avoid partial register update stalls?
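
For example, for an i16 increment (a sketch; the exact stall behavior is
microarchitecture dependent):

unsigned short inc16(unsigned short x) { return x + 1; }

	# i16 arithmetic writes only %ax, a partial update of %eax that
	# depends on the old %eax value:
	movw    4(%esp), %ax
	incw    %ax
	# promoted to i32: full-register writes, no partial-register stall:
	movzwl  4(%esp), %eax
	incl    %eax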
98
99//===---------------------------------------------------------------------===//
100
101Leave any_extend as pseudo instruction and hint to register
102allocator. Delay codegen until post register allocation.
Note: any_extend is now turned into an INSERT_SUBREG. We still need to teach
104the coalescer how to deal with it though.
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000105
106//===---------------------------------------------------------------------===//
107
It appears icc uses push for parameter passing. Need to investigate.
109
110//===---------------------------------------------------------------------===//
111
112Only use inc/neg/not instructions on processors where they are faster than
113add/sub/xor. They are slower on the P4 due to only updating some processor
114flags.
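
For example, on the P4 the first form below is slower because inc leaves CF
untouched and so merges with the previous flags value instead of overwriting
all flags:

	incl    %eax            # partial flag update (CF preserved)
	addl    $1, %eax        # writes all flags: preferred on P4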
115
116//===---------------------------------------------------------------------===//
117
The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
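
For example (a sketch in the typed-pointer IR used elsewhere in this file):

  %v = load i32* %p
  %c = icmp slt i32 %v, %r      ; load on the LHS: does not match (cmp reg, (load p))

could be turned into the equivalent of

  %v = load i32* %p
  %c = icmp sgt i32 %r, %v      ; load on the RHS: folds into the compare

by swapping the operands and using the swapped predicate.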
123
124//===---------------------------------------------------------------------===//
125
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000126In many cases, LLVM generates code like this:
127
128_test:
129 movl 8(%esp), %eax
130 cmpl %eax, 4(%esp)
131 setl %al
132 movzbl %al, %eax
133 ret
134
On some processors (which ones?), it is more efficient to do this:
136
137_test:
138 movl 8(%esp), %ebx
139 xor %eax, %eax
140 cmpl %ebx, 4(%esp)
141 setl %al
142 ret
143
144Doing this correctly is tricky though, as the xor clobbers the flags.
145
146//===---------------------------------------------------------------------===//
147
148We should generate bts/btr/etc instructions on targets where they are cheap or
149when codesize is important. e.g., for:
150
151void setbit(int *target, int bit) {
152 *target |= (1 << bit);
153}
154void clearbit(int *target, int bit) {
155 *target &= ~(1 << bit);
156}
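
For example, we would like something along these lines (a sketch; it assumes
the bit index is in range, so the lack of masking in the bt family doesn't
matter):

_setbit:
	movl    4(%esp), %eax
	movl    8(%esp), %ecx
	btsl    %ecx, (%eax)
	ret
_clearbit:
	movl    4(%esp), %eax
	movl    8(%esp), %ecx
	btrl    %ecx, (%eax)
	ret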
157
158//===---------------------------------------------------------------------===//
159
160Instead of the following for memset char*, 1, 10:
161
162 movl $16843009, 4(%edx)
163 movl $16843009, (%edx)
164 movw $257, 8(%edx)
165
166It might be better to generate
167
168 movl $16843009, %eax
169 movl %eax, 4(%edx)
170 movl %eax, (%edx)
 movw %ax, 8(%edx)
172
173when we can spare a register. It reduces code size.
174
175//===---------------------------------------------------------------------===//
176
177Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
178get this:
179
Eli Friedman1aa1f2c2008-02-28 00:21:43 +0000180define i32 @test1(i32 %X) {
181 %Y = sdiv i32 %X, 8
182 ret i32 %Y
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000183}
184
185_test1:
186 movl 4(%esp), %eax
187 movl %eax, %ecx
188 sarl $31, %ecx
189 shrl $29, %ecx
190 addl %ecx, %eax
191 sarl $3, %eax
192 ret
193
194GCC knows several different ways to codegen it, one of which is this:
195
196_test1:
197 movl 4(%esp), %eax
198 cmpl $-1, %eax
199 leal 7(%eax), %ecx
200 cmovle %ecx, %eax
201 sarl $3, %eax
202 ret
203
204which is probably slower, but it's interesting at least :)
205
206//===---------------------------------------------------------------------===//
207
We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl. We should leave these as libcalls for everything over a much lower
threshold, since libc is hand tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc.).
212
213//===---------------------------------------------------------------------===//
214
215Optimize this into something reasonable:
216 x * copysign(1.0, y) * copysign(1.0, z)
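
One candidate lowering (a sketch; the name is illustrative, and it ignores
that the multiplies would quiet signaling NaNs and can raise flags) is to xor
the sign bits of y and z into x:

#include <string.h>
double fold_copysigns(double x, double y, double z) {
  unsigned long long xi, yi, zi;
  memcpy(&xi, &x, 8);
  memcpy(&yi, &y, 8);
  memcpy(&zi, &z, 8);
  xi ^= (yi ^ zi) & 0x8000000000000000ULL;  /* flip sign(x) by sign(y)^sign(z) */
  memcpy(&x, &xi, 8);
  return x;
}

In SSE this is roughly an xorpd plus an andpd with a sign-mask constant plus a
final xorpd.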
217
218//===---------------------------------------------------------------------===//
219
220Optimize copysign(x, *y) to use an integer load from y.
221
222//===---------------------------------------------------------------------===//
223
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000224The following tests perform worse with LSR:
225
226lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.
227
228//===---------------------------------------------------------------------===//
229
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000230Adding to the list of cmp / test poor codegen issues:
231
232int test(__m128 *A, __m128 *B) {
233 if (_mm_comige_ss(*A, *B))
234 return 3;
235 else
236 return 4;
237}
238
239_test:
240 movl 8(%esp), %eax
241 movaps (%eax), %xmm0
242 movl 4(%esp), %eax
243 movaps (%eax), %xmm1
244 comiss %xmm0, %xmm1
245 setae %al
246 movzbl %al, %ecx
247 movl $3, %eax
248 movl $4, %edx
249 cmpl $0, %ecx
250 cmove %edx, %eax
251 ret
252
Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There
are a number of issues. 1) We are introducing a setcc between the result of the
intrinsic call and select. 2) The intrinsic is expected to produce an i32 value
so an any_extend (which becomes a zero extend) is added.
257
258We probably need some kind of target DAG combine hook to fix this.
259
260//===---------------------------------------------------------------------===//
261
262We generate significantly worse code for this than GCC:
263http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
264http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701
265
There is also one case where we do worse on PPC.
267
268//===---------------------------------------------------------------------===//
269
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000270For this:
271
272int test(int a)
273{
274 return a * 3;
275}
276
We currently emit:
278 imull $3, 4(%esp), %eax
279
Perhaps this is what we really should generate? Is imull three or four
cycles? Note: ICC generates this:
282 movl 4(%esp), %eax
283 leal (%eax,%eax,2), %eax
284
285The current instruction priority is based on pattern complexity. The former is
286more "complex" because it folds a load so the latter will not be emitted.
287
288Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
289should always try to match LEA first since the LEA matching code does some
290estimate to determine whether the match is profitable.
291
292However, if we care more about code size, then imull is better. It's two bytes
293shorter than movl + leal.
294
Eli Friedman9ab1db02008-11-30 07:52:27 +0000295On a Pentium M, both variants have the same characteristics with regard
296to throughput; however, the multiplication has a latency of four cycles, as
297opposed to two cycles for the movl+lea variant.
298
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000299//===---------------------------------------------------------------------===//
300
Eli Friedman577c7492008-02-21 21:16:49 +0000301__builtin_ffs codegen is messy.
Chris Lattnera86af9a2007-08-11 18:19:07 +0000302
Chris Lattnera86af9a2007-08-11 18:19:07 +0000303int ffs_(unsigned X) { return __builtin_ffs(X); }
304
Eli Friedman577c7492008-02-21 21:16:49 +0000305llvm produces:
306ffs_:
307 movl 4(%esp), %ecx
308 bsfl %ecx, %eax
309 movl $32, %edx
310 cmove %edx, %eax
311 incl %eax
312 xorl %edx, %edx
313 testl %ecx, %ecx
314 cmove %edx, %eax
Chris Lattnera86af9a2007-08-11 18:19:07 +0000315 ret
Eli Friedman577c7492008-02-21 21:16:49 +0000316
317vs gcc:
318
Chris Lattnera86af9a2007-08-11 18:19:07 +0000319_ffs_:
320 movl $-1, %edx
321 bsfl 4(%esp), %eax
322 cmove %edx, %eax
323 addl $1, %eax
324 ret
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000325
Eli Friedman577c7492008-02-21 21:16:49 +0000326Another example of __builtin_ffs (use predsimplify to eliminate a select):
327
328int foo (unsigned long j) {
329 if (j)
330 return __builtin_ffs (j) - 1;
331 else
332 return 0;
333}
334
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000335//===---------------------------------------------------------------------===//
336
It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.
342
343//===---------------------------------------------------------------------===//
344
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000345define i32 @foo(i32* %a, i32 %t) {
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000346entry:
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000347 br label %cond_true
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000348
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000349cond_true: ; preds = %cond_true, %entry
350 %x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ] ; <i32> [#uses=3]
351 %t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ] ; <i32> [#uses=1]
352 %tmp2 = getelementptr i32* %a, i32 %x.0.0 ; <i32*> [#uses=1]
353 %tmp3 = load i32* %tmp2 ; <i32> [#uses=1]
354 %tmp5 = add i32 %t_addr.0.0, %x.0.0 ; <i32> [#uses=1]
355 %tmp7 = add i32 %tmp5, %tmp3 ; <i32> [#uses=2]
356 %tmp9 = add i32 %x.0.0, 1 ; <i32> [#uses=2]
357 %tmp = icmp sgt i32 %tmp9, 39 ; <i1> [#uses=1]
358 br i1 %tmp, label %bb12, label %cond_true
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000359
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000360bb12: ; preds = %cond_true
361 ret i32 %tmp7
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000362}
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000363is pessimized by -loop-reduce and -indvars
364
365//===---------------------------------------------------------------------===//
366
367u32 to float conversion improvement:
368
369float uint32_2_float( unsigned u ) {
370 float fl = (int) (u & 0xffff);
371 float fh = (int) (u >> 16);
372 fh *= 0x1.0p16f;
373 return fh + fl;
374}
375
37600000000 subl $0x04,%esp
37700000003 movl 0x08(%esp,1),%eax
37800000007 movl %eax,%ecx
37900000009 shrl $0x10,%ecx
3800000000c cvtsi2ss %ecx,%xmm0
38100000010 andl $0x0000ffff,%eax
38200000015 cvtsi2ss %eax,%xmm1
38300000019 mulss 0x00000078,%xmm0
38400000021 addss %xmm1,%xmm0
38500000025 movss %xmm0,(%esp,1)
3860000002a flds (%esp,1)
3870000002d addl $0x04,%esp
38800000030 ret
389
390//===---------------------------------------------------------------------===//
391
When using the fastcc ABI, align the stack slot of an argument of type double
on an 8-byte boundary to improve performance.
394
395//===---------------------------------------------------------------------===//
396
397Codegen:
398
399int f(int a, int b) {
400 if (a == 4 || a == 6)
401 b++;
402 return b;
403}
404
405
406as:
407
408or eax, 2
409cmp eax, 6
410jz label
411
412//===---------------------------------------------------------------------===//
413
414GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
415simplifications for integer "x cmp y ? a : b". For example, instead of:
416
417int G;
418void f(int X, int Y) {
419 G = X < 0 ? 14 : 13;
420}
421
422compiling to:
423
424_f:
425 movl $14, %eax
426 movl $13, %ecx
427 movl 4(%esp), %edx
428 testl %edx, %edx
429 cmovl %eax, %ecx
430 movl %ecx, _G
431 ret
432
433it could be:
434_f:
435 movl 4(%esp), %eax
436 sarl $31, %eax
437 notl %eax
438 addl $14, %eax
439 movl %eax, _G
440 ret
441
442etc.
443
Chris Lattnere7037c22007-11-02 17:04:20 +0000444Another is:
445int usesbb(unsigned int a, unsigned int b) {
446 return (a < b ? -1 : 0);
447}
448to:
449_usesbb:
450 movl 8(%esp), %eax
451 cmpl %eax, 4(%esp)
452 sbbl %eax, %eax
453 ret
454
455instead of:
456_usesbb:
457 xorl %eax, %eax
458 movl 8(%esp), %ecx
459 cmpl %ecx, 4(%esp)
460 movl $4294967295, %ecx
461 cmovb %ecx, %eax
462 ret
463
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000464//===---------------------------------------------------------------------===//
465
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000466Consider the expansion of:
467
Chris Lattnerbea5feb2008-02-14 06:19:02 +0000468define i32 @test3(i32 %X) {
469 %tmp1 = urem i32 %X, 255
470 ret i32 %tmp1
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000471}
472
473Currently it compiles to:
474
475...
476 movl $2155905153, %ecx
477 movl 8(%esp), %esi
478 movl %esi, %eax
479 mull %ecx
480...
481
482This could be "reassociated" into:
483
484 movl $2155905153, %eax
485 movl 8(%esp), %ecx
486 mull %ecx
487
488to avoid the copy. In fact, the existing two-address stuff would do this
489except that mul isn't a commutative 2-addr instruction. I guess this has
490to be done at isel time based on the #uses to mul?
491
492//===---------------------------------------------------------------------===//
493
494Make sure the instruction which starts a loop does not cross a cacheline
boundary. This requires knowing the exact length of each machine instruction.
496That is somewhat complicated, but doable. Example 256.bzip2:
497
498In the new trace, the hot loop has an instruction which crosses a cacheline
499boundary. In addition to potential cache misses, this can't help decoding as I
500imagine there has to be some kind of complicated decoder reset and realignment
501to grab the bytes from the next cacheline.
502
503532 532 0x3cfc movb (1809(%esp, %esi), %bl <<<--- spans 2 64 byte lines
Eli Friedman9ab1db02008-11-30 07:52:27 +0000504942 942 0x3d03 movl %dh, (1809(%esp, %esi)
505937 937 0x3d0a incl %esi
5063 3 0x3d0b cmpb %bl, %dl
Dan Gohmanf17a25c2007-07-18 16:29:46 +000050727 27 0x3d0d jnz 0x000062db <main+11707>
508
509//===---------------------------------------------------------------------===//
510
511In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.
512
513//===---------------------------------------------------------------------===//
514
515This could be a single 16-bit load.
516
517int f(char *p) {
518 if ((p[0] == 1) & (p[1] == 2)) return 1;
519 return 0;
520}
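
i.e. something like this (a sketch; $0x0201 assumes little-endian byte order,
and the possibly unaligned 16-bit load is fine on x86):

_f:
	movl    4(%esp), %eax
	movzwl  (%eax), %eax
	cmpl    $0x0201, %eax     # p[0] == 1 in the low byte, p[1] == 2 in the high
	sete    %al
	movzbl  %al, %eax
	ret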
521
522//===---------------------------------------------------------------------===//
523
524We should inline lrintf and probably other libc functions.
525
526//===---------------------------------------------------------------------===//
527
Dan Gohman52415eb2010-01-04 20:55:05 +0000528Use the FLAGS values from arithmetic instructions more. For example, compile:
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000529
530int add_zf(int *x, int y, int a, int b) {
531 if ((*x += y) == 0)
532 return a;
533 else
534 return b;
535}
536
537to:
538 addl %esi, (%rdi)
539 movl %edx, %eax
540 cmovne %ecx, %eax
541 ret
542instead of:
543
544_add_zf:
545 addl (%rdi), %esi
546 movl %esi, (%rdi)
547 testl %esi, %esi
548 cmove %edx, %ecx
549 movl %ecx, %eax
550 ret
551
Dan Gohman52415eb2010-01-04 20:55:05 +0000552As another example, compile function f2 in test/CodeGen/X86/cmp-test.ll
553without a test instruction.
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000554
555//===---------------------------------------------------------------------===//
556
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000557These two functions have identical effects:
558
559unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
560unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
561
562We currently compile them to:
563
564_f:
565 movl 4(%esp), %eax
566 movl %eax, %ecx
567 incl %ecx
568 movl 8(%esp), %edx
569 cmpl %edx, %ecx
570 jne LBB1_2 #UnifiedReturnBlock
571LBB1_1: #cond_true
572 addl $2, %eax
573 ret
574LBB1_2: #UnifiedReturnBlock
575 movl %ecx, %eax
576 ret
577_f2:
578 movl 4(%esp), %eax
579 movl %eax, %ecx
580 incl %ecx
581 cmpl 8(%esp), %ecx
582 sete %cl
583 movzbl %cl, %ecx
584 leal 1(%ecx,%eax), %eax
585 ret
586
587both of which are inferior to GCC's:
588
589_f:
590 movl 4(%esp), %edx
591 leal 1(%edx), %eax
592 addl $2, %edx
593 cmpl 8(%esp), %eax
594 cmove %edx, %eax
595 ret
596_f2:
597 movl 4(%esp), %eax
598 addl $1, %eax
599 xorl %edx, %edx
600 cmpl 8(%esp), %eax
601 sete %dl
602 addl %edx, %eax
603 ret
604
605//===---------------------------------------------------------------------===//
606
607This code:
608
609void test(int X) {
610 if (X) abort();
611}
612
613is currently compiled to:
614
615_test:
616 subl $12, %esp
617 cmpl $0, 16(%esp)
618 jne LBB1_1
619 addl $12, %esp
620 ret
621LBB1_1:
622 call L_abort$stub
623
624It would be better to produce:
625
626_test:
627 subl $12, %esp
628 cmpl $0, 16(%esp)
629 jne L_abort$stub
630 addl $12, %esp
631 ret
632
633This can be applied to any no-return function call that takes no arguments etc.
634Alternatively, the stack save/restore logic could be shrink-wrapped, producing
635something like this:
636
637_test:
638 cmpl $0, 4(%esp)
639 jne LBB1_1
640 ret
641LBB1_1:
642 subl $12, %esp
643 call L_abort$stub
644
645Both are useful in different situations. Finally, it could be shrink-wrapped
646and tail called, like this:
647
648_test:
649 cmpl $0, 4(%esp)
650 jne LBB1_1
651 ret
652LBB1_1:
653 pop %eax # realign stack.
654 call L_abort$stub
655
656Though this probably isn't worth it.
657
658//===---------------------------------------------------------------------===//
659
Dan Gohmanf17a25c2007-07-18 16:29:46 +0000660Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
661a neg instead of a sub instruction. Consider:
662
663int test(char X) { return 7-X; }
664
665we currently produce:
666_test:
667 movl $7, %eax
668 movsbl 4(%esp), %ecx
669 subl %ecx, %eax
670 ret
671
672We would use one fewer register if codegen'd as:
673
674 movsbl 4(%esp), %eax
675 neg %eax
676 add $7, %eax
677 ret
678
679Note that this isn't beneficial if the load can be folded into the sub. In
680this case, we want a sub:
681
682int test(int X) { return 7-X; }
683_test:
684 movl $7, %eax
685 subl 4(%esp), %eax
686 ret
687
688//===---------------------------------------------------------------------===//
689
Chris Lattner32f65872007-08-20 02:14:33 +0000690Leaf functions that require one 4-byte spill slot have a prolog like this:
691
692_foo:
693 pushl %esi
694 subl $4, %esp
695...
696and an epilog like this:
697 addl $4, %esp
698 popl %esi
699 ret
700
701It would be smaller, and potentially faster, to push eax on entry and to
702pop into a dummy register instead of using addl/subl of esp. Just don't pop
703into any return registers :)
704
705//===---------------------------------------------------------------------===//
Chris Lattner44b03cb2007-08-23 15:22:07 +0000706
707The X86 backend should fold (branch (or (setcc, setcc))) into multiple
708branches. We generate really poor code for:
709
710double testf(double a) {
711 return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
712}
713
714For example, the entry BB is:
715
716_testf:
717 subl $20, %esp
718 pxor %xmm0, %xmm0
719 movsd 24(%esp), %xmm1
720 ucomisd %xmm0, %xmm1
721 setnp %al
722 sete %cl
723 testb %cl, %al
724 jne LBB1_5 # UnifiedReturnBlock
725LBB1_1: # cond_true
726
727
728it would be better to replace the last four instructions with:
729
730 jp LBB1_1
731 je LBB1_5
732LBB1_1:
733
734We also codegen the inner ?: into a diamond:
735
736 cvtss2sd LCPI1_0(%rip), %xmm2
737 cvtss2sd LCPI1_1(%rip), %xmm3
738 ucomisd %xmm1, %xmm0
739 ja LBB1_3 # cond_true
740LBB1_2: # cond_true
741 movapd %xmm3, %xmm2
742LBB1_3: # cond_true
743 movapd %xmm2, %xmm0
744 ret
745
746We should sink the load into xmm3 into the LBB1_2 block. This should
747be pretty easy, and will nuke all the copies.
748
749//===---------------------------------------------------------------------===//
Chris Lattner4084d492007-09-10 21:43:18 +0000750
751This:
752 #include <algorithm>
753 inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
754 { return std::make_pair(a + b, a + b < a); }
755 bool no_overflow(unsigned a, unsigned b)
756 { return !full_add(a, b).second; }
757
758Should compile to:
759
760
761 _Z11no_overflowjj:
762 addl %edi, %esi
763 setae %al
764 ret
765
Eli Friedman577c7492008-02-21 21:16:49 +0000766FIXME: That code looks wrong; bool return is normally defined as zext.
767
Chris Lattner4084d492007-09-10 21:43:18 +0000768on x86-64, not:
769
770__Z11no_overflowjj:
771 addl %edi, %esi
772 cmpl %edi, %esi
773 setae %al
774 movzbl %al, %eax
775 ret
776
777
778//===---------------------------------------------------------------------===//
Evan Cheng35127a62007-09-10 22:16:37 +0000779
Bill Wendling7f436dd2007-10-02 20:42:59 +0000780The following code:
781
Bill Wendlingc2036e32007-10-02 20:54:32 +0000782bb114.preheader: ; preds = %cond_next94
783 %tmp231232 = sext i16 %tmp62 to i32 ; <i32> [#uses=1]
784 %tmp233 = sub i32 32, %tmp231232 ; <i32> [#uses=1]
785 %tmp245246 = sext i16 %tmp65 to i32 ; <i32> [#uses=1]
786 %tmp252253 = sext i16 %tmp68 to i32 ; <i32> [#uses=1]
787 %tmp254 = sub i32 32, %tmp252253 ; <i32> [#uses=1]
788 %tmp553554 = bitcast i16* %tmp37 to i8* ; <i8*> [#uses=2]
789 %tmp583584 = sext i16 %tmp98 to i32 ; <i32> [#uses=1]
790 %tmp585 = sub i32 32, %tmp583584 ; <i32> [#uses=1]
791 %tmp614615 = sext i16 %tmp101 to i32 ; <i32> [#uses=1]
792 %tmp621622 = sext i16 %tmp104 to i32 ; <i32> [#uses=1]
793 %tmp623 = sub i32 32, %tmp621622 ; <i32> [#uses=1]
794 br label %bb114
795
796produces:
797
Bill Wendling7f436dd2007-10-02 20:42:59 +0000798LBB3_5: # bb114.preheader
799 movswl -68(%ebp), %eax
800 movl $32, %ecx
801 movl %ecx, -80(%ebp)
802 subl %eax, -80(%ebp)
803 movswl -52(%ebp), %eax
804 movl %ecx, -84(%ebp)
805 subl %eax, -84(%ebp)
806 movswl -70(%ebp), %eax
807 movl %ecx, -88(%ebp)
808 subl %eax, -88(%ebp)
809 movswl -50(%ebp), %eax
810 subl %eax, %ecx
811 movl %ecx, -76(%ebp)
812 movswl -42(%ebp), %eax
813 movl %eax, -92(%ebp)
814 movswl -66(%ebp), %eax
815 movl %eax, -96(%ebp)
816 movw $0, -98(%ebp)
817
Chris Lattner792bae52007-10-03 03:40:24 +0000818This appears to be bad because the RA is not folding the store to the stack
819slot into the movl. The above instructions could be:
820 movl $32, -80(%ebp)
821...
822 movl $32, -84(%ebp)
823...
824This seems like a cross between remat and spill folding.
825
Bill Wendlingc2036e32007-10-02 20:54:32 +0000826This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't
Bill Wendling7f436dd2007-10-02 20:42:59 +0000827change, so we could simply subtract %eax from %ecx first and then use %ecx (or
828vice-versa).
829
830//===---------------------------------------------------------------------===//
831
Bill Wendling54c4f832007-10-02 21:49:31 +0000832This code:
833
834 %tmp659 = icmp slt i16 %tmp654, 0 ; <i1> [#uses=1]
835 br i1 %tmp659, label %cond_true662, label %cond_next715
836
837produces this:
838
839 testw %cx, %cx
840 movswl %cx, %esi
841 jns LBB4_109 # cond_next715
842
843Shark tells us that using %cx in the testw instruction is sub-optimal. It
844suggests using the 32-bit register (which is what ICC uses).
845
846//===---------------------------------------------------------------------===//
Chris Lattner802c62a2007-10-03 17:10:03 +0000847
Chris Lattnerae259992007-10-04 15:47:27 +0000848We compile this:
849
850void compare (long long foo) {
851 if (foo < 4294967297LL)
852 abort();
853}
854
855to:
856
Eli Friedman577c7492008-02-21 21:16:49 +0000857compare:
858 subl $4, %esp
859 cmpl $0, 8(%esp)
Chris Lattnerae259992007-10-04 15:47:27 +0000860 setne %al
861 movzbw %al, %ax
Eli Friedman577c7492008-02-21 21:16:49 +0000862 cmpl $1, 12(%esp)
Chris Lattnerae259992007-10-04 15:47:27 +0000863 setg %cl
864 movzbw %cl, %cx
865 cmove %ax, %cx
Eli Friedman577c7492008-02-21 21:16:49 +0000866 testb $1, %cl
867 jne .LBB1_2 # UnifiedReturnBlock
868.LBB1_1: # ifthen
869 call abort
870.LBB1_2: # UnifiedReturnBlock
871 addl $4, %esp
872 ret
Chris Lattnerae259992007-10-04 15:47:27 +0000873
874(also really horrible code on ppc). This is due to the expand code for 64-bit
875compares. GCC produces multiple branches, which is much nicer:
876
Eli Friedman577c7492008-02-21 21:16:49 +0000877compare:
878 subl $12, %esp
879 movl 20(%esp), %edx
880 movl 16(%esp), %eax
881 decl %edx
882 jle .L7
883.L5:
884 addl $12, %esp
885 ret
886 .p2align 4,,7
887.L7:
888 jl .L4
Chris Lattnerae259992007-10-04 15:47:27 +0000889 cmpl $0, %eax
Eli Friedman577c7492008-02-21 21:16:49 +0000890 .p2align 4,,8
891 ja .L5
892.L4:
893 .p2align 4,,9
894 call abort
Chris Lattnerae259992007-10-04 15:47:27 +0000895
896//===---------------------------------------------------------------------===//
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000897
Tail call optimization improvements: Tail call optimization currently
pushes all arguments on the top of the stack (their normal place for
non-tail call optimized calls) that source from the caller's arguments
or that source from a virtual register (also possibly sourcing from
the caller's arguments).
This is done to prevent overwriting of parameters (see example
below) that might be used later.
Arnold Schwaighofer373e8652007-10-12 21:30:57 +0000905
906example:
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000907
908int callee(int32, int64);
909int caller(int32 arg1, int32 arg2) {
910 int64 local = arg2 * 2;
911 return callee(arg2, (int64)local);
912}
913
914[arg1] [!arg2 no longer valid since we moved local onto it]
915[arg2] -> [(int64)
916[RETADDR] local ]
917
Moving arg1 onto the stack slot of the callee function would overwrite
arg2 of the caller.
920
921Possible optimizations:
922
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000923
Arnold Schwaighofer373e8652007-10-12 21:30:57 +0000924 - Analyse the actual parameters of the callee to see which would
925 overwrite a caller parameter which is used by the callee and only
926 push them onto the top of the stack.
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000927
928 int callee (int32 arg1, int32 arg2);
929 int caller (int32 arg1, int32 arg2) {
930 return callee(arg1,arg2);
931 }
932
Arnold Schwaighofer373e8652007-10-12 21:30:57 +0000933 Here we don't need to write any variables to the top of the stack
934 since they don't overwrite each other.
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000935
936 int callee (int32 arg1, int32 arg2);
937 int caller (int32 arg1, int32 arg2) {
938 return callee(arg2,arg1);
939 }
940
Arnold Schwaighofer373e8652007-10-12 21:30:57 +0000941 Here we need to push the arguments because they overwrite each
942 other.
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000943
Arnold Schwaighofere2d6bbb2007-10-11 19:40:01 +0000944//===---------------------------------------------------------------------===//
Evan Cheng7f1ad6a2007-10-28 04:01:09 +0000945
946main ()
947{
948 int i = 0;
949 unsigned long int z = 0;
950
951 do {
952 z -= 0x00004000;
953 i++;
954 if (i > 0x00040000)
955 abort ();
956 } while (z > 0);
957 exit (0);
958}
959
960gcc compiles this to:
961
962_main:
963 subl $28, %esp
964 xorl %eax, %eax
965 jmp L2
966L3:
967 cmpl $262144, %eax
968 je L10
969L2:
970 addl $1, %eax
971 cmpl $262145, %eax
972 jne L3
973 call L_abort$stub
974L10:
975 movl $0, (%esp)
976 call L_exit$stub
977
978llvm:
979
980_main:
981 subl $12, %esp
982 movl $1, %eax
983 movl $16384, %ecx
984LBB1_1: # bb
985 cmpl $262145, %eax
986 jge LBB1_4 # cond_true
987LBB1_2: # cond_next
988 incl %eax
989 addl $4294950912, %ecx
990 cmpl $16384, %ecx
991 jne LBB1_1 # bb
992LBB1_3: # bb11
993 xorl %eax, %eax
994 addl $12, %esp
995 ret
996LBB1_4: # cond_true
997 call L_abort$stub
998
9991. LSR should rewrite the first cmp with induction variable %ecx.
10002. DAG combiner should fold
1001 leal 1(%eax), %edx
1002 cmpl $262145, %edx
1003 =>
1004 cmpl $262144, %eax
1005
1006//===---------------------------------------------------------------------===//
Chris Lattner358670b2007-11-24 06:13:33 +00001007
1008define i64 @test(double %X) {
1009 %Y = fptosi double %X to i64
1010 ret i64 %Y
1011}
1012
1013compiles to:
1014
1015_test:
1016 subl $20, %esp
1017 movsd 24(%esp), %xmm0
1018 movsd %xmm0, 8(%esp)
1019 fldl 8(%esp)
1020 fisttpll (%esp)
1021 movl 4(%esp), %edx
1022 movl (%esp), %eax
1023 addl $20, %esp
1024 #FP_REG_KILL
1025 ret
1026
1027This should just fldl directly from the input stack slot.
Chris Lattner10d54d12007-12-05 22:58:19 +00001028
1029//===---------------------------------------------------------------------===//
1030
1031This code:
1032int foo (int x) { return (x & 65535) | 255; }
1033
1034Should compile into:
1035
1036_foo:
1037 movzwl 4(%esp), %eax
Eli Friedman577c7492008-02-21 21:16:49 +00001038 orl $255, %eax
Chris Lattner10d54d12007-12-05 22:58:19 +00001039 ret
1040
1041instead of:
1042_foo:
1043 movl $255, %eax
1044 orl 4(%esp), %eax
1045 andl $65535, %eax
1046 ret
1047
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001048//===---------------------------------------------------------------------===//
1049
Chris Lattnereec7ac02008-02-21 06:51:29 +00001050We're codegen'ing multiply of long longs inefficiently:
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001051
Chris Lattnereec7ac02008-02-21 06:51:29 +00001052unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) {
1053 return arg1 * arg2;
1054}
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001055
Chris Lattnereec7ac02008-02-21 06:51:29 +00001056We compile to (fomit-frame-pointer):
1057
1058_LLM:
1059 pushl %esi
1060 movl 8(%esp), %ecx
1061 movl 16(%esp), %esi
1062 movl %esi, %eax
1063 mull %ecx
1064 imull 12(%esp), %esi
1065 addl %edx, %esi
1066 imull 20(%esp), %ecx
1067 movl %esi, %edx
1068 addl %ecx, %edx
1069 popl %esi
1070 ret
1071
1072This looks like a scheduling deficiency and lack of remat of the load from
1073the argument area. ICC apparently produces:
1074
1075 movl 8(%esp), %ecx
1076 imull 12(%esp), %ecx
1077 movl 16(%esp), %eax
1078 imull 4(%esp), %eax
1079 addl %eax, %ecx
1080 movl 4(%esp), %eax
1081 mull 12(%esp)
1082 addl %ecx, %edx
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001083 ret
1084
Chris Lattnereec7ac02008-02-21 06:51:29 +00001085Note that it remat'd loads from 4(esp) and 12(esp). See this GCC PR:
1086http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
Chris Lattnerd079b4e2007-12-18 16:48:14 +00001087
1088//===---------------------------------------------------------------------===//
1089
Chris Lattner2b55ebd2007-12-24 19:27:46 +00001090We can fold a store into "zeroing a reg". Instead of:
1091
1092xorl %eax, %eax
1093movl %eax, 124(%esp)
1094
1095we should get:
1096
1097movl $0, 124(%esp)
1098
1099if the flags of the xor are dead.
1100
Chris Lattner459ff992008-01-11 18:00:13 +00001101Likewise, we isel "x<<1" into "add reg,reg". If reg is spilled, this should
1102be folded into: shl [mem], 1
1103
Chris Lattner2b55ebd2007-12-24 19:27:46 +00001104//===---------------------------------------------------------------------===//
Chris Lattner64400952007-12-28 21:50:40 +00001105
1106This testcase misses a read/modify/write opportunity (from PR1425):
1107
1108void vertical_decompose97iH1(int *b0, int *b1, int *b2, int width){
1109 int i;
1110 for(i=0; i<width; i++)
1111 b1[i] += (1*(b0[i] + b2[i])+0)>>0;
1112}
1113
1114We compile it down to:
1115
1116LBB1_2: # bb
1117 movl (%esi,%edi,4), %ebx
1118 addl (%ecx,%edi,4), %ebx
1119 addl (%edx,%edi,4), %ebx
1120 movl %ebx, (%ecx,%edi,4)
1121 incl %edi
1122 cmpl %eax, %edi
1123 jne LBB1_2 # bb
1124
1125the inner loop should add to the memory location (%ecx,%edi,4), saving
1126a mov. Something like:
1127
1128 movl (%esi,%edi,4), %ebx
1129 addl (%edx,%edi,4), %ebx
1130 addl %ebx, (%ecx,%edi,4)
1131
Chris Lattnerbde73102007-12-29 05:51:58 +00001132Here is another interesting example:
1133
1134void vertical_compose97iH1(int *b0, int *b1, int *b2, int width){
1135 int i;
1136 for(i=0; i<width; i++)
1137 b1[i] -= (1*(b0[i] + b2[i])+0)>>0;
1138}
1139
1140We miss the r/m/w opportunity here by using 2 subs instead of an add+sub[mem]:
1141
1142LBB9_2: # bb
1143 movl (%ecx,%edi,4), %ebx
1144 subl (%esi,%edi,4), %ebx
1145 subl (%edx,%edi,4), %ebx
1146 movl %ebx, (%ecx,%edi,4)
1147 incl %edi
1148 cmpl %eax, %edi
1149 jne LBB9_2 # bb
1150
Additionally, LSR should rewrite the exit condition of these loops to use
a stride-4 IV, which would allow all the scales in the loop to go away.
This would result in smaller code and more efficient microops.
1154
1155//===---------------------------------------------------------------------===//
Chris Lattner0362a362008-01-07 21:59:58 +00001156
1157In SSE mode, we turn abs and neg into a load from the constant pool plus a xor
1158or and instruction, for example:
1159
Chris Lattnerb4cbb682008-01-09 00:37:18 +00001160 xorpd LCPI1_0, %xmm2
Chris Lattner0362a362008-01-07 21:59:58 +00001161
1162However, if xmm2 gets spilled, we end up with really ugly code like this:
1163
Chris Lattnerb4cbb682008-01-09 00:37:18 +00001164 movsd (%esp), %xmm0
1165 xorpd LCPI1_0, %xmm0
1166 movsd %xmm0, (%esp)
Chris Lattner0362a362008-01-07 21:59:58 +00001167
1168Since we 'know' that this is a 'neg', we can actually "fold" the spill into
1169the neg/abs instruction, turning it into an *integer* operation, like this:
1170
1171 xorl 2147483648, [mem+4] ## 2147483648 = (1 << 31)
1172
1173you could also use xorb, but xorl is less likely to lead to a partial register
Chris Lattnerb4cbb682008-01-09 00:37:18 +00001174stall. Here is a contrived testcase:
1175
1176double a, b, c;
1177void test(double *P) {
1178 double X = *P;
1179 a = X;
1180 bar();
1181 X = -X;
1182 b = X;
1183 bar();
1184 c = X;
1185}
Chris Lattner0362a362008-01-07 21:59:58 +00001186
1187//===---------------------------------------------------------------------===//
Andrew Lenharth785610d2008-02-16 01:24:58 +00001188
1189handling llvm.memory.barrier on pre SSE2 cpus
1190
1191should generate:
1192lock ; mov %esp, %esp
1193
1194//===---------------------------------------------------------------------===//
Chris Lattner7644ff32008-02-17 19:43:57 +00001195
1196The generated code on x86 for checking for signed overflow on a multiply the
1197obvious way is much longer than it needs to be.
1198
1199int x(int a, int b) {
1200 long long prod = (long long)a*b;
1201 return prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1);
1202}
1203
1204See PR2053 for more details.
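
The two-operand imul already tells us this: OF (and CF) are set exactly when
the full signed product does not fit in 32 bits. So a sketch of the code we
would like is:

_x:
	movl    4(%esp), %eax
	imull   8(%esp), %eax     # OF set iff the product overflows 32 bits
	seto    %al
	movzbl  %al, %eax
	ret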
1205
1206//===---------------------------------------------------------------------===//
Chris Lattner83f22362008-02-18 18:30:13 +00001207
We should investigate using cdq/cltd (effect: edx = sar eax, 31)
1209more aggressively; it should cost the same as a move+shift on any modern
1210processor, but it's a lot shorter. Downside is that it puts more
1211pressure on register allocation because it has fixed operands.
1212
1213Example:
1214int abs(int x) {return x < 0 ? -x : x;}
1215
1216gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.:
1217abs:
1218 movl 4(%esp), %eax
1219 cltd
1220 xorl %edx, %eax
1221 subl %edx, %eax
1222 ret
1223
1224//===---------------------------------------------------------------------===//
1225
1226Consider:
Chris Lattner83f22362008-02-18 18:30:13 +00001227int test(unsigned long a, unsigned long b) { return -(a < b); }
1228
1229We currently compile this to:
1230
1231define i32 @test(i32 %a, i32 %b) nounwind {
1232 %tmp3 = icmp ult i32 %a, %b ; <i1> [#uses=1]
1233 %tmp34 = zext i1 %tmp3 to i32 ; <i32> [#uses=1]
1234 %tmp5 = sub i32 0, %tmp34 ; <i32> [#uses=1]
1235 ret i32 %tmp5
1236}
1237
1238and
1239
1240_test:
1241 movl 8(%esp), %eax
1242 cmpl %eax, 4(%esp)
1243 setb %al
1244 movzbl %al, %eax
1245 negl %eax
1246 ret
1247
1248Several deficiencies here. First, we should instcombine zext+neg into sext:
1249
1250define i32 @test2(i32 %a, i32 %b) nounwind {
1251 %tmp3 = icmp ult i32 %a, %b ; <i1> [#uses=1]
1252 %tmp34 = sext i1 %tmp3 to i32 ; <i32> [#uses=1]
1253 ret i32 %tmp34
1254}
1255
1256However, before we can do that, we have to fix the bad codegen that we get for
1257sext from bool:
1258
1259_test2:
1260 movl 8(%esp), %eax
1261 cmpl %eax, 4(%esp)
1262 setb %al
1263 movzbl %al, %eax
1264 shll $31, %eax
1265 sarl $31, %eax
1266 ret
1267
1268This code should be at least as good as the code above. Once this is fixed, we
1269can optimize this specific case even more to:
1270
1271 movl 8(%esp), %eax
1272 xorl %ecx, %ecx
1273 cmpl %eax, 4(%esp)
1274 sbbl %ecx, %ecx
1275
1276//===---------------------------------------------------------------------===//
Eli Friedman1aa1f2c2008-02-28 00:21:43 +00001277
1278Take the following code (from
1279http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541):
1280
1281extern unsigned char first_one[65536];
1282int FirstOnet(unsigned long long arg1)
1283{
1284 if (arg1 >> 48)
1285 return (first_one[arg1 >> 48]);
1286 return 0;
1287}
1288
1289
1290The following code is currently generated:
1291FirstOnet:
1292 movl 8(%esp), %eax
1293 cmpl $65536, %eax
1294 movl 4(%esp), %ecx
1295 jb .LBB1_2 # UnifiedReturnBlock
1296.LBB1_1: # ifthen
1297 shrl $16, %eax
1298 movzbl first_one(%eax), %eax
1299 ret
1300.LBB1_2: # UnifiedReturnBlock
1301 xorl %eax, %eax
1302 ret
1303
1304There are a few possible improvements here:
13051. We should be able to eliminate the dead load into %ecx
13062. We could change the "movl 8(%esp), %eax" into
1307 "movzwl 10(%esp), %eax"; this lets us change the cmpl
1308 into a testl, which is shorter, and eliminate the shift.
1309
1310We could also in theory eliminate the branch by using a conditional
1311for the address of the load, but that seems unlikely to be worthwhile
1312in general.
1313
1314//===---------------------------------------------------------------------===//
1315
Chris Lattner44a98ac2008-02-28 04:52:59 +00001316We compile this function:
1317
1318define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext %d) nounwind {
1319entry:
1320 %tmp2 = icmp eq i8 %d, 0 ; <i1> [#uses=1]
1321 br i1 %tmp2, label %bb7, label %bb
1322
1323bb: ; preds = %entry
1324 %tmp6 = add i32 %b, %a ; <i32> [#uses=1]
1325 ret i32 %tmp6
1326
1327bb7: ; preds = %entry
1328 %tmp10 = sub i32 %a, %c ; <i32> [#uses=1]
1329 ret i32 %tmp10
1330}
1331
1332to:
1333
1334_foo:
1335 cmpb $0, 16(%esp)
1336 movl 12(%esp), %ecx
1337 movl 8(%esp), %eax
1338 movl 4(%esp), %edx
1339 je LBB1_2 # bb7
1340LBB1_1: # bb
1341 addl %edx, %eax
1342 ret
1343LBB1_2: # bb7
1344 movl %edx, %eax
1345 subl %ecx, %eax
1346 ret
1347
Gabor Greif02661592008-03-06 10:51:21 +00001348The coalescer could coalesce "edx" with "eax" to avoid the movl in LBB1_2
Chris Lattner44a98ac2008-02-28 04:52:59 +00001349if it commuted the addl in LBB1_1.
1350
1351//===---------------------------------------------------------------------===//
Evan Cheng921dcba2008-03-28 07:07:06 +00001352
1353See rdar://4653682.
1354
1355From flops:
1356
1357LBB1_15: # bb310
1358 cvtss2sd LCPI1_0, %xmm1
1359 addsd %xmm1, %xmm0
1360 movsd 176(%esp), %xmm2
1361 mulsd %xmm0, %xmm2
1362 movapd %xmm2, %xmm3
1363 mulsd %xmm3, %xmm3
1364 movapd %xmm3, %xmm4
1365 mulsd LCPI1_23, %xmm4
1366 addsd LCPI1_24, %xmm4
1367 mulsd %xmm3, %xmm4
1368 addsd LCPI1_25, %xmm4
1369 mulsd %xmm3, %xmm4
1370 addsd LCPI1_26, %xmm4
1371 mulsd %xmm3, %xmm4
1372 addsd LCPI1_27, %xmm4
1373 mulsd %xmm3, %xmm4
1374 addsd LCPI1_28, %xmm4
1375 mulsd %xmm3, %xmm4
1376 addsd %xmm1, %xmm4
1377 mulsd %xmm2, %xmm4
1378 movsd 152(%esp), %xmm1
1379 addsd %xmm4, %xmm1
1380 movsd %xmm1, 152(%esp)
1381 incl %eax
1382 cmpl %eax, %esi
1383 jge LBB1_15 # bb310
1384LBB1_16: # bb358.loopexit
1385 movsd 152(%esp), %xmm0
1386 addsd %xmm0, %xmm0
1387 addsd LCPI1_22, %xmm0
1388 movsd %xmm0, 152(%esp)
1389
1390Rather than spilling the result of the last addsd in the loop, we should have
inserted a copy to split the interval (one for the duration of the loop, one
1392extending to the fall through). The register pressure in the loop isn't high
1393enough to warrant the spill.
1394
1395Also check why xmm7 is not used at all in the function.
Chris Lattner16e5c782008-04-21 04:46:30 +00001396
1397//===---------------------------------------------------------------------===//
1398
1399Legalize loses track of the fact that bools are always zero extended when in
1400memory. This causes us to compile abort_gzip (from 164.gzip) from:
1401
1402target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
1403target triple = "i386-apple-darwin8"
1404@in_exit.4870.b = internal global i1 false ; <i1*> [#uses=2]
1405define fastcc void @abort_gzip() noreturn nounwind {
1406entry:
1407 %tmp.b.i = load i1* @in_exit.4870.b ; <i1> [#uses=1]
1408 br i1 %tmp.b.i, label %bb.i, label %bb4.i
1409bb.i: ; preds = %entry
1410 tail call void @exit( i32 1 ) noreturn nounwind
1411 unreachable
1412bb4.i: ; preds = %entry
1413 store i1 true, i1* @in_exit.4870.b
1414 tail call void @exit( i32 1 ) noreturn nounwind
1415 unreachable
1416}
1417declare void @exit(i32) noreturn nounwind
1418
1419into:
1420
1421_abort_gzip:
1422 subl $12, %esp
1423 movb _in_exit.4870.b, %al
1424 notb %al
1425 testb $1, %al
1426 jne LBB1_2 ## bb4.i
1427LBB1_1: ## bb.i
1428 ...
1429
1430//===---------------------------------------------------------------------===//
Chris Lattner7cb1d332008-05-05 23:19:45 +00001431
1432We compile:
1433
1434int test(int x, int y) {
1435 return x-y-1;
1436}
1437
1438into (-m64):
1439
1440_test:
1441 decl %edi
1442 movl %edi, %eax
1443 subl %esi, %eax
1444 ret
1445
1446it would be better to codegen as: x+~y (notl+addl)
Edwin Törökfa9d5e22008-10-24 19:23:07 +00001447
1448//===---------------------------------------------------------------------===//
1449
1450This code:
1451
1452int foo(const char *str,...)
1453{
1454 __builtin_va_list a; int x;
1455 __builtin_va_start(a,str); x = __builtin_va_arg(a,int); __builtin_va_end(a);
1456 return x;
1457}
1458
1459gets compiled into this on x86-64:
1460 subq $200, %rsp
1461 movaps %xmm7, 160(%rsp)
1462 movaps %xmm6, 144(%rsp)
1463 movaps %xmm5, 128(%rsp)
1464 movaps %xmm4, 112(%rsp)
1465 movaps %xmm3, 96(%rsp)
1466 movaps %xmm2, 80(%rsp)
1467 movaps %xmm1, 64(%rsp)
1468 movaps %xmm0, 48(%rsp)
1469 movq %r9, 40(%rsp)
1470 movq %r8, 32(%rsp)
1471 movq %rcx, 24(%rsp)
1472 movq %rdx, 16(%rsp)
1473 movq %rsi, 8(%rsp)
1474 leaq (%rsp), %rax
1475 movq %rax, 192(%rsp)
1476 leaq 208(%rsp), %rax
1477 movq %rax, 184(%rsp)
1478 movl $48, 180(%rsp)
1479 movl $8, 176(%rsp)
1480 movl 176(%rsp), %eax
1481 cmpl $47, %eax
1482 jbe .LBB1_3 # bb
1483.LBB1_1: # bb3
1484 movq 184(%rsp), %rcx
1485 leaq 8(%rcx), %rax
1486 movq %rax, 184(%rsp)
1487.LBB1_2: # bb4
1488 movl (%rcx), %eax
1489 addq $200, %rsp
1490 ret
1491.LBB1_3: # bb
1492 movl %eax, %ecx
1493 addl $8, %eax
1494 addq 192(%rsp), %rcx
1495 movl %eax, 176(%rsp)
1496 jmp .LBB1_2 # bb4
1497
1498gcc 4.3 generates:
1499 subq $96, %rsp
1500.LCFI0:
1501 leaq 104(%rsp), %rax
1502 movq %rsi, -80(%rsp)
1503 movl $8, -120(%rsp)
1504 movq %rax, -112(%rsp)
1505 leaq -88(%rsp), %rax
1506 movq %rax, -104(%rsp)
1507 movl $8, %eax
1508 cmpl $48, %eax
1509 jb .L6
1510 movq -112(%rsp), %rdx
1511 movl (%rdx), %eax
1512 addq $96, %rsp
1513 ret
1514 .p2align 4,,10
1515 .p2align 3
1516.L6:
1517 mov %eax, %edx
1518 addq -104(%rsp), %rdx
1519 addl $8, %eax
1520 movl %eax, -120(%rsp)
1521 movl (%rdx), %eax
1522 addq $96, %rsp
1523 ret
1524
1525and it gets compiled into this on x86:
1526 pushl %ebp
1527 movl %esp, %ebp
1528 subl $4, %esp
1529 leal 12(%ebp), %eax
1530 movl %eax, -4(%ebp)
1531 leal 16(%ebp), %eax
1532 movl %eax, -4(%ebp)
1533 movl 12(%ebp), %eax
1534 addl $4, %esp
1535 popl %ebp
1536 ret
1537
1538gcc 4.3 generates:
1539 pushl %ebp
1540 movl %esp, %ebp
1541 movl 12(%ebp), %eax
1542 popl %ebp
1543 ret
Evan Chengbf97bec2008-11-11 17:35:52 +00001544
1545//===---------------------------------------------------------------------===//
1546
1547Teach tblgen not to check bitconvert source type in some cases. This allows us
1548to consolidate the following patterns in X86InstrMMX.td:
1549
1550def : Pat<(v2i32 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1551 (iPTR 0))))),
1552 (v2i32 (MMX_MOVDQ2Qrr VR128:$src))>;
1553def : Pat<(v4i16 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1554 (iPTR 0))))),
1555 (v4i16 (MMX_MOVDQ2Qrr VR128:$src))>;
1556def : Pat<(v8i8 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1557 (iPTR 0))))),
1558 (v8i8 (MMX_MOVDQ2Qrr VR128:$src))>;
1559
1560There are other cases in various td files.
Eli Friedman9ab1db02008-11-30 07:52:27 +00001561
1562//===---------------------------------------------------------------------===//
1563
1564Take something like the following on x86-32:
1565unsigned a(unsigned long long x, unsigned y) {return x % y;}
1566
1567We currently generate a libcall, but we really shouldn't: the expansion is
1568shorter and likely faster than the libcall. The expected code is something
1569like the following:
1570
1571 movl 12(%ebp), %eax
1572 movl 16(%ebp), %ecx
1573 xorl %edx, %edx
1574 divl %ecx
1575 movl 8(%ebp), %eax
1576 divl %ecx
1577 movl %edx, %eax
1578 ret
1579
1580A similar code sequence works for division.
1581
1582//===---------------------------------------------------------------------===//
Chris Lattnerbfccda62008-12-06 22:49:05 +00001583
1584These should compile to the same code, but the later codegen's to useless
1585instructions on X86. This may be a trivial dag combine (GCC PR7061):
1586
1587struct s1 { unsigned char a, b; };
1588unsigned long f1(struct s1 x) {
1589 return x.a + x.b;
1590}
1591struct s2 { unsigned a: 8, b: 8; };
1592unsigned long f2(struct s2 x) {
1593 return x.a + x.b;
1594}
1595
1596//===---------------------------------------------------------------------===//
1597
Chris Lattner9f34dc62009-02-08 20:44:19 +00001598We currently compile this:
1599
1600define i32 @func1(i32 %v1, i32 %v2) nounwind {
1601entry:
1602 %t = call {i32, i1} @llvm.sadd.with.overflow.i32(i32 %v1, i32 %v2)
1603 %sum = extractvalue {i32, i1} %t, 0
1604 %obit = extractvalue {i32, i1} %t, 1
1605 br i1 %obit, label %overflow, label %normal
1606normal:
1607 ret i32 %sum
1608overflow:
1609 call void @llvm.trap()
1610 unreachable
1611}
1612declare {i32, i1} @llvm.sadd.with.overflow.i32(i32, i32)
1613declare void @llvm.trap()
1614
1615to:
1616
1617_func1:
1618 movl 4(%esp), %eax
1619 addl 8(%esp), %eax
1620 jo LBB1_2 ## overflow
1621LBB1_1: ## normal
1622 ret
1623LBB1_2: ## overflow
1624 ud2
1625
1626it would be nice to produce "into" someday.
1627
1628//===---------------------------------------------------------------------===//
Chris Lattner09c650b2009-02-17 01:16:14 +00001629
1630This code:
1631
1632void vec_mpys1(int y[], const int x[], int scaler) {
1633int i;
1634for (i = 0; i < 150; i++)
1635 y[i] += (((long long)scaler * (long long)x[i]) >> 31);
1636}
1637
1638Compiles to this loop with GCC 3.x:
1639
1640.L5:
1641 movl %ebx, %eax
1642 imull (%edi,%ecx,4)
1643 shrdl $31, %edx, %eax
1644 addl %eax, (%esi,%ecx,4)
1645 incl %ecx
1646 cmpl $149, %ecx
1647 jle .L5
1648
1649llvm-gcc compiles it to the much uglier:
1650
1651LBB1_1: ## bb1
1652 movl 24(%esp), %eax
1653 movl (%eax,%edi,4), %ebx
1654 movl %ebx, %ebp
1655 imull %esi, %ebp
1656 movl %ebx, %eax
1657 mull %ecx
1658 addl %ebp, %edx
1659 sarl $31, %ebx
1660 imull %ecx, %ebx
1661 addl %edx, %ebx
1662 shldl $1, %eax, %ebx
1663 movl 20(%esp), %eax
1664 addl %ebx, (%eax,%edi,4)
1665 incl %edi
1666 cmpl $150, %edi
1667 jne LBB1_1 ## bb1
1668
Eli Friedman0427ac12009-12-21 08:03:16 +00001669The issue is that we hoist the cast of "scaler" to long long outside of the
1670loop, the value comes into the loop as two values, and
1671RegsForValue::getCopyFromRegs doesn't know how to put an AssertSext on the
1672constructed BUILD_PAIR which represents the cast value.
1673
Chris Lattner09c650b2009-02-17 01:16:14 +00001674//===---------------------------------------------------------------------===//
Chris Lattner8eca7c72009-03-08 01:54:43 +00001675
Dan Gohman5edb7382009-03-09 23:47:02 +00001676Test instructions can be eliminated by using EFLAGS values from arithmetic
Dan Gohman8bb7db62009-03-10 00:26:23 +00001677instructions. This is currently not done for mul, and, or, xor, neg, shl,
1678sra, srl, shld, shrd, atomic ops, and others. It is also currently not done
1679for read-modify-write instructions. It is also current not done if the
1680OF or CF flags are needed.
Dan Gohman5edb7382009-03-09 23:47:02 +00001681
1682The shift operators have the complication that when the shift count is
1683zero, EFLAGS is not set, so they can only subsume a test instruction if
Dan Gohman8bb7db62009-03-10 00:26:23 +00001684the shift count is known to be non-zero. Also, using the EFLAGS value
1685from a shift is apparently very slow on some x86 implementations.
Dan Gohman5edb7382009-03-09 23:47:02 +00001686
1687In read-modify-write instructions, the root node in the isel match is
1688the store, and isel has no way for the use of the EFLAGS result of the
1689arithmetic to be remapped to the new node.
1690
Add and subtract instructions set OF on signed overflow and CF on unsigned
1692overflow, while test instructions always clear OF and CF. In order to
1693replace a test with an add or subtract in a situation where OF or CF is
1694needed, codegen must be able to prove that the operation cannot see
1695signed or unsigned overflow, respectively.
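
A small motivating example for the 'and' case (a sketch):

int f(int x, int y) {
  int z = x & y;          /* andl already sets ZF according to z, so the   */
  return z == 0 ? 5 : z;  /* testl of z against zero before the cmov/branch
                             is redundant */
}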
1696
Dan Gohman5edb7382009-03-09 23:47:02 +00001697//===---------------------------------------------------------------------===//
1698
Chris Lattnerfe8b5592009-03-08 03:04:26 +00001699memcpy/memmove do not lower to SSE copies when possible. A silly example is:
1700define <16 x float> @foo(<16 x float> %A) nounwind {
1701 %tmp = alloca <16 x float>, align 16
1702 %tmp2 = alloca <16 x float>, align 16
1703 store <16 x float> %A, <16 x float>* %tmp
1704 %s = bitcast <16 x float>* %tmp to i8*
1705 %s2 = bitcast <16 x float>* %tmp2 to i8*
1706 call void @llvm.memcpy.i64(i8* %s, i8* %s2, i64 64, i32 16)
1707 %R = load <16 x float>* %tmp2
1708 ret <16 x float> %R
1709}
1710
1711declare void @llvm.memcpy.i64(i8* nocapture, i8* nocapture, i64, i32) nounwind
1712
1713which compiles to:
1714
1715_foo:
1716 subl $140, %esp
1717 movaps %xmm3, 112(%esp)
1718 movaps %xmm2, 96(%esp)
1719 movaps %xmm1, 80(%esp)
1720 movaps %xmm0, 64(%esp)
1721 movl 60(%esp), %eax
1722 movl %eax, 124(%esp)
1723 movl 56(%esp), %eax
1724 movl %eax, 120(%esp)
1725 movl 52(%esp), %eax
1726 <many many more 32-bit copies>
1727 movaps (%esp), %xmm0
1728 movaps 16(%esp), %xmm1
1729 movaps 32(%esp), %xmm2
1730 movaps 48(%esp), %xmm3
1731 addl $140, %esp
1732 ret
1733
1734On Nehalem, it may even be cheaper to just use movups when unaligned than to
1735fall back to lower-granularity chunks.
1736
1737//===---------------------------------------------------------------------===//
Chris Lattnere00a4e02009-05-25 20:28:19 +00001738
1739Implement processor-specific optimizations for parity with GCC on these
1740processors. GCC does two optimizations:
1741
17421. ix86_pad_returns inserts a noop before ret instructions if immediately
   preceded by a conditional branch or is the target of a jump.
17442. ix86_avoid_jump_misspredicts inserts noops in cases where a 16-byte block of
1745 code contains more than 3 branches.
1746
1747The first one is done for all AMDs, Core2, and "Generic"
1748The second one is done for: Atom, Pentium Pro, all AMDs, Pentium 4, Nocona,
1749 Core 2, and "Generic"
1750
1751//===---------------------------------------------------------------------===//
Eli Friedmandd959242009-06-11 23:07:04 +00001752
1753Testcase:
1754int a(int x) { return (x & 127) > 31; }
1755
1756Current output:
1757 movl 4(%esp), %eax
1758 andl $127, %eax
1759 cmpl $31, %eax
1760 seta %al
1761 movzbl %al, %eax
1762 ret
1763
1764Ideal output:
1765 xorl %eax, %eax
1766 testl $96, 4(%esp)
1767 setne %al
1768 ret
1769
Chris Lattner20610972009-06-16 06:11:35 +00001770This should definitely be done in instcombine, canonicalizing the range
1771condition into a != condition. We get this IR:
1772
1773define i32 @a(i32 %x) nounwind readnone {
1774entry:
1775 %0 = and i32 %x, 127 ; <i32> [#uses=1]
1776 %1 = icmp ugt i32 %0, 31 ; <i1> [#uses=1]
1777 %2 = zext i1 %1 to i32 ; <i32> [#uses=1]
1778 ret i32 %2
1779}
1780
1781Instcombine prefers to strength reduce relational comparisons to equality
1782comparisons when possible, this should be another case of that. This could
1783be handled pretty easily in InstCombiner::visitICmpInstWithInstAndIntCst, but it
1784looks like InstCombiner::visitICmpInstWithInstAndIntCst should really already
1785be redesigned to use ComputeMaskedBits and friends.
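
Either way, the canonicalized IR would look something like this (a sketch;
96 is the part of the 127 mask above 31):

define i32 @a(i32 %x) nounwind readnone {
entry:
  %0 = and i32 %x, 96
  %1 = icmp ne i32 %0, 0
  %2 = zext i1 %1 to i32
  ret i32 %2
}

which should codegen to the ideal xorl/testl/setne sequence above.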
1786
Eli Friedmandd959242009-06-11 23:07:04 +00001787
1788//===---------------------------------------------------------------------===//
1789Testcase:
1790int x(int a) { return (a&0xf0)>>4; }
1791
1792Current output:
1793 movl 4(%esp), %eax
1794 shrl $4, %eax
1795 andl $15, %eax
1796 ret
1797
1798Ideal output:
1799 movzbl 4(%esp), %eax
1800 shrl $4, %eax
1801 ret
1802
1803//===---------------------------------------------------------------------===//
1804
1805Testcase:
1806int x(int a) { return (a & 0x80) ? 0x100 : 0; }
Chris Lattner498e6ad2009-06-16 06:15:56 +00001807int y(int a) { return (a & 0x80) *2; }
Eli Friedmandd959242009-06-11 23:07:04 +00001808
Chris Lattner498e6ad2009-06-16 06:15:56 +00001809Current:
Eli Friedmandd959242009-06-11 23:07:04 +00001810 testl $128, 4(%esp)
1811 setne %al
1812 movzbl %al, %eax
1813 shll $8, %eax
1814 ret
1815
Chris Lattner498e6ad2009-06-16 06:15:56 +00001816Better:
Eli Friedmandd959242009-06-11 23:07:04 +00001817 movl 4(%esp), %eax
1818 addl %eax, %eax
1819 andl $256, %eax
1820 ret
1821
Chris Lattner498e6ad2009-06-16 06:15:56 +00001822This is another general instcombine transformation that is profitable on all
1823targets. In LLVM IR, these functions look like this:
1824
1825define i32 @x(i32 %a) nounwind readnone {
1826entry:
1827 %0 = and i32 %a, 128
1828 %1 = icmp eq i32 %0, 0
1829 %iftmp.0.0 = select i1 %1, i32 0, i32 256
1830 ret i32 %iftmp.0.0
1831}
1832
1833define i32 @y(i32 %a) nounwind readnone {
1834entry:
1835 %0 = shl i32 %a, 1
1836 %1 = and i32 %0, 256
1837 ret i32 %1
1838}
1839
1840Replacing an icmp+select with a shift should always be considered profitable in
1841instcombine.
Eli Friedmandd959242009-06-11 23:07:04 +00001842
1843//===---------------------------------------------------------------------===//
Evan Cheng454211e2009-07-30 08:56:19 +00001844
1845Re-implement atomic builtins __sync_add_and_fetch() and __sync_sub_and_fetch
1846properly.
1847
1848When the return value is not used (i.e. only care about the value in the
1849memory), x86 does not have to use add to implement these. Instead, it can use
1850add, sub, inc, dec instructions with the "lock" prefix.
1851
1852This is currently implemented using a bit of instruction selection trick. The
1853issue is the target independent pattern produces one output and a chain and we
1854want to map it into one that just output a chain. The current trick is to select
1855it into a MERGE_VALUES with the first definition being an implicit_def. The
1856proper solution is to add new ISD opcodes for the no-output variant. DAG
1857combiner can then transform the node before it gets to target node selection.
1858
1859Problem #2 is we are adding a whole bunch of x86 atomic instructions when in
1860fact these instructions are identical to the non-lock versions. We need a way to
1861add target specific information to target nodes and have this information
1862carried over to machine instructions. Asm printer (or JIT) can use this
1863information to add the "lock" prefix.
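
For example, when the result of __sync_add_and_fetch is ignored:

void inc(volatile int *p) {
  __sync_add_and_fetch(p, 1);
}

we would like to emit just (x86-64 sketch):

	lock
	incl    (%rdi)
	ret

instead of, say, a lock xaddl plus an extra add to materialize the unused
return value.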
Bill Wendling85196b52009-10-27 22:34:43 +00001864
1865//===---------------------------------------------------------------------===//
Eli Friedman2a81a582010-02-10 21:26:04 +00001866
1867_Bool bar(int *x) { return *x & 1; }
1868
1869define zeroext i1 @bar(i32* nocapture %x) nounwind readonly {
1870entry:
1871 %tmp1 = load i32* %x ; <i32> [#uses=1]
1872 %and = and i32 %tmp1, 1 ; <i32> [#uses=1]
1873 %tobool = icmp ne i32 %and, 0 ; <i1> [#uses=1]
1874 ret i1 %tobool
1875}
1876
1877bar: # @bar
1878# BB#0: # %entry
1879 movl 4(%esp), %eax
1880 movb (%eax), %al
1881 andb $1, %al
1882 movzbl %al, %eax
1883 ret
1884
1885Missed optimization: should be movl+andl.
1886
1887//===---------------------------------------------------------------------===//
1888
1889Consider the following two functions compiled with clang:
1890_Bool foo(int *x) { return !(*x & 4); }
1891unsigned bar(int *x) { return !(*x & 4); }
1892
1893foo:
1894 movl 4(%esp), %eax
1895 testb $4, (%eax)
1896 sete %al
1897 movzbl %al, %eax
1898 ret
1899
1900bar:
1901 movl 4(%esp), %eax
1902 movl (%eax), %eax
1903 shrl $2, %eax
1904 andl $1, %eax
1905 xorl $1, %eax
1906 ret
1907
The second function generates more code even though the two functions are
functionally identical.
1910
1911//===---------------------------------------------------------------------===//
1912
1913Take the following C code:
1914int x(int y) { return (y & 63) << 14; }
1915
1916Code produced by gcc:
1917 andl $63, %edi
1918 sall $14, %edi
1919 movl %edi, %eax
1920 ret
1921
1922Code produced by clang:
1923 shll $14, %edi
1924 movl %edi, %eax
1925 andl $1032192, %eax
1926 ret
1927
1928The code produced by gcc is 3 bytes shorter. This sort of construct often
1929shows up with bitfields.
1930
1931//===---------------------------------------------------------------------===//