Blame - lib/Target/X86/README.txt - fp2-dev/platform/external/llvm

Chris Lattner	1171ff4	2005-10-23 19:52:42 +0000	[diff] [blame]	1	//===---------------------------------------------------------------------===//
				2	// Random ideas for the X86 backend.
				3	//===---------------------------------------------------------------------===//
				4
Chris Lattner	f9dc644	2009-05-25 16:34:44 +0000	[diff] [blame]	5	We should add support for the "movbe" instruction, which does a byte-swapping
				6	copy (3-addr bswap + memory support?). This is available on Atom processors.
Chris Lattner	8ceb0fd	2007-04-03 23:41:34 +0000	[diff] [blame]	7
				8	//===---------------------------------------------------------------------===//
				9
Chris Lattner	6ef3307	2007-03-28 18:17:19 +0000	[diff] [blame]	10	CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move. The X86
				11	backend knows how to three-addressify this shift, but it appears the register
				12	allocator isn't even asking it to do so in this case. We should investigate
				13	why this isn't happening; it could have a significant impact on other important
				14	cases for X86 as well.
				15
				16	//===---------------------------------------------------------------------===//
				17
Chris Lattner	1171ff4	2005-10-23 19:52:42 +0000	[diff] [blame]	18	This should be one DIV/IDIV instruction, not a libcall:
				19
				20	unsigned test(unsigned long long X, unsigned Y) {
				21	return X/Y;
				22	}
				23
				24	This can be done trivially with a custom legalizer. What about overflow
				25	though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
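
On the overflow question: the 32-bit DIV divides EDX:EAX by its operand and
faults when the quotient does not fit in 32 bits, which happens exactly when
the high half of the dividend is not below the divisor. A minimal C sketch of
that guard follows. The names div_fits_32 and test_guarded are hypothetical
and not part of LLVM; the second branch only stands in for the libcall path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when a single 64/32 -> 32 DIV would not fault: the quotient fits in
   32 bits exactly when the high half of x is below y. This also rejects
   y == 0. Hypothetical helper, for illustration only. */
bool div_fits_32(uint64_t x, uint32_t y) {
  return (uint32_t)(x >> 32) < y;
}

/* Same result as test() above, with the two cases made explicit. */
uint32_t test_guarded(uint64_t x, uint32_t y) {
  assert(y != 0);
  if (div_fits_32(x, y))
    return (uint32_t)(x / y);  /* one DIV would be enough here */
  return (uint32_t)(x / y);    /* full 64-bit divide, result truncated */
}
```

Whether the guard is cheaper than the libcall is a separate question; the
sketch only shows when the single instruction is safe.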
				26
				27	//===---------------------------------------------------------------------===//
				28
Chris Lattner	1171ff4	2005-10-23 19:52:42 +0000	[diff] [blame]	29	Improvements to the multiply -> shift/add algorithm:
				30	http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
				31
				32	//===---------------------------------------------------------------------===//
				33
				34	Improve code like this (occurs fairly frequently, e.g. in LLVM):
				35	long long foo(int x) { return 1LL << x; }
				36
				37	http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
				38	http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
				39	http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html
				40
				41	Another useful one would be ~0ULL >> X and ~0ULL << X.
				42
Chris Lattner	95af34e	2006-09-13 03:54:54 +0000	[diff] [blame]	43	One better solution for 1LL << x is:
				44	xorl %eax, %eax
				45	xorl %edx, %edx
				46	testb $32, %cl
				47	sete %al
				48	setne %dl
				49	sall %cl, %eax
				50	sall %cl, %edx
				51
				52	But that requires good 8-bit subreg support.
				53
Eli Friedman	a2e7efa	2008-02-21 21:16:49 +0000	[diff] [blame]	54	Also, this might be better. It's an extra shift, but it's one instruction
				55	shorter, and doesn't stress 8-bit subreg support.
				56	(From http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html,
				57	but without the unnecessary and.)
				58	movl %ecx, %eax
				59	shrl $5, %eax
				60	movl %eax, %edx
				61	xorl $1, %edx
				62	sall %cl, %eax
				63	sall %cl, %edx
				64
Chris Lattner	f73fb88	2006-09-18 05:36:54 +0000	[diff] [blame]	65	64-bit shifts (in general) expand to really bad code. Instead of using
				66	cmovs, we should expand to a conditional branch like GCC produces.
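
For reference, the sete/setne sequence for 1LL << x can be written out in C
as below. This is only a sketch of the idea, assuming x is in [0, 63]; the
name shl_one is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 1LL << x built from two 32-bit halves, mirroring the sete/setne code. */
uint64_t shl_one(unsigned x) {
  assert(x < 64);
  uint32_t lo = (x & 32) == 0;  /* sete %al: the bit lands in the low half */
  uint32_t hi = (x & 32) != 0;  /* setne %dl: the bit lands in the high half */
  lo <<= (x & 31);              /* sall %cl: hardware masks the count to 5 bits */
  hi <<= (x & 31);
  return ((uint64_t)hi << 32) | lo;
}
```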
Chris Lattner	95af34e	2006-09-13 03:54:54 +0000	[diff] [blame]	67
Chris Lattner	ffff617	2005-10-23 21:44:59 +0000	[diff] [blame]	68	//===---------------------------------------------------------------------===//
				69
Chris Lattner	1e4ed93	2005-11-28 04:52:39 +0000	[diff] [blame]	70	Compile this:
				71	_Bool f(_Bool a) { return a!=1; }
				72
				73	into:
				74	movzbl %dil, %eax
				75	xorl $1, %eax
				76	ret
Evan Cheng	8dee8cc	2005-12-17 01:25:19 +0000	[diff] [blame]	77
Eli Friedman	a2e7efa	2008-02-21 21:16:49 +0000	[diff] [blame]	78	(Although note that this isn't a legal way to express the code that llvm-gcc
				79	currently generates for that function.)
				80
Evan Cheng	8dee8cc	2005-12-17 01:25:19 +0000	[diff] [blame]	81	//===---------------------------------------------------------------------===//
				82
				83	Some isel ideas:
				84
				85	1. Dynamic programming based approach when compile time is not an
				86	issue.
				87	2. Code duplication (addressing mode) during isel.
				88	3. Other ideas from "Register-Sensitive Selection, Duplication, and
				89	Sequencing of Instructions".
Chris Lattner	cb29890	2006-02-08 07:12:07 +0000	[diff] [blame]	90	4. Scheduling for reduced register pressure. E.g. "Minimum Register
				91	Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
				92	and other related papers.
				93	http://citeseer.ist.psu.edu/govindarajan01minimum.html
Evan Cheng	8dee8cc	2005-12-17 01:25:19 +0000	[diff] [blame]	94
				95	//===---------------------------------------------------------------------===//
				96
				97	Should we promote i16 to i32 to avoid partial register update stalls?
Evan Cheng	98abbfb	2005-12-17 06:54:43 +0000	[diff] [blame]	98
				99	//===---------------------------------------------------------------------===//
				100
				101	Leave any_extend as pseudo instruction and hint to register
				102	allocator. Delay codegen until post register allocation.
Evan Cheng	1c5d83c	2007-10-12 18:22:55 +0000	[diff] [blame]	103	Note: any_extend is now turned into an INSERT_SUBREG. We still need to teach
				104	the coalescer how to deal with it though.
Evan Cheng	a3195e8	2006-01-12 22:54:21 +0000	[diff] [blame]	105
				106	//===---------------------------------------------------------------------===//
				107
Evan Cheng	698b638	2007-07-17 18:39:45 +0000	[diff] [blame]	108	It appears icc uses push for parameter passing. Need to investigate.
Chris Lattner	1db4b4f	2006-01-16 17:53:00 +0000	[diff] [blame]	109
				110	//===---------------------------------------------------------------------===//
				111
				112	Only use inc/neg/not instructions on processors where they are faster than
				113	add/sub/xor. They are slower on the P4 due to only updating some processor
				114	flags.
				115
				116	//===---------------------------------------------------------------------===//
				117
Chris Lattner	b638cd8	2006-01-29 09:08:15 +0000	[diff] [blame]	118	The instruction selector sometimes misses folding a load into a compare. The
				119	pattern is written as (cmp reg, (load p)). Because the compare isn't
				120	commutative, it is not matched with the load on both sides. The dag combiner
				121	should be made smart enough to canonicalize the load into the RHS of a compare
				122	when it can invert the result of the compare for free.
				123
Evan Cheng	0f4aa6e	2006-09-11 05:25:15 +0000	[diff] [blame]	124	//===---------------------------------------------------------------------===//
				125
Chris Lattner	9acddcd	2006-02-02 19:16:34 +0000	[diff] [blame]	126	In many cases, LLVM generates code like this:
				127
				128	_test:
				129	movl 8(%esp), %eax
				130	cmpl %eax, 4(%esp)
				131	setl %al
				132	movzbl %al, %eax
				133	ret
				134
				135	On some processors (which ones?), it is more efficient to do this:
				136
				137	_test:
				138	movl 8(%esp), %ebx
Evan Cheng	0f4aa6e	2006-09-11 05:25:15 +0000	[diff] [blame]	139	xor %eax, %eax
Chris Lattner	9acddcd	2006-02-02 19:16:34 +0000	[diff] [blame]	140	cmpl %ebx, 4(%esp)
				141	setl %al
				142	ret
				143
				144	Doing this correctly is tricky though, as the xor clobbers the flags.
				145
Chris Lattner	d395d09	2006-02-02 19:43:28 +0000	[diff] [blame]	146	//===---------------------------------------------------------------------===//
				147
Chris Lattner	8f77b73	2006-02-08 06:52:06 +0000	[diff] [blame]	148	We should generate bts/btr/etc instructions on targets where they are cheap or
				149	when codesize is important. e.g., for:
				150
				151	void setbit(int *target, int bit) {
				152	*target |= (1 << bit);
				153	}
				154	void clearbit(int *target, int bit) {
				155	*target &= ~(1 << bit);
				156	}
				157
Chris Lattner	dba382b	2006-02-08 17:47:22 +0000	[diff] [blame]	158	//===---------------------------------------------------------------------===//
				159
Evan Cheng	952b7d6	2006-02-14 08:25:32 +0000	[diff] [blame]	160	Instead of the following for memset char*, 1, 10:
				161
				162	movl $16843009, 4(%edx)
				163	movl $16843009, (%edx)
				164	movw $257, 8(%edx)
				165
				166	It might be better to generate
				167
				168	movl $16843009, %eax
				169	movl %eax, 4(%edx)
				170	movl %eax, (%edx)
				171	movw %ax, 8(%edx)
				172
				173	when we can spare a register. It reduces code size.
Evan Cheng	7634ac4	2006-02-17 00:04:28 +0000	[diff] [blame]	174
				175	//===---------------------------------------------------------------------===//
				176
Chris Lattner	a648df2	2006-02-17 04:20:13 +0000	[diff] [blame]	177	Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
				178	get this:
				179
Eli Friedman	41ce5b8	2008-02-28 00:21:43 +0000	[diff] [blame]	180	define i32 @test1(i32 %X) {
				181	%Y = sdiv i32 %X, 8
				182	ret i32 %Y
Chris Lattner	a648df2	2006-02-17 04:20:13 +0000	[diff] [blame]	183	}
				184
				185	_test1:
				186	movl 4(%esp), %eax
				187	movl %eax, %ecx
				188	sarl $31, %ecx
				189	shrl $29, %ecx
				190	addl %ecx, %eax
				191	sarl $3, %eax
				192	ret
				193
				194	GCC knows several different ways to codegen it, one of which is this:
				195
				196	_test1:
				197	movl 4(%esp), %eax
				198	cmpl $-1, %eax
				199	leal 7(%eax), %ecx
				200	cmovle %ecx, %eax
				201	sarl $3, %eax
				202	ret
				203
				204	which is probably slower, but it's interesting at least :)
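
Both sequences compute the same thing: bias negative inputs by 7 so that the
arithmetic shift rounds toward zero. A C sketch of the two, assuming >> on a
negative int32_t is an arithmetic shift (true for GCC and Clang on x86, but
implementation-defined in ISO C). The function names are made up.

```c
#include <stdint.h>

/* Shift-based bias, as in the current LLVM output. */
int32_t sdiv8_shift(int32_t x) {
  uint32_t bias = (uint32_t)(x >> 31) >> 29;  /* sarl $31; shrl $29: 7 if x < 0, else 0 */
  return (x + (int32_t)bias) >> 3;            /* addl; sarl $3 */
}

/* Select-based bias, as in the GCC output. */
int32_t sdiv8_cmov(int32_t x) {
  int32_t biased = x < 0 ? x + 7 : x;         /* cmpl $-1; leal 7(%eax); cmovle */
  return biased >> 3;                         /* sarl $3 */
}
```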
				205
Evan Cheng	755ee8f	2006-02-20 19:58:27 +0000	[diff] [blame]	206	//===---------------------------------------------------------------------===//
				207
Nate Begeman	c02e5a8	2006-03-26 19:19:27 +0000	[diff] [blame]	208	We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and rep/movsl.
				209	We should leave these as libcalls for everything over a much lower threshold,
				210	since libc is hand tuned for medium and large mem ops (avoiding RFO for large
				211	stores, TLB preheating, etc).
				212
				213	//===---------------------------------------------------------------------===//
				214
Chris Lattner	181b9c6	2006-03-09 01:39:46 +0000	[diff] [blame]	215	Optimize this into something reasonable:
				216	x * copysign(1.0, y) * copysign(1.0, z)
				217
				218	//===---------------------------------------------------------------------===//
				219
				220	Optimize copysign(x, *y) to use an integer load from y.
				221
				222	//===---------------------------------------------------------------------===//
				223
Evan Cheng	0def9c3	2006-03-19 06:08:11 +0000	[diff] [blame]	224	The following tests perform worse with LSR:
				225
				226	lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesort.
Chris Lattner	8bcf926	2006-03-19 22:27:41 +0000	[diff] [blame]	227
				228	//===---------------------------------------------------------------------===//
				229
Evan Cheng	8af5ef9	2006-04-05 23:46:04 +0000	[diff] [blame]	230	Adding to the list of cmp / test poor codegen issues:
				231
				232	int test(__m128 A, __m128 B) {
				233	if (_mm_comige_ss(A, B))
				234	return 3;
				235	else
				236	return 4;
				237	}
				238
				239	_test:
				240	movl 8(%esp), %eax
				241	movaps (%eax), %xmm0
				242	movl 4(%esp), %eax
				243	movaps (%eax), %xmm1
				244	comiss %xmm0, %xmm1
				245	setae %al
				246	movzbl %al, %ecx
				247	movl $3, %eax
				248	movl $4, %edx
				249	cmpl $0, %ecx
				250	cmove %edx, %eax
				251	ret
				252
				253	Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There
				254	are a number of issues. 1) We are introducing a setcc between the result of the
				255	intrinsic call and select. 2) The intrinsic is expected to produce an i32 value
				256	so an any extend (which becomes a zero extend) is added.
				257
				258	We probably need some kind of target DAG combine hook to fix this.
Evan Cheng	573cb7c	2006-04-06 23:21:24 +0000	[diff] [blame]	259
				260	//===---------------------------------------------------------------------===//
				261
Chris Lattner	57a6c13	2006-04-23 19:47:09 +0000	[diff] [blame]	262	We generate significantly worse code for this than GCC:
				263	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
				264	http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701
				265
				266	There is also one case we do worse on PPC.
				267
				268	//===---------------------------------------------------------------------===//
Evan Cheng	d7ec518	2006-04-24 23:30:10 +0000	[diff] [blame]	269
Evan Cheng	5a622f2	2006-05-30 07:37:37 +0000	[diff] [blame]	270	For this:
				271
				272	int test(int a)
				273	{
				274	return a * 3;
				275	}
				276
				277	We currently emit
				278	imull $3, 4(%esp), %eax
				279
				280	Perhaps this is what we really should generate? Is imull three or four
				281	cycles? Note: ICC generates this:
				282	movl 4(%esp), %eax
				283	leal (%eax,%eax,2), %eax
				284
				285	The current instruction priority is based on pattern complexity. The former is
				286	more "complex" because it folds a load so the latter will not be emitted.
				287
				288	Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
				289	should always try to match LEA first since the LEA matching code does some
				290	estimate to determine whether the match is profitable.
				291
				292	However, if we care more about code size, then imull is better. It's two bytes
				293	shorter than movl + leal.
Evan Cheng	c21051f	2006-06-04 09:08:00 +0000	[diff] [blame]	294
Eli Friedman	91db527	2008-11-30 07:52:27 +0000	[diff] [blame]	295	On a Pentium M, both variants have the same characteristics with regard
				296	to throughput; however, the multiplication has a latency of four cycles, as
				297	opposed to two cycles for the movl+lea variant.
				298
Evan Cheng	c21051f	2006-06-04 09:08:00 +0000	[diff] [blame]	299	//===---------------------------------------------------------------------===//
				300
Eli Friedman	a2e7efa	2008-02-21 21:16:49 +0000	[diff] [blame]	301	__builtin_ffs codegen is messy.
Chris Lattner	ace2e8a	2007-08-11 18:19:07 +0000	[diff] [blame]	302
Chris Lattner	ace2e8a	2007-08-11 18:19:07 +0000	[diff] [blame]	303	int ffs_(unsigned X) { return __builtin_ffs(X); }
				304
Eli Friedman	a2e7efa	2008-02-21 21:16:49 +0000	[diff] [blame]	305	llvm produces:
				306	ffs_:
				307	movl 4(%esp), %ecx
				308	bsfl %ecx, %eax
				309	movl $32, %edx
				310	cmove %edx, %eax
				311	incl %eax
				312	xorl %edx, %edx
				313	testl %ecx, %ecx
				314	cmove %edx, %eax
Chris Lattner	ace2e8a	2007-08-11 18:19:07 +0000	[diff] [blame]	315	ret
Eli Friedman	a2e7efa	2008-02-21 21:16:49 +0000	[diff] [blame]	316
				317	vs gcc:
				318
Chris Lattner	ace2e8a	2007-08-11 18:19:07 +0000	[diff] [blame]	319	_ffs_:
				320	movl $-1, %edx
				321	bsfl 4(%esp), %eax
				322	cmove %edx, %eax
				323	addl $1, %eax
				324	ret
Evan Cheng	c21051f	2006-06-04 09:08:00 +0000	[diff] [blame]	325
Eli Friedman	a2e7efa	2008-02-21 21:16:49 +0000	[diff] [blame]	326	Another example of __builtin_ffs (use predsimplify to eliminate a select):
				327
				328	int foo (unsigned long j) {
				329	if (j)
				330	return __builtin_ffs (j) - 1;
				331	else
				332	return 0;
				333	}
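
The gcc sequence amounts to "index of lowest set bit, or -1 for zero, plus
one". A C sketch of that shape, using the GCC/Clang builtin __builtin_ctz as
a stand-in for bsfl; ffs_sketch is a made-up name.

```c
/* __builtin_ctz is undefined for 0, so the zero case is handled explicitly;
   that is the job of the cmove in the gcc output above. */
int ffs_sketch(unsigned x) {
  int index = x ? __builtin_ctz(x) : -1;  /* bsfl, then cmove of -1 */
  return index + 1;                       /* addl $1 */
}
```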
				334
Evan Cheng	c21051f	2006-06-04 09:08:00 +0000	[diff] [blame]	335	//===---------------------------------------------------------------------===//
				336
				337	It appears gcc places string data with linkonce linkage in
				338	.section __TEXT,__const_coal,coalesced instead of
				339	.section __DATA,__const_coal,coalesced.
				340	Take a look at darwin.h; there are other Darwin assembler directives that we
				341	do not make use of.
				342
				343	//===---------------------------------------------------------------------===//
				344
Chris Lattner	eb05f90	2008-02-14 06:19:02 +0000	[diff] [blame]	345	define i32 @foo(i32* %a, i32 %t) {
Nate Begeman	83a6d49	2006-08-02 05:31:20 +0000	[diff] [blame]	346	entry:
Chris Lattner	eb05f90	2008-02-14 06:19:02 +0000	[diff] [blame]	347	br label %cond_true
Nate Begeman	83a6d49	2006-08-02 05:31:20 +0000	[diff] [blame]	348
Chris Lattner	eb05f90	2008-02-14 06:19:02 +0000	[diff] [blame]	349	cond_true: ; preds = %cond_true, %entry
				350	%x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ] ; <i32> [#uses=3]
				351	%t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ] ; <i32> [#uses=1]
				352	%tmp2 = getelementptr i32* %a, i32 %x.0.0 ; <i32*> [#uses=1]
				353	%tmp3 = load i32* %tmp2 ; <i32> [#uses=1]
				354	%tmp5 = add i32 %t_addr.0.0, %x.0.0 ; <i32> [#uses=1]
				355	%tmp7 = add i32 %tmp5, %tmp3 ; <i32> [#uses=2]
				356	%tmp9 = add i32 %x.0.0, 1 ; <i32> [#uses=2]
				357	%tmp = icmp sgt i32 %tmp9, 39 ; <i1> [#uses=1]
				358	br i1 %tmp, label %bb12, label %cond_true
Nate Begeman	83a6d49	2006-08-02 05:31:20 +0000	[diff] [blame]	359
Chris Lattner	eb05f90	2008-02-14 06:19:02 +0000	[diff] [blame]	360	bb12: ; preds = %cond_true
				361	ret i32 %tmp7
Chris Lattner	8e173de	2006-06-15 21:33:31 +0000	[diff] [blame]	362	}
Nate Begeman	83a6d49	2006-08-02 05:31:20 +0000	[diff] [blame]	363	is pessimized by -loop-reduce and -indvars
Chris Lattner	8e173de	2006-06-15 21:33:31 +0000	[diff] [blame]	364
				365	//===---------------------------------------------------------------------===//
Evan Cheng	357edf8	2006-06-17 00:45:49 +0000	[diff] [blame]	366
Evan Cheng	abb4d78	2006-07-19 21:29:30 +0000	[diff] [blame]	367	u32 to float conversion improvement:
				368
				369	float uint32_2_float( unsigned u ) {
				370	float fl = (int) (u & 0xffff);
				371	float fh = (int) (u >> 16);
				372	fh *= 0x1.0p16f;
				373	return fh + fl;
				374	}
				375
				376	00000000 subl $0x04,%esp
				377	00000003 movl 0x08(%esp,1),%eax
				378	00000007 movl %eax,%ecx
				379	00000009 shrl $0x10,%ecx
				380	0000000c cvtsi2ss %ecx,%xmm0
				381	00000010 andl $0x0000ffff,%eax
				382	00000015 cvtsi2ss %eax,%xmm1
				383	00000019 mulss 0x00000078,%xmm0
				384	00000021 addss %xmm1,%xmm0
				385	00000025 movss %xmm0,(%esp,1)
				386	0000002a flds (%esp,1)
				387	0000002d addl $0x04,%esp
				388	00000030 ret
Evan Cheng	ae1d33f	2006-07-26 21:49:52 +0000	[diff] [blame]	389
				390	//===---------------------------------------------------------------------===//
				391
				392	When using the fastcc ABI, align the stack slots of arguments of type double on
				393	an 8-byte boundary to improve performance.
Chris Lattner	5c36d78	2006-08-16 02:47:44 +0000	[diff] [blame]	394
				395	//===---------------------------------------------------------------------===//
				396
				397	Codegen:
				398
Chris Lattner	2a33a3f	2006-09-11 22:57:51 +0000	[diff] [blame]	399	int f(int a, int b) {
				400	if (a == 4 || a == 6)
				401	b++;
				402	return b;
				403	}
				404
Chris Lattner	5c36d78	2006-08-16 02:47:44 +0000	[diff] [blame]	405
				406	as:
				407
				408	or eax, 2
				409	cmp eax, 6
				410	jz label
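
The or/cmp form works because 4 and 6 differ only in bit 1, so forcing that
bit on maps both to 6 and nothing else to 6. A small C sketch of the
equivalence (f_ref and f_or are illustrative names):

```c
/* Reference form from the text. */
int f_ref(int a, int b) {
  if (a == 4 || a == 6)
    b++;
  return b;
}

/* Single-compare form: (a | 2) == 6 holds only for a == 4 and a == 6. */
int f_or(int a, int b) {
  if ((a | 2) == 6)
    b++;
  return b;
}
```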
				411
Chris Lattner	0bfd7fd	2006-09-11 23:00:56 +0000	[diff] [blame]	412	//===---------------------------------------------------------------------===//
				413
Chris Lattner	d1468d3	2006-09-13 04:19:50 +0000	[diff] [blame]	414	GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
				415	simplifications for integer "x cmp y ? a : b". For example, instead of:
				416
				417	int G;
				418	void f(int X, int Y) {
				419	G = X < 0 ? 14 : 13;
				420	}
				421
				422	compiling to:
				423
				424	_f:
				425	movl $14, %eax
				426	movl $13, %ecx
				427	movl 4(%esp), %edx
				428	testl %edx, %edx
				429	cmovl %eax, %ecx
				430	movl %ecx, _G
				431	ret
				432
				433	it could be:
				434	_f:
				435	movl 4(%esp), %eax
				436	sarl $31, %eax
				437	notl %eax
				438	addl $14, %eax
				439	movl %eax, _G
				440	ret
				441
				442	etc.
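
The sarl/notl/addl version reads as follows in C. This is a sketch that
assumes >> on a negative int32_t is an arithmetic shift, as on x86 with GCC
or Clang; select_14_13 is a made-up name.

```c
#include <stdint.h>

/* X < 0 ? 14 : 13 without a compare or a conditional move. */
int32_t select_14_13(int32_t x) {
  int32_t mask = x >> 31;  /* sarl $31: -1 if x < 0, else 0 */
  mask = ~mask;            /* notl: 0 if x < 0, else -1 */
  return mask + 14;        /* addl $14: 14 if x < 0, else 13 */
}
```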
				443
Chris Lattner	2539458	2007-11-02 17:04:20 +0000	[diff] [blame]	444	Another is:
				445	int usesbb(unsigned int a, unsigned int b) {
				446	return (a < b ? -1 : 0);
				447	}
				448	to:
				449	_usesbb:
				450	movl 8(%esp), %eax
				451	cmpl %eax, 4(%esp)
				452	sbbl %eax, %eax
				453	ret
				454
				455	instead of:
				456	_usesbb:
				457	xorl %eax, %eax
				458	movl 8(%esp), %ecx
				459	cmpl %ecx, 4(%esp)
				460	movl $4294967295, %ecx
				461	cmovb %ecx, %eax
				462	ret
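
The sbb trick relies on cmpl leaving the unsigned borrow in the carry flag, so
that "sbbl %eax, %eax" computes eax - eax - CF, i.e. 0 or -1. In C terms
(usesbb_mask is an illustrative name):

```c
#include <stdint.h>

/* a < b ? -1 : 0 for unsigned operands, written as a negated borrow. */
int32_t usesbb_mask(uint32_t a, uint32_t b) {
  int32_t borrow = a < b;  /* carry flag after cmpl: 1 when a is below b */
  return -borrow;          /* sbbl %eax, %eax: 0 - CF */
}
```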
				463
Chris Lattner	d1468d3	2006-09-13 04:19:50 +0000	[diff] [blame]	464	//===---------------------------------------------------------------------===//
Chris Lattner	08c3301	2006-09-13 23:37:16 +0000	[diff] [blame]	465
Chris Lattner	15bdc96	2006-10-12 22:01:26 +0000	[diff] [blame]	466	Consider the expansion of:
				467
Chris Lattner	eb05f90	2008-02-14 06:19:02 +0000	[diff] [blame]	468	define i32 @test3(i32 %X) {
				469	%tmp1 = urem i32 %X, 255
				470	ret i32 %tmp1
Chris Lattner	15bdc96	2006-10-12 22:01:26 +0000	[diff] [blame]	471	}
				472
				473	Currently it compiles to:
				474
				475	...
				476	movl $2155905153, %ecx
				477	movl 8(%esp), %esi
				478	movl %esi, %eax
				479	mull %ecx
				480	...
				481
				482	This could be "reassociated" into:
				483
				484	movl $2155905153, %eax
				485	movl 8(%esp), %ecx
				486	mull %ecx
				487
				488	to avoid the copy. In fact, the existing two-address stuff would do this
				489	except that mul isn't a commutative 2-addr instruction. I guess this has
				490	to be done at isel time based on the #uses to mul?
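
For context, the constant 2155905153 (0x80808081) is the usual reciprocal for
dividing by 255: the high half of the product shifted right by 7 is X/255.
The listing above elides that part, so the shift count and the final
multiply/subtract below are the standard expansion rather than something
quoted from the generated code. Since the product is the same with the
operands in either order, it does not matter which one sits in %eax.

```c
#include <stdint.h>

/* X % 255 via multiply-by-reciprocal; urem255 is an illustrative name. */
uint32_t urem255(uint32_t x) {
  uint64_t product = (uint64_t)x * 2155905153u;   /* mull: EDX:EAX = x * 0x80808081 */
  uint32_t quotient = (uint32_t)(product >> 39);  /* high half, shifted right by 7 */
  return x - quotient * 255u;                     /* remainder = x - (x / 255) * 255 */
}
```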
				491
Evan Cheng	ead1b80	2006-11-28 19:59:25 +0000	[diff] [blame]	492	//===---------------------------------------------------------------------===//
				493
				494	Make sure the instruction which starts a loop does not cross a cacheline
				495	boundary. This requires knowing the exact length of each machine instruction.
				496	That is somewhat complicated, but doable. Example 256.bzip2:
				497
				498	In the new trace, the hot loop has an instruction which crosses a cacheline
				499	boundary. In addition to potential cache misses, this can't help decoding as I
				500	imagine there has to be some kind of complicated decoder reset and realignment
				501	to grab the bytes from the next cacheline.
				502
				503	532 532 0x3cfc movb (1809(%esp, %esi), %bl <<<--- spans 2 64 byte lines
Eli Friedman	91db527	2008-11-30 07:52:27 +0000	[diff] [blame]	504	942 942 0x3d03 movl %dh, (1809(%esp, %esi)
				505	937 937 0x3d0a incl %esi
				506	3 3 0x3d0b cmpb %bl, %dl
Evan Cheng	ead1b80	2006-11-28 19:59:25 +0000	[diff] [blame]	507	27 27 0x3d0d jnz 0x000062db <main+11707>
				508
				509	//===---------------------------------------------------------------------===//
				510
				511	In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.
Chris Lattner	b9853eb	2006-12-22 01:03:22 +0000	[diff] [blame]	512
				513	//===---------------------------------------------------------------------===//
				514
				515	This could be a single 16-bit load.
Chris Lattner	c9d3471	2007-01-03 19:12:31 +0000	[diff] [blame]	516
Chris Lattner	b9853eb	2006-12-22 01:03:22 +0000	[diff] [blame]	517	int f(char *p) {
Chris Lattner	c9d3471	2007-01-03 19:12:31 +0000	[diff] [blame]	518	if ((p[0] == 1) & (p[1] == 2)) return 1;
Chris Lattner	b9853eb	2006-12-22 01:03:22 +0000	[diff] [blame]	519	return 0;
				520	}
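
A sketch of what the combined load would compute, assuming a little-endian
target (true for x86) so that p[0] is the low byte. The name f_load16 is made
up. Note that the source uses & rather than &&, so both bytes are read
unconditionally and widening the load does not touch memory the original
would not.

```c
#include <stdint.h>
#include <string.h>

/* Both byte compares folded into one 16-bit load and one compare. */
int f_load16(const char *p) {
  uint16_t both;
  memcpy(&both, p, sizeof both);  /* one 16-bit load */
  return both == 0x0201;          /* low byte 1 (p[0]), high byte 2 (p[1]) */
}
```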
				521
Chris Lattner	ab4be63	2007-01-06 01:30:45 +0000	[diff] [blame]	522	//===---------------------------------------------------------------------===//
				523
				524	We should inline lrintf and probably other libc functions.
				525
				526	//===---------------------------------------------------------------------===//
Chris Lattner	7ace299	2007-01-15 06:25:39 +0000	[diff] [blame]	527
Dan Gohman	3c02c22	2010-01-04 20:55:05 +0000	[diff] [blame]	528	Use the FLAGS values from arithmetic instructions more. For example, compile:
Chris Lattner	7ace299	2007-01-15 06:25:39 +0000	[diff] [blame]	529
				530	int add_zf(int *x, int y, int a, int b) {
				531	if ((*x += y) == 0)
				532	return a;
				533	else
				534	return b;
				535	}
				536
				537	to:
				538	addl %esi, (%rdi)
				539	movl %edx, %eax
				540	cmovne %ecx, %eax
				541	ret
				542	instead of:
				543
				544	_add_zf:
				545	addl (%rdi), %esi
				546	movl %esi, (%rdi)
				547	testl %esi, %esi
				548	cmove %edx, %ecx
				549	movl %ecx, %eax
				550	ret
				551
Dan Gohman	3c02c22	2010-01-04 20:55:05 +0000	[diff] [blame]	552	As another example, compile function f2 in test/CodeGen/X86/cmp-test.ll
				553	without a test instruction.
Chris Lattner	7ace299	2007-01-15 06:25:39 +0000	[diff] [blame]	554
				555	//===---------------------------------------------------------------------===//
				556
Chris Lattner	66bb5b5	2007-01-21 07:03:37 +0000	[diff] [blame]	557	These two functions have identical effects:
				558
				559	unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
				560	unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
				561
				562	We currently compile them to:
				563
				564	_f:
				565	movl 4(%esp), %eax
				566	movl %eax, %ecx
				567	incl %ecx
				568	movl 8(%esp), %edx
				569	cmpl %edx, %ecx
				570	jne LBB1_2 #UnifiedReturnBlock
				571	LBB1_1: #cond_true
				572	addl $2, %eax
				573	ret
				574	LBB1_2: #UnifiedReturnBlock
				575	movl %ecx, %eax
				576	ret
				577	_f2:
				578	movl 4(%esp), %eax
				579	movl %eax, %ecx
				580	incl %ecx
				581	cmpl 8(%esp), %ecx
				582	sete %cl
				583	movzbl %cl, %ecx
				584	leal 1(%ecx,%eax), %eax
				585	ret
				586
				587	both of which are inferior to GCC's:
				588
				589	_f:
				590	movl 4(%esp), %edx
				591	leal 1(%edx), %eax
				592	addl $2, %edx
				593	cmpl 8(%esp), %eax
				594	cmove %edx, %eax
				595	ret
				596	_f2:
				597	movl 4(%esp), %eax
				598	addl $1, %eax
				599	xorl %edx, %edx
				600	cmpl 8(%esp), %eax
				601	sete %dl
				602	addl %edx, %eax
				603	ret
				604
				605	//===---------------------------------------------------------------------===//
				606
Chris Lattner	08ba1de	2007-02-12 20:26:34 +0000	[diff] [blame]	607	This code:
				608
				609	void test(int X) {
				610	if (X) abort();
				611	}
				612
Chris Lattner	48d3c10	2007-02-12 21:20:26 +0000	[diff] [blame]	613	is currently compiled to:
Chris Lattner	08ba1de	2007-02-12 20:26:34 +0000	[diff] [blame]	614
				615	_test:
				616	subl $12, %esp
				617	cmpl $0, 16(%esp)
Chris Lattner	48d3c10	2007-02-12 21:20:26 +0000	[diff] [blame]	618	jne LBB1_1
Chris Lattner	08ba1de	2007-02-12 20:26:34 +0000	[diff] [blame]	619	addl $12, %esp
				620	ret
Chris Lattner	48d3c10	2007-02-12 21:20:26 +0000	[diff] [blame]	621	LBB1_1:
Chris Lattner	08ba1de	2007-02-12 20:26:34 +0000	[diff] [blame]	622	call L_abort$stub
				623
				624	It would be better to produce:
				625
				626	_test:
				627	subl $12, %esp
				628	cmpl $0, 16(%esp)
				629	jne L_abort$stub
				630	addl $12, %esp
				631	ret
				632
				633	This can be applied to any no-return function call that takes no arguments etc.
Chris Lattner	48d3c10	2007-02-12 21:20:26 +0000	[diff] [blame]	634	Alternatively, the stack save/restore logic could be shrink-wrapped, producing
				635	something like this:
				636
				637	_test:
				638	cmpl $0, 4(%esp)
				639	jne LBB1_1
				640	ret
				641	LBB1_1:
				642	subl $12, %esp
				643	call L_abort$stub
				644
				645	Both are useful in different situations. Finally, it could be shrink-wrapped
				646	and tail called, like this:
				647
				648	_test:
				649	cmpl $0, 4(%esp)
				650	jne LBB1_1
				651	ret
				652	LBB1_1:
				653	pop %eax # realign stack.
				654	call L_abort$stub
				655
				656	Though this probably isn't worth it.
Chris Lattner	08ba1de	2007-02-12 20:26:34 +0000	[diff] [blame]	657
				658	//===---------------------------------------------------------------------===//
Chris Lattner	9b6f57c	2007-03-02 05:04:52 +0000	[diff] [blame]	659
Chris Lattner	0f1621b	2007-05-10 00:08:04 +0000	[diff] [blame]	660	Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
				661	a neg instead of a sub instruction. Consider:
				662
				663	int test(char X) { return 7-X; }
				664
				665	we currently produce:
				666	_test:
				667	movl $7, %eax
				668	movsbl 4(%esp), %ecx
				669	subl %ecx, %eax
				670	ret
				671
				672	We would use one fewer register if codegen'd as:
				673
				674	movsbl 4(%esp), %eax
				675	neg %eax
				676	add $7, %eax
				677	ret
				678
				679	Note that this isn't beneficial if the load can be folded into the sub. In
				680	this case, we want a sub:
				681
				682	int test(int X) { return 7-X; }
				683	_test:
				684	movl $7, %eax
				685	subl 4(%esp), %eax
				686	ret
Chris Lattner	7c162645	2007-04-14 23:06:09 +0000	[diff] [blame]	687
Evan Cheng	b5cd249	2007-07-18 08:21:49 +0000	[diff] [blame]	688	//===---------------------------------------------------------------------===//
				689
Chris Lattner	cf8ba69	2007-08-20 02:14:33 +0000	[diff] [blame]	690	Leaf functions that require one 4-byte spill slot have a prolog like this:
				691
				692	_foo:
				693	pushl %esi
				694	subl $4, %esp
				695	...
				696	and an epilog like this:
				697	addl $4, %esp
				698	popl %esi
				699	ret
				700
				701	It would be smaller, and potentially faster, to push eax on entry and to
				702	pop into a dummy register instead of using addl/subl of esp. Just don't pop
				703	into any return registers :)
				704
				705	//===---------------------------------------------------------------------===//
Chris Lattner	9e43d63	2007-08-23 15:22:07 +0000	[diff] [blame]	706
				707	The X86 backend should fold (branch (or (setcc, setcc))) into multiple
				708	branches. We generate really poor code for:
				709
				710	double testf(double a) {
				711	return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
				712	}
				713
				714	For example, the entry BB is:
				715
				716	_testf:
				717	subl $20, %esp
				718	pxor %xmm0, %xmm0
				719	movsd 24(%esp), %xmm1
				720	ucomisd %xmm0, %xmm1
				721	setnp %al
				722	sete %cl
				723	testb %cl, %al
				724	jne LBB1_5 # UnifiedReturnBlock
				725	LBB1_1: # cond_true
				726
				727
				728	it would be better to replace the last four instructions with:
				729
				730	jp LBB1_1
				731	je LBB1_5
				732	LBB1_1:
				733
				734	We also codegen the inner ?: into a diamond:
				735
				736	cvtss2sd LCPI1_0(%rip), %xmm2
				737	cvtss2sd LCPI1_1(%rip), %xmm3
				738	ucomisd %xmm1, %xmm0
				739	ja LBB1_3 # cond_true
				740	LBB1_2: # cond_true
				741	movapd %xmm3, %xmm2
				742	LBB1_3: # cond_true
				743	movapd %xmm2, %xmm0
				744	ret
				745
				746	We should sink the load into xmm3 into the LBB1_2 block. This should
				747	be pretty easy, and will nuke all the copies.
				748
				749	//===---------------------------------------------------------------------===//
Chris Lattner	bf8ae84	2007-09-10 21:43:18 +0000	[diff] [blame]	750
				751	This:
				752	#include <algorithm>
				753	inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
				754	{ return std::make_pair(a + b, a + b < a); }
				755	bool no_overflow(unsigned a, unsigned b)
				756	{ return !full_add(a, b).second; }
				757
				758	Should compile to:
				759
				760
				761	_Z11no_overflowjj:
				762	addl %edi, %esi
				763	setae %al
				764	ret
				765
Eli Friedman	a2e7efa	2008-02-21 21:16:49 +0000	[diff] [blame]	766	FIXME: That code looks wrong; bool return is normally defined as zext.
				767
Chris Lattner	bf8ae84	2007-09-10 21:43:18 +0000	[diff] [blame]	768	on x86-64, not:
				769
				770	__Z11no_overflowjj:
				771	addl %edi, %esi
				772	cmpl %edi, %esi
				773	setae %al
				774	movzbl %al, %eax
				775	ret
				776
				777
				778	//===---------------------------------------------------------------------===//
Evan Cheng	f618e7c	2007-09-10 22:16:37 +0000	[diff] [blame]	779
Bill Wendling	6aab491	2007-10-02 20:42:59 +0000	[diff] [blame]	780	The following code:
				781
Bill Wendling	8d1c8ce	2007-10-02 20:54:32 +0000	[diff] [blame]	782	bb114.preheader: ; preds = %cond_next94
				783	%tmp231232 = sext i16 %tmp62 to i32 ; <i32> [#uses=1]
				784	%tmp233 = sub i32 32, %tmp231232 ; <i32> [#uses=1]
				785	%tmp245246 = sext i16 %tmp65 to i32 ; <i32> [#uses=1]
				786	%tmp252253 = sext i16 %tmp68 to i32 ; <i32> [#uses=1]
				787	%tmp254 = sub i32 32, %tmp252253 ; <i32> [#uses=1]
				788	%tmp553554 = bitcast i16* %tmp37 to i8* ; <i8*> [#uses=2]
				789	%tmp583584 = sext i16 %tmp98 to i32 ; <i32> [#uses=1]
				790	%tmp585 = sub i32 32, %tmp583584 ; <i32> [#uses=1]
				791	%tmp614615 = sext i16 %tmp101 to i32 ; <i32> [#uses=1]
				792	%tmp621622 = sext i16 %tmp104 to i32 ; <i32> [#uses=1]
				793	%tmp623 = sub i32 32, %tmp621622 ; <i32> [#uses=1]
				794	br label %bb114
				795
				796	produces:
				797
Bill Wendling	6aab491	2007-10-02 20:42:59 +0000	[diff] [blame]	798	LBB3_5: # bb114.preheader
				799	movswl -68(%ebp), %eax
				800	movl $32, %ecx
				801	movl %ecx, -80(%ebp)
				802	subl %eax, -80(%ebp)
				803	movswl -52(%ebp), %eax
				804	movl %ecx, -84(%ebp)
				805	subl %eax, -84(%ebp)
				806	movswl -70(%ebp), %eax
				807	movl %ecx, -88(%ebp)
				808	subl %eax, -88(%ebp)
				809	movswl -50(%ebp), %eax
				810	subl %eax, %ecx
				811	movl %ecx, -76(%ebp)
				812	movswl -42(%ebp), %eax
				813	movl %eax, -92(%ebp)
				814	movswl -66(%ebp), %eax
				815	movl %eax, -96(%ebp)
				816	movw $0, -98(%ebp)
				817
Chris Lattner	67a1af9	2007-10-03 03:40:24 +0000	[diff] [blame]	818	This appears to be bad because the RA is not folding the store to the stack
				819	slot into the movl. The above instructions could be:
				820	movl $32, -80(%ebp)
				821	...
				822	movl $32, -84(%ebp)
				823	...
				824	This seems like a cross between remat and spill folding.
				825
Bill Wendling	8d1c8ce	2007-10-02 20:54:32 +0000	[diff] [blame]	826	This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't
Bill Wendling	6aab491	2007-10-02 20:42:59 +0000	[diff] [blame]	827	change, so we could simply subtract %eax from %ecx first and then use %ecx (or
				828	vice-versa).
				829
				830	//===---------------------------------------------------------------------===//
				831
Bill Wendling	7687bd0	2007-10-02 21:49:31 +0000	[diff] [blame]	832	This code:
				833
				834	%tmp659 = icmp slt i16 %tmp654, 0 ; <i1> [#uses=1]
				835	br i1 %tmp659, label %cond_true662, label %cond_next715
				836
				837	produces this:
				838
				839	testw %cx, %cx
				840	movswl %cx, %esi
				841	jns LBB4_109 # cond_next715
				842
				843	Shark tells us that using %cx in the testw instruction is sub-optimal. It
				844	suggests using the 32-bit register (which is what ICC uses).
				845
				846	//===---------------------------------------------------------------------===//
Chris Lattner	fce5cfe	2007-10-03 17:10:03 +0000	[diff] [blame]	847
Chris Lattner	87b77b9	2007-10-04 15:47:27 +0000	[diff] [blame]	848	We compile this:
				849
				850	void compare (long long foo) {
				851	if (foo < 4294967297LL)
				852	abort();
				853	}
				854
				855	to:
				856
Eli Friedman	a2e7efa	2008-02-21 21:16:49 +0000	[diff] [blame]	857	compare:
				858	subl $4, %esp
				859	cmpl $0, 8(%esp)
Chris Lattner	87b77b9	2007-10-04 15:47:27 +0000	[diff] [blame]	860	setne %al
				861	movzbw %al, %ax
Eli Friedman	a2e7efa	2008-02-21 21:16:49 +0000	[diff] [blame]	862	cmpl $1, 12(%esp)
Chris Lattner	87b77b9	2007-10-04 15:47:27 +0000	[diff] [blame]	863	setg %cl
				864	movzbw %cl, %cx
				865	cmove %ax, %cx
Eli Friedman	a2e7efa	2008-02-21 21:16:49 +0000	[diff] [blame]	866	testb $1, %cl
				867	jne .LBB1_2 # UnifiedReturnBlock
				868	.LBB1_1: # ifthen
				869	call abort
				870	.LBB1_2: # UnifiedReturnBlock
				871	addl $4, %esp
				872	ret
Chris Lattner	87b77b9	2007-10-04 15:47:27 +0000	[diff] [blame]	873
				874	(also really horrible code on ppc). This is due to the expand code for 64-bit
				875	compares. GCC produces multiple branches, which is much nicer:
				876
Eli Friedman	a2e7efa	2008-02-21 21:16:49 +0000	[diff] [blame]	877	compare:
				878	subl $12, %esp
				879	movl 20(%esp), %edx
				880	movl 16(%esp), %eax
				881	decl %edx
				882	jle .L7
				883	.L5:
				884	addl $12, %esp
				885	ret
				886	.p2align 4,,7
				887	.L7:
				888	jl .L4
Chris Lattner	87b77b9	2007-10-04 15:47:27 +0000	[diff] [blame]	889	cmpl $0, %eax
Eli Friedman	a2e7efa	2008-02-21 21:16:49 +0000	[diff] [blame]	890	.p2align 4,,8
				891	ja .L5
				892	.L4:
				893	.p2align 4,,9
				894	call abort
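The multi-branch shape of the GCC output corresponds to deciding the 64-bit compare on its 32-bit halves. A minimal C sketch of that decomposition for this particular constant (4294967297 = 0x100000001) follows; the function name is hypothetical and the code assumes the usual arithmetic right shift for negative values.

```c
#include <stdbool.h>
#include <stdint.h>

/* foo < 0x100000001, decided on 32-bit halves: compare the signed high
   words first, and look at the unsigned low words only on a tie.
   Assumes >> on a negative int64_t is an arithmetic shift. */
bool less_than_2_32_plus_1(int64_t foo) {
    int32_t  hi = (int32_t)(foo >> 32);  /* signed high half */
    uint32_t lo = (uint32_t)foo;         /* unsigned low half */
    if (hi < 1) return true;             /* high word already smaller */
    if (hi > 1) return false;            /* high word already larger */
    return lo < 1u;                      /* high words equal: unsigned compare */
}
```

Each early return maps to one conditional branch, so no setcc/cmov chain is needed.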
Chris Lattner	87b77b9	2007-10-04 15:47:27 +0000	[diff] [blame]	895
				896	//===---------------------------------------------------------------------===//
Arnold Schwaighofer	c85e171	2007-10-11 19:40:01 +0000	[diff] [blame]	897
Tail call optimization improvements: Tail call optimization currently
pushes onto the top of the stack (their normal place for non-tail-call
optimized calls) all arguments that are sourced from the caller's arguments
or from a virtual register (which may itself be sourced from the caller's
arguments).
This is done to prevent overwriting parameters that might be used later
(see the example below).
Arnold Schwaighofer	48abc5c	2007-10-12 21:30:57 +0000	[diff] [blame]	905
				906	example:
Arnold Schwaighofer	c85e171	2007-10-11 19:40:01 +0000	[diff] [blame]	907
				908	int callee(int32, int64);
				909	int caller(int32 arg1, int32 arg2) {
				910	int64 local = arg2 * 2;
				911	return callee(arg2, (int64)local);
				912	}
				913
 [arg1]          [!arg2 no longer valid since we moved local onto it]
 [arg2]      ->  [(int64)
 [RETADDR]        local  ]
				917
Arnold Schwaighofer	48abc5c	2007-10-12 21:30:57 +0000	[diff] [blame]	918	Moving arg1 onto the stack slot of callee function would overwrite
Arnold Schwaighofer	c85e171	2007-10-11 19:40:01 +0000	[diff] [blame]	919	arg2 of the caller.
				920
				921	Possible optimizations:
				922
Arnold Schwaighofer	c85e171	2007-10-11 19:40:01 +0000	[diff] [blame]	923
- Analyse the actual parameters of the callee to see which ones would
overwrite a caller parameter that is still used by the callee, and push
only those onto the top of the stack.
Arnold Schwaighofer	c85e171	2007-10-11 19:40:01 +0000	[diff] [blame]	927
				928	int callee (int32 arg1, int32 arg2);
				929	int caller (int32 arg1, int32 arg2) {
				930	return callee(arg1,arg2);
				931	}
				932
Arnold Schwaighofer	48abc5c	2007-10-12 21:30:57 +0000	[diff] [blame]	933	Here we don't need to write any variables to the top of the stack
				934	since they don't overwrite each other.
Arnold Schwaighofer	c85e171	2007-10-11 19:40:01 +0000	[diff] [blame]	935
				936	int callee (int32 arg1, int32 arg2);
				937	int caller (int32 arg1, int32 arg2) {
				938	return callee(arg2,arg1);
				939	}
				940
Arnold Schwaighofer	48abc5c	2007-10-12 21:30:57 +0000	[diff] [blame]	941	Here we need to push the arguments because they overwrite each
				942	other.
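The analysis suggested above can be sketched as a small conflict check. This is a hypothetical model, not the backend's data structures: it assumes equally sized argument slots, that outgoing argument i is written to slot i, and that src[i] names the incoming slot it is read from (or -1 if it does not come from an incoming argument). The function name is made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if outgoing argument i has to be staged on the top of the
   stack first. Writing slot i destroys incoming argument i, which matters
   only if some other outgoing argument still has to be read from slot i.
   Conservative: the order of the moves is not taken into account. */
bool must_stage_on_stack_top(const int *src, size_t n, size_t i) {
    for (size_t j = 0; j < n; ++j) {
        if (j != i && src[j] == (int)i)
            return true;
    }
    return false;
}
```

With src = {0, 1} (callee(arg1, arg2)) nothing needs staging; with src = {1, 0} (callee(arg2, arg1)) both arguments do, matching the two examples.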
Arnold Schwaighofer	c85e171	2007-10-11 19:40:01 +0000	[diff] [blame]	943
Arnold Schwaighofer	c85e171	2007-10-11 19:40:01 +0000	[diff] [blame]	944	//===---------------------------------------------------------------------===//
Evan Cheng	402b678	2007-10-28 04:01:09 +0000	[diff] [blame]	945
				946	main ()
				947	{
				948	int i = 0;
				949	unsigned long int z = 0;
				950
				951	do {
				952	z -= 0x00004000;
				953	i++;
				954	if (i > 0x00040000)
				955	abort ();
				956	} while (z > 0);
				957	exit (0);
				958	}
				959
				960	gcc compiles this to:
				961
				962	_main:
				963	subl $28, %esp
				964	xorl %eax, %eax
				965	jmp L2
				966	L3:
				967	cmpl $262144, %eax
				968	je L10
				969	L2:
				970	addl $1, %eax
				971	cmpl $262145, %eax
				972	jne L3
				973	call L_abort$stub
				974	L10:
				975	movl $0, (%esp)
				976	call L_exit$stub
				977
				978	llvm:
				979
				980	_main:
				981	subl $12, %esp
				982	movl $1, %eax
				983	movl $16384, %ecx
				984	LBB1_1: # bb
				985	cmpl $262145, %eax
				986	jge LBB1_4 # cond_true
				987	LBB1_2: # cond_next
				988	incl %eax
				989	addl $4294950912, %ecx
				990	cmpl $16384, %ecx
				991	jne LBB1_1 # bb
				992	LBB1_3: # bb11
				993	xorl %eax, %eax
				994	addl $12, %esp
				995	ret
				996	LBB1_4: # cond_true
				997	call L_abort$stub
				998
				999	1. LSR should rewrite the first cmp with induction variable %ecx.
				1000	2. DAG combiner should fold
				1001	leal 1(%eax), %edx
				1002	cmpl $262145, %edx
				1003	=>
				1004	cmpl $262144, %eax
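The fold in item 2 rests on a simple identity for equality comparisons: (x + 1) == C holds exactly when x == C - 1, modulo 2^32. A minimal C sketch with illustrative function names (ordered comparisons would additionally need a no-overflow argument, which is not shown here):

```c
#include <stdbool.h>
#include <stdint.h>

/* leal 1(%eax), %edx ; cmpl $262145, %edx */
bool cmp_after_inc(uint32_t x) { return x + 1u == 262145u; }

/* cmpl $262144, %eax */
bool cmp_folded(uint32_t x) { return x == 262144u; }
```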
				1005
				1006	//===---------------------------------------------------------------------===//
Chris Lattner	9461316	2007-11-24 06:13:33 +0000	[diff] [blame]	1007
				1008	define i64 @test(double %X) {
				1009	%Y = fptosi double %X to i64
				1010	ret i64 %Y
				1011	}
				1012
				1013	compiles to:
				1014
				1015	_test:
				1016	subl $20, %esp
				1017	movsd 24(%esp), %xmm0
				1018	movsd %xmm0, 8(%esp)
				1019	fldl 8(%esp)
				1020	fisttpll (%esp)
				1021	movl 4(%esp), %edx
				1022	movl (%esp), %eax
				1023	addl $20, %esp
				1024	#FP_REG_KILL
				1025	ret
				1026
				1027	This should just fldl directly from the input stack slot.
Chris Lattner	7d13015	2007-12-05 22:58:19 +0000	[diff] [blame]	1028
				1029	//===---------------------------------------------------------------------===//
				1030
				1031	This code:
int foo (int x) { return (x & 65535) | 255; }
				1033
				1034	Should compile into:
				1035
				1036	_foo:
				1037	movzwl 4(%esp), %eax
Eli Friedman	a2e7efa	2008-02-21 21:16:49 +0000	[diff] [blame]	1038	orl $255, %eax
Chris Lattner	7d13015	2007-12-05 22:58:19 +0000	[diff] [blame]	1039	ret
				1040
				1041	instead of:
				1042	_foo:
				1043	movl $255, %eax
				1044	orl 4(%esp), %eax
				1045	andl $65535, %eax
				1046	ret
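The preferred sequence works because masking with 65535 is the same as zero-extending the low 16 bits, which movzwl does as part of the load. A small C sketch of the equivalence (function names are illustrative only):

```c
#include <stdint.h>

/* As written in the source: and, then or. */
int foo_and_or(int x) { return (x & 65535) | 255; }

/* As the movzwl/orl sequence computes it: zero-extend 16 bits, then or. */
int foo_movzwl(int x) { return (int)(uint16_t)x | 255; }
```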
				1047
Chris Lattner	4185b52	2007-12-18 16:48:14 +0000	[diff] [blame]	1048	//===---------------------------------------------------------------------===//
				1049
Chris Lattner	7c1687c	2008-02-21 06:51:29 +0000	[diff] [blame]	1050	We're codegen'ing multiply of long longs inefficiently:
Chris Lattner	4185b52	2007-12-18 16:48:14 +0000	[diff] [blame]	1051
Chris Lattner	7c1687c	2008-02-21 06:51:29 +0000	[diff] [blame]	1052	unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) {
				1053	return arg1 * arg2;
				1054	}
Chris Lattner	4185b52	2007-12-18 16:48:14 +0000	[diff] [blame]	1055
We compile to (-fomit-frame-pointer):
				1057
				1058	_LLM:
				1059	pushl %esi
				1060	movl 8(%esp), %ecx
				1061	movl 16(%esp), %esi
				1062	movl %esi, %eax
				1063	mull %ecx
				1064	imull 12(%esp), %esi
				1065	addl %edx, %esi
				1066	imull 20(%esp), %ecx
				1067	movl %esi, %edx
				1068	addl %ecx, %edx
				1069	popl %esi
				1070	ret
				1071
				1072	This looks like a scheduling deficiency and lack of remat of the load from
				1073	the argument area. ICC apparently produces:
				1074
				1075	movl 8(%esp), %ecx
				1076	imull 12(%esp), %ecx
				1077	movl 16(%esp), %eax
				1078	imull 4(%esp), %eax
				1079	addl %eax, %ecx
				1080	movl 4(%esp), %eax
				1081	mull 12(%esp)
				1082	addl %ecx, %edx
Chris Lattner	4185b52	2007-12-18 16:48:14 +0000	[diff] [blame]	1083	ret
				1084
Chris Lattner	7c1687c	2008-02-21 06:51:29 +0000	[diff] [blame]	1085	Note that it remat'd loads from 4(esp) and 12(esp). See this GCC PR:
				1086	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
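Both listings implement the same decomposition of a 64x64->64 multiply into 32-bit pieces: one widening multiply of the low halves plus two truncating cross products added into the high half. A C sketch of that arithmetic (llm_split is a hypothetical name; it assumes uint32_t is unsigned int, as on x86):

```c
#include <stdint.h>

uint64_t llm_split(uint64_t arg1, uint64_t arg2) {
    uint32_t lo1 = (uint32_t)arg1, hi1 = (uint32_t)(arg1 >> 32);
    uint32_t lo2 = (uint32_t)arg2, hi2 = (uint32_t)(arg2 >> 32);
    uint32_t cross = hi1 * lo2 + hi2 * lo1;  /* two imull and one addl */
    uint64_t low   = (uint64_t)lo1 * lo2;    /* mull: full 64-bit product */
    return low + ((uint64_t)cross << 32);    /* addl into the high half */
}
```

The hi1 * hi2 term never appears because it only affects bits 64 and up.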
Chris Lattner	4185b52	2007-12-18 16:48:14 +0000	[diff] [blame]	1087
				1088	//===---------------------------------------------------------------------===//
				1089
Chris Lattner	44cb8ef	2007-12-24 19:27:46 +0000	[diff] [blame]	1090	We can fold a store into "zeroing a reg". Instead of:
				1091
				1092	xorl %eax, %eax
				1093	movl %eax, 124(%esp)
				1094
				1095	we should get:
				1096
				1097	movl $0, 124(%esp)
				1098
				1099	if the flags of the xor are dead.
				1100
Chris Lattner	047ad94	2008-01-11 18:00:13 +0000	[diff] [blame]	1101	Likewise, we isel "x<<1" into "add reg,reg". If reg is spilled, this should
				1102	be folded into: shl [mem], 1
				1103
Chris Lattner	44cb8ef	2007-12-24 19:27:46 +0000	[diff] [blame]	1104	//===---------------------------------------------------------------------===//
Chris Lattner	9bfcc62	2007-12-28 21:50:40 +0000	[diff] [blame]	1105
In SSE mode, we turn abs and neg into a load from the constant pool plus an
and (for abs) or an xor (for neg) instruction, for example:
				1108
Chris Lattner	269f059	2008-01-09 00:37:18 +0000	[diff] [blame]	1109	xorpd LCPI1_0, %xmm2
Chris Lattner	84a7c41	2008-01-07 21:59:58 +0000	[diff] [blame]	1110
				1111	However, if xmm2 gets spilled, we end up with really ugly code like this:
				1112
Chris Lattner	269f059	2008-01-09 00:37:18 +0000	[diff] [blame]	1113	movsd (%esp), %xmm0
				1114	xorpd LCPI1_0, %xmm0
				1115	movsd %xmm0, (%esp)
Chris Lattner	84a7c41	2008-01-07 21:59:58 +0000	[diff] [blame]	1116
				1117	Since we 'know' that this is a 'neg', we can actually "fold" the spill into
				1118	the neg/abs instruction, turning it into an integer operation, like this:
				1119
				1120	xorl 2147483648, [mem+4] ## 2147483648 = (1 << 31)
				1121
				1122	you could also use xorb, but xorl is less likely to lead to a partial register
Chris Lattner	269f059	2008-01-09 00:37:18 +0000	[diff] [blame]	1123	stall. Here is a contrived testcase:
				1124
				1125	double a, b, c;
				1126	void test(double *P) {
				1127	double X = *P;
				1128	a = X;
				1129	bar();
				1130	X = -X;
				1131	b = X;
				1132	bar();
				1133	c = X;
				1134	}
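The integer form of the negation can be illustrated in C by flipping the sign bit of the stored value. This is only a sketch of the bit manipulation, assuming IEEE-754 binary64 doubles; the function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Negate a double by xor-ing its sign bit as an integer, the way a folded
   "xorl $0x80000000, [mem+4]" would on a little-endian spill slot. */
double negate_via_integer_xor(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);  /* view the stored double as an integer */
    bits ^= UINT64_C(1) << 63;       /* bit 63 is bit 31 of the upper word */
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```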
Chris Lattner	84a7c41	2008-01-07 21:59:58 +0000	[diff] [blame]	1135
				1136	//===---------------------------------------------------------------------===//
Andrew Lenharth	22c5c1b	2008-02-16 01:24:58 +0000	[diff] [blame]	1137
Chris Lattner	456012c	2008-02-17 19:43:57 +0000	[diff] [blame]	1138	The generated code on x86 for checking for signed overflow on a multiply the
				1139	obvious way is much longer than it needs to be.
				1140
				1141	int x(int a, int b) {
				1142	long long prod = (long long)a*b;
return prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1);
				1144	}
				1145
				1146	See PR2053 for more details.
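For comparison, the same predicate can be expressed with the GCC/Clang builtin __builtin_mul_overflow, which states the overflow check directly. This is a sketch that assumes the builtin is available and a 32-bit int; x_obvious and x_builtin are illustrative names.

```c
/* The "obvious" check: widen, multiply, range-test. */
int x_obvious(int a, int b) {
    long long prod = (long long)a * b;
    return prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF - 1);
}

/* Same answer via the builtin, which reports whether a * b fits in an int. */
int x_builtin(int a, int b) {
    int prod;
    return __builtin_mul_overflow(a, b, &prod);
}
```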
				1147
				1148	//===---------------------------------------------------------------------===//
Chris Lattner	92b416f	2008-02-18 18:30:13 +0000	[diff] [blame]	1149
We should investigate using cdq/cltd (effect: edx = sar eax, 31)
				1151	more aggressively; it should cost the same as a move+shift on any modern
				1152	processor, but it's a lot shorter. Downside is that it puts more
				1153	pressure on register allocation because it has fixed operands.
				1154
				1155	Example:
				1156	int abs(int x) {return x < 0 ? -x : x;}
				1157
				1158	gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.:
				1159	abs:
				1160	movl 4(%esp), %eax
				1161	cltd
				1162	xorl %edx, %eax
				1163	subl %edx, %eax
				1164	ret
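The cltd/xorl/subl sequence is the classic sign-mask form of abs. A C sketch of the same computation follows; it assumes a 32-bit int and that >> on a negative int is an arithmetic shift (implementation-defined in C, true for GCC and Clang on x86). The function name is illustrative.

```c
/* Branch-free abs mirroring the listing above. The result for INT_MIN is
   not representable, as with the branching version. */
int abs_via_sign_mask(int x) {
    int mask = x >> 31;        /* cltd: 0 if x >= 0, -1 if x < 0 */
    return (x ^ mask) - mask;  /* xorl %edx, %eax ; subl %edx, %eax */
}
```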
				1165
				1166	//===---------------------------------------------------------------------===//
				1167
				1168	Consider:
Chris Lattner	92b416f	2008-02-18 18:30:13 +0000	[diff] [blame]	1169	int test(unsigned long a, unsigned long b) { return -(a < b); }
				1170
				1171	We currently compile this to:
				1172
				1173	define i32 @test(i32 %a, i32 %b) nounwind {
				1174	%tmp3 = icmp ult i32 %a, %b ; <i1> [#uses=1]
				1175	%tmp34 = zext i1 %tmp3 to i32 ; <i32> [#uses=1]
				1176	%tmp5 = sub i32 0, %tmp34 ; <i32> [#uses=1]
				1177	ret i32 %tmp5
				1178	}
				1179
				1180	and
				1181
				1182	_test:
				1183	movl 8(%esp), %eax
				1184	cmpl %eax, 4(%esp)
				1185	setb %al
				1186	movzbl %al, %eax
				1187	negl %eax
				1188	ret
				1189
				1190	Several deficiencies here. First, we should instcombine zext+neg into sext:
				1191
				1192	define i32 @test2(i32 %a, i32 %b) nounwind {
				1193	%tmp3 = icmp ult i32 %a, %b ; <i1> [#uses=1]
				1194	%tmp34 = sext i1 %tmp3 to i32 ; <i32> [#uses=1]
				1195	ret i32 %tmp34
				1196	}
				1197
				1198	However, before we can do that, we have to fix the bad codegen that we get for
				1199	sext from bool:
				1200
				1201	_test2:
				1202	movl 8(%esp), %eax
				1203	cmpl %eax, 4(%esp)
				1204	setb %al
				1205	movzbl %al, %eax
				1206	shll $31, %eax
				1207	sarl $31, %eax
				1208	ret
				1209
				1210	This code should be at least as good as the code above. Once this is fixed, we
				1211	can optimize this specific case even more to:
				1212
				1213	movl 8(%esp), %eax
				1214	xorl %ecx, %ecx
				1215	cmpl %eax, 4(%esp)
				1216	sbbl %ecx, %ecx
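The sbbl form works because the result wanted is just the borrow of a - b replicated into every bit. A C sketch of both views (function names are hypothetical; the final conversion to int32_t assumes two's complement wrapping, as on x86):

```c
#include <stdint.h>

/* setb ; movzbl ; negl: zero-extend the compare result, then negate. */
int32_t neg_of_zext(uint32_t a, uint32_t b) {
    return -(int32_t)(a < b);
}

/* sbbl %ecx, %ecx: the high word of the 64-bit difference a - b is 0 when
   there is no borrow and 0xFFFFFFFF when there is one. */
int32_t borrow_mask(uint32_t a, uint32_t b) {
    uint32_t high = (uint32_t)(((uint64_t)a - (uint64_t)b) >> 32);
    return (int32_t)high;
}
```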
				1217
				1218	//===---------------------------------------------------------------------===//
Eli Friedman	41ce5b8	2008-02-28 00:21:43 +0000	[diff] [blame]	1219
				1220	Take the following code (from
				1221	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541):
				1222
				1223	extern unsigned char first_one[65536];
				1224	int FirstOnet(unsigned long long arg1)
				1225	{
				1226	if (arg1 >> 48)
				1227	return (first_one[arg1 >> 48]);
				1228	return 0;
				1229	}
				1230
				1231
				1232	The following code is currently generated:
				1233	FirstOnet:
				1234	movl 8(%esp), %eax
				1235	cmpl $65536, %eax
				1236	movl 4(%esp), %ecx
				1237	jb .LBB1_2 # UnifiedReturnBlock
				1238	.LBB1_1: # ifthen
				1239	shrl $16, %eax
				1240	movzbl first_one(%eax), %eax
				1241	ret
				1242	.LBB1_2: # UnifiedReturnBlock
				1243	xorl %eax, %eax
				1244	ret
				1245
Eli Friedman	2ad7e43	2010-06-03 01:01:48 +0000	[diff] [blame]	1246	We could change the "movl 8(%esp), %eax" into "movzwl 10(%esp), %eax"; this
				1247	lets us change the cmpl into a testl, which is shorter, and eliminate the shift.
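The narrow load is valid because, on a little-endian target, the top 16 bits of the 64-bit argument live at byte offset 6, and 10(%esp) is 4(%esp) + 6. A C sketch of that view (little-endian layout assumed, as on x86; the function name is illustrative):

```c
#include <stdint.h>
#include <string.h>

/* arg1 >> 48 obtained as a 16-bit load from byte offset 6 of the argument,
   which is what "movzwl 10(%esp), %eax" would do. */
unsigned top16_via_narrow_load(uint64_t arg1) {
    uint16_t top;
    memcpy(&top, (const unsigned char *)&arg1 + 6, sizeof top);
    return top;
}
```

Since the loaded value is already arg1 >> 48, the zero test and the table index can both use it without a shift.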
Eli Friedman	41ce5b8	2008-02-28 00:21:43 +0000	[diff] [blame]	1248
				1249	//===---------------------------------------------------------------------===//
				1250
Chris Lattner	daf6c54	2008-02-28 04:52:59 +0000	[diff] [blame]	1251	We compile this function:
				1252
				1253	define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext %d) nounwind {
				1254	entry:
				1255	%tmp2 = icmp eq i8 %d, 0 ; <i1> [#uses=1]
				1256	br i1 %tmp2, label %bb7, label %bb
				1257
				1258	bb: ; preds = %entry
				1259	%tmp6 = add i32 %b, %a ; <i32> [#uses=1]
				1260	ret i32 %tmp6
				1261
				1262	bb7: ; preds = %entry
				1263	%tmp10 = sub i32 %a, %c ; <i32> [#uses=1]
				1264	ret i32 %tmp10
				1265	}
				1266
				1267	to:
				1268
Eli Friedman	2ad7e43	2010-06-03 01:01:48 +0000	[diff] [blame]	1269	foo: # @foo
				1270	# BB#0: # %entry
				1271	movl 4(%esp), %ecx
Chris Lattner	daf6c54	2008-02-28 04:52:59 +0000	[diff] [blame]	1272	cmpb $0, 16(%esp)
Eli Friedman	2ad7e43	2010-06-03 01:01:48 +0000	[diff] [blame]	1273	je .LBB0_2
				1274	# BB#1: # %bb
Chris Lattner	daf6c54	2008-02-28 04:52:59 +0000	[diff] [blame]	1275	movl 8(%esp), %eax
Eli Friedman	2ad7e43	2010-06-03 01:01:48 +0000	[diff] [blame]	1276	addl %ecx, %eax
Chris Lattner	daf6c54	2008-02-28 04:52:59 +0000	[diff] [blame]	1277	ret
Eli Friedman	2ad7e43	2010-06-03 01:01:48 +0000	[diff] [blame]	1278	.LBB0_2: # %bb7
				1279	movl 12(%esp), %edx
				1280	movl %ecx, %eax
				1281	subl %edx, %eax
Chris Lattner	daf6c54	2008-02-28 04:52:59 +0000	[diff] [blame]	1282	ret
				1283
Eli Friedman	2ad7e43	2010-06-03 01:01:48 +0000	[diff] [blame]	1284	There's an obviously unnecessary movl in .LBB0_2, and we could eliminate a
				1285	couple more movls by putting 4(%esp) into %eax instead of %ecx.
Chris Lattner	daf6c54	2008-02-28 04:52:59 +0000	[diff] [blame]	1286
				1287	//===---------------------------------------------------------------------===//
Evan Cheng	ddf0a06	2008-03-28 07:07:06 +0000	[diff] [blame]	1288
				1289	See rdar://4653682.
				1290
				1291	From flops:
				1292
				1293	LBB1_15: # bb310
				1294	cvtss2sd LCPI1_0, %xmm1
				1295	addsd %xmm1, %xmm0
				1296	movsd 176(%esp), %xmm2
				1297	mulsd %xmm0, %xmm2
				1298	movapd %xmm2, %xmm3
				1299	mulsd %xmm3, %xmm3
				1300	movapd %xmm3, %xmm4
				1301	mulsd LCPI1_23, %xmm4
				1302	addsd LCPI1_24, %xmm4
				1303	mulsd %xmm3, %xmm4
				1304	addsd LCPI1_25, %xmm4
				1305	mulsd %xmm3, %xmm4
				1306	addsd LCPI1_26, %xmm4
				1307	mulsd %xmm3, %xmm4
				1308	addsd LCPI1_27, %xmm4
				1309	mulsd %xmm3, %xmm4
				1310	addsd LCPI1_28, %xmm4
				1311	mulsd %xmm3, %xmm4
				1312	addsd %xmm1, %xmm4
				1313	mulsd %xmm2, %xmm4
				1314	movsd 152(%esp), %xmm1
				1315	addsd %xmm4, %xmm1
				1316	movsd %xmm1, 152(%esp)
				1317	incl %eax
				1318	cmpl %eax, %esi
				1319	jge LBB1_15 # bb310
				1320	LBB1_16: # bb358.loopexit
				1321	movsd 152(%esp), %xmm0
				1322	addsd %xmm0, %xmm0
				1323	addsd LCPI1_22, %xmm0
				1324	movsd %xmm0, 152(%esp)
				1325
Rather than spilling the result of the last addsd in the loop, we should
insert a copy to split the interval (one for the duration of the loop, one
				1328	extending to the fall through). The register pressure in the loop isn't high
				1329	enough to warrant the spill.
				1330
				1331	Also check why xmm7 is not used at all in the function.
Chris Lattner	7d717a0	2008-04-21 04:46:30 +0000	[diff] [blame]	1332
				1333	//===---------------------------------------------------------------------===//
				1334
Eli Friedman	2ad7e43	2010-06-03 01:01:48 +0000	[diff] [blame]	1335	Take the following:
Chris Lattner	7d717a0	2008-04-21 04:46:30 +0000	[diff] [blame]	1336
				1337	target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
				1338	target triple = "i386-apple-darwin8"
				1339	@in_exit.4870.b = internal global i1 false ; <i1*> [#uses=2]
				1340	define fastcc void @abort_gzip() noreturn nounwind {
				1341	entry:
				1342	%tmp.b.i = load i1* @in_exit.4870.b ; <i1> [#uses=1]
				1343	br i1 %tmp.b.i, label %bb.i, label %bb4.i
				1344	bb.i: ; preds = %entry
				1345	tail call void @exit( i32 1 ) noreturn nounwind
				1346	unreachable
				1347	bb4.i: ; preds = %entry
				1348	store i1 true, i1* @in_exit.4870.b
				1349	tail call void @exit( i32 1 ) noreturn nounwind
				1350	unreachable
				1351	}
				1352	declare void @exit(i32) noreturn nounwind
				1353
Eli Friedman	2ad7e43	2010-06-03 01:01:48 +0000	[diff] [blame]	1354	This compiles into:
				1355	_abort_gzip: ## @abort_gzip
				1356	## BB#0: ## %entry
Chris Lattner	7d717a0	2008-04-21 04:46:30 +0000	[diff] [blame]	1357	subl $12, %esp
				1358	movb _in_exit.4870.b, %al
Eli Friedman	2ad7e43	2010-06-03 01:01:48 +0000	[diff] [blame]	1359	cmpb $1, %al
				1360	jne LBB0_2
				1361
				1362	We somehow miss folding the movb into the cmpb.
Chris Lattner	7d717a0	2008-04-21 04:46:30 +0000	[diff] [blame]	1363
				1364	//===---------------------------------------------------------------------===//
Chris Lattner	88c1baa	2008-05-05 23:19:45 +0000	[diff] [blame]	1365
				1366	We compile:
				1367
				1368	int test(int x, int y) {
				1369	return x-y-1;
				1370	}
				1371
				1372	into (-m64):
				1373
				1374	_test:
				1375	decl %edi
				1376	movl %edi, %eax
				1377	subl %esi, %eax
				1378	ret
				1379
				1380	it would be better to codegen as: x+~y (notl+addl)
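The rewrite relies on the two's complement identity ~y == -y - 1, so x - y - 1 == x + ~y. A short C sketch using unsigned arithmetic to keep wraparound well defined (function names are illustrative):

```c
/* decl ; subl form */
unsigned sub_sub_one(unsigned x, unsigned y) { return x - y - 1u; }

/* notl ; addl form */
unsigned add_not(unsigned x, unsigned y) { return x + ~y; }
```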
Torok Edwin	61df3b7	2008-10-24 19:23:07 +0000	[diff] [blame]	1381
				1382	//===---------------------------------------------------------------------===//
				1383
				1384	This code:
				1385
				1386	int foo(const char *str,...)
				1387	{
				1388	__builtin_va_list a; int x;
				1389	__builtin_va_start(a,str); x = __builtin_va_arg(a,int); __builtin_va_end(a);
				1390	return x;
				1391	}
				1392
				1393	gets compiled into this on x86-64:
				1394	subq $200, %rsp
				1395	movaps %xmm7, 160(%rsp)
				1396	movaps %xmm6, 144(%rsp)
				1397	movaps %xmm5, 128(%rsp)
				1398	movaps %xmm4, 112(%rsp)
				1399	movaps %xmm3, 96(%rsp)
				1400	movaps %xmm2, 80(%rsp)
				1401	movaps %xmm1, 64(%rsp)
				1402	movaps %xmm0, 48(%rsp)
				1403	movq %r9, 40(%rsp)
				1404	movq %r8, 32(%rsp)
				1405	movq %rcx, 24(%rsp)
				1406	movq %rdx, 16(%rsp)
				1407	movq %rsi, 8(%rsp)
				1408	leaq (%rsp), %rax
				1409	movq %rax, 192(%rsp)
				1410	leaq 208(%rsp), %rax
				1411	movq %rax, 184(%rsp)
				1412	movl $48, 180(%rsp)
				1413	movl $8, 176(%rsp)
				1414	movl 176(%rsp), %eax
				1415	cmpl $47, %eax
				1416	jbe .LBB1_3 # bb
				1417	.LBB1_1: # bb3
				1418	movq 184(%rsp), %rcx
				1419	leaq 8(%rcx), %rax
				1420	movq %rax, 184(%rsp)
				1421	.LBB1_2: # bb4
				1422	movl (%rcx), %eax
				1423	addq $200, %rsp
				1424	ret
				1425	.LBB1_3: # bb
				1426	movl %eax, %ecx
				1427	addl $8, %eax
				1428	addq 192(%rsp), %rcx
				1429	movl %eax, 176(%rsp)
				1430	jmp .LBB1_2 # bb4
				1431
				1432	gcc 4.3 generates:
				1433	subq $96, %rsp
				1434	.LCFI0:
				1435	leaq 104(%rsp), %rax
				1436	movq %rsi, -80(%rsp)
				1437	movl $8, -120(%rsp)
				1438	movq %rax, -112(%rsp)
				1439	leaq -88(%rsp), %rax
				1440	movq %rax, -104(%rsp)
				1441	movl $8, %eax
				1442	cmpl $48, %eax
				1443	jb .L6
				1444	movq -112(%rsp), %rdx
				1445	movl (%rdx), %eax
				1446	addq $96, %rsp
				1447	ret
				1448	.p2align 4,,10
				1449	.p2align 3
				1450	.L6:
				1451	mov %eax, %edx
				1452	addq -104(%rsp), %rdx
				1453	addl $8, %eax
				1454	movl %eax, -120(%rsp)
				1455	movl (%rdx), %eax
				1456	addq $96, %rsp
				1457	ret
				1458
				1459	and it gets compiled into this on x86:
				1460	pushl %ebp
				1461	movl %esp, %ebp
				1462	subl $4, %esp
				1463	leal 12(%ebp), %eax
				1464	movl %eax, -4(%ebp)
				1465	leal 16(%ebp), %eax
				1466	movl %eax, -4(%ebp)
				1467	movl 12(%ebp), %eax
				1468	addl $4, %esp
				1469	popl %ebp
				1470	ret
				1471
				1472	gcc 4.3 generates:
				1473	pushl %ebp
				1474	movl %esp, %ebp
				1475	movl 12(%ebp), %eax
				1476	popl %ebp
				1477	ret
Evan Cheng	86d7733	2008-11-11 17:35:52 +0000	[diff] [blame]	1478
				1479	//===---------------------------------------------------------------------===//
				1480
				1481	Teach tblgen not to check bitconvert source type in some cases. This allows us
				1482	to consolidate the following patterns in X86InstrMMX.td:
				1483
				1484	def : Pat<(v2i32 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
				1485	(iPTR 0))))),
				1486	(v2i32 (MMX_MOVDQ2Qrr VR128:$src))>;
				1487	def : Pat<(v4i16 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
				1488	(iPTR 0))))),
				1489	(v4i16 (MMX_MOVDQ2Qrr VR128:$src))>;
				1490	def : Pat<(v8i8 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
				1491	(iPTR 0))))),
				1492	(v8i8 (MMX_MOVDQ2Qrr VR128:$src))>;
				1493
				1494	There are other cases in various td files.
Eli Friedman	91db527	2008-11-30 07:52:27 +0000	[diff] [blame]	1495
				1496	//===---------------------------------------------------------------------===//
				1497
				1498	Take something like the following on x86-32:
				1499	unsigned a(unsigned long long x, unsigned y) {return x % y;}
				1500
				1501	We currently generate a libcall, but we really shouldn't: the expansion is
				1502	shorter and likely faster than the libcall. The expected code is something
				1503	like the following:
				1504
				1505	movl 12(%ebp), %eax
				1506	movl 16(%ebp), %ecx
				1507	xorl %edx, %edx
				1508	divl %ecx
				1509	movl 8(%ebp), %eax
				1510	divl %ecx
				1511	movl %edx, %eax
				1512	ret
				1513
				1514	A similar code sequence works for division.
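The two-divl sequence is schoolbook long division in base 2^32: reduce the high word first, then divide remainder:low. A C model of it follows, as a sketch; divl_rem and urem64_32 are hypothetical names, and divl_rem models only the remainder output of the instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Models one x86 divl: divides the 64-bit value hi:lo by y and returns the
   remainder. The real instruction faults unless the quotient fits in 32
   bits, which is guaranteed when hi < y. */
static uint32_t divl_rem(uint32_t hi, uint32_t lo, uint32_t y) {
    assert(hi < y);
    return (uint32_t)((((uint64_t)hi << 32) | lo) % y);
}

uint32_t urem64_32(uint64_t x, uint32_t y) {
    uint32_t r = divl_rem(0, (uint32_t)(x >> 32), y);  /* xorl %edx, %edx ; divl */
    return divl_rem(r, (uint32_t)x, y);                /* divl ; result in %edx */
}
```

The first remainder is always smaller than y, so the second divl cannot overflow.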
				1515
				1516	//===---------------------------------------------------------------------===//
Chris Lattner	f96ca79	2008-12-06 22:49:05 +0000	[diff] [blame]	1517
These should compile to the same code, but the latter codegens to useless
				1519	instructions on X86. This may be a trivial dag combine (GCC PR7061):
				1520
				1521	struct s1 { unsigned char a, b; };
				1522	unsigned long f1(struct s1 x) {
				1523	return x.a + x.b;
				1524	}
				1525	struct s2 { unsigned a: 8, b: 8; };
				1526	unsigned long f2(struct s2 x) {
				1527	return x.a + x.b;
				1528	}
				1529
				1530	//===---------------------------------------------------------------------===//
				1531
Chris Lattner	be685cc	2009-02-08 20:44:19 +0000	[diff] [blame]	1532	We currently compile this:
				1533
				1534	define i32 @func1(i32 %v1, i32 %v2) nounwind {
				1535	entry:
				1536	%t = call {i32, i1} @llvm.sadd.with.overflow.i32(i32 %v1, i32 %v2)
				1537	%sum = extractvalue {i32, i1} %t, 0
				1538	%obit = extractvalue {i32, i1} %t, 1
				1539	br i1 %obit, label %overflow, label %normal
				1540	normal:
				1541	ret i32 %sum
				1542	overflow:
				1543	call void @llvm.trap()
				1544	unreachable
				1545	}
				1546	declare {i32, i1} @llvm.sadd.with.overflow.i32(i32, i32)
				1547	declare void @llvm.trap()
				1548
				1549	to:
				1550
				1551	_func1:
				1552	movl 4(%esp), %eax
				1553	addl 8(%esp), %eax
				1554	jo LBB1_2 ## overflow
				1555	LBB1_1: ## normal
				1556	ret
				1557	LBB1_2: ## overflow
				1558	ud2
				1559
				1560	it would be nice to produce "into" someday.
				1561
				1562	//===---------------------------------------------------------------------===//
Chris Lattner	a66878b	2009-02-17 01:16:14 +0000	[diff] [blame]	1563
				1564	This code:
				1565
				1566	void vec_mpys1(int y[], const int x[], int scaler) {
				1567	int i;
				1568	for (i = 0; i < 150; i++)
				1569	y[i] += (((long long)scaler * (long long)x[i]) >> 31);
				1570	}
				1571
				1572	Compiles to this loop with GCC 3.x:
				1573
				1574	.L5:
				1575	movl %ebx, %eax
				1576	imull (%edi,%ecx,4)
				1577	shrdl $31, %edx, %eax
				1578	addl %eax, (%esi,%ecx,4)
				1579	incl %ecx
				1580	cmpl $149, %ecx
				1581	jle .L5
				1582
				1583	llvm-gcc compiles it to the much uglier:
				1584
				1585	LBB1_1: ## bb1
				1586	movl 24(%esp), %eax
				1587	movl (%eax,%edi,4), %ebx
				1588	movl %ebx, %ebp
				1589	imull %esi, %ebp
				1590	movl %ebx, %eax
				1591	mull %ecx
				1592	addl %ebp, %edx
				1593	sarl $31, %ebx
				1594	imull %ecx, %ebx
				1595	addl %edx, %ebx
				1596	shldl $1, %eax, %ebx
				1597	movl 20(%esp), %eax
				1598	addl %ebx, (%eax,%edi,4)
				1599	incl %edi
				1600	cmpl $150, %edi
				1601	jne LBB1_1 ## bb1
				1602
Eli Friedman	1f1b0f7	2009-12-21 08:03:16 +0000	[diff] [blame]	1603	The issue is that we hoist the cast of "scaler" to long long outside of the
				1604	loop, the value comes into the loop as two values, and
				1605	RegsForValue::getCopyFromRegs doesn't know how to put an AssertSext on the
				1606	constructed BUILD_PAIR which represents the cast value.
				1607
Chris Lattner	a66878b	2009-02-17 01:16:14 +0000	[diff] [blame]	1608	//===---------------------------------------------------------------------===//
Chris Lattner	b34487d	2009-03-08 01:54:43 +0000	[diff] [blame]	1609
Dan Gohman	ad93e1e	2009-03-09 23:47:02 +0000	[diff] [blame]	1610	Test instructions can be eliminated by using EFLAGS values from arithmetic
Dan Gohman	3328add	2009-03-10 00:26:23 +0000	[diff] [blame]	1611	instructions. This is currently not done for mul, and, or, xor, neg, shl,
				1612	sra, srl, shld, shrd, atomic ops, and others. It is also currently not done
for read-modify-write instructions. It is also currently not done if the
				1614	OF or CF flags are needed.
Dan Gohman	ad93e1e	2009-03-09 23:47:02 +0000	[diff] [blame]	1615
				1616	The shift operators have the complication that when the shift count is
				1617	zero, EFLAGS is not set, so they can only subsume a test instruction if
Dan Gohman	3328add	2009-03-10 00:26:23 +0000	[diff] [blame]	1618	the shift count is known to be non-zero. Also, using the EFLAGS value
				1619	from a shift is apparently very slow on some x86 implementations.
Dan Gohman	ad93e1e	2009-03-09 23:47:02 +0000	[diff] [blame]	1620
				1621	In read-modify-write instructions, the root node in the isel match is
				1622	the store, and isel has no way for the use of the EFLAGS result of the
				1623	arithmetic to be remapped to the new node.
				1624
Add and subtract instructions set OF on signed overflow and CF on unsigned
				1626	overflow, while test instructions always clear OF and CF. In order to
				1627	replace a test with an add or subtract in a situation where OF or CF is
				1628	needed, codegen must be able to prove that the operation cannot see
				1629	signed or unsigned overflow, respectively.
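
As a rough illustration of the condition above, the following C sketch models
the CF and OF results of a 32-bit add. The function names are hypothetical and
are not part of the backend; the point is only that an add can stand in for a
test whose OF/CF are consumed when both flags are known to come out clear.

```c
#include <assert.h>
#include <stdint.h>

/* CF after a 32-bit add: set when the unsigned sum wraps around. */
int add_sets_cf(uint32_t a, uint32_t b) {
    return (uint32_t)(a + b) < a;
}

/* OF after a 32-bit add: set when both operands have the same sign and the
   wrapped result has the opposite sign. */
int add_sets_of(uint32_t a, uint32_t b) {
    uint32_t r = a + b;
    return (int)(((a ^ r) & (b ^ r)) >> 31);
}

/* Hypothetical helper: a test always clears OF and CF, so the add only
   matches it for flag consumers when it leaves both flags clear too. */
int add_can_replace_test(uint32_t a, uint32_t b) {
    return !add_sets_cf(a, b) && !add_sets_of(a, b);
}
```

For example, 0xFFFFFFFF + 1 sets CF but not OF, while 0x7FFFFFFF + 1 sets OF
but not CF; neither add could replace a test whose carry or overflow is read.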
				1630
Dan Gohman	ad93e1e	2009-03-09 23:47:02 +0000	[diff] [blame]	1631	//===---------------------------------------------------------------------===//
				1632
Chris Lattner	ff9dcee	2009-03-08 03:04:26 +0000	[diff] [blame]	1633	memcpy/memmove do not lower to SSE copies when possible. A silly example is:
				1634	define <16 x float> @foo(<16 x float> %A) nounwind {
				1635	%tmp = alloca <16 x float>, align 16
				1636	%tmp2 = alloca <16 x float>, align 16
				1637	store <16 x float> %A, <16 x float>* %tmp
				1638	%s = bitcast <16 x float>* %tmp to i8*
				1639	%s2 = bitcast <16 x float>* %tmp2 to i8*
				1640	call void @llvm.memcpy.i64(i8* %s, i8* %s2, i64 64, i32 16)
				1641	%R = load <16 x float>* %tmp2
				1642	ret <16 x float> %R
				1643	}
				1644
				1645	declare void @llvm.memcpy.i64(i8* nocapture, i8* nocapture, i64, i32) nounwind
				1646
				1647	which compiles to:
				1648
				1649	_foo:
				1650	subl $140, %esp
				1651	movaps %xmm3, 112(%esp)
				1652	movaps %xmm2, 96(%esp)
				1653	movaps %xmm1, 80(%esp)
				1654	movaps %xmm0, 64(%esp)
				1655	movl 60(%esp), %eax
				1656	movl %eax, 124(%esp)
				1657	movl 56(%esp), %eax
				1658	movl %eax, 120(%esp)
				1659	movl 52(%esp), %eax
				1660	<many many more 32-bit copies>
				1661	movaps (%esp), %xmm0
				1662	movaps 16(%esp), %xmm1
				1663	movaps 32(%esp), %xmm2
				1664	movaps 48(%esp), %xmm3
				1665	addl $140, %esp
				1666	ret
				1667
				1668	On Nehalem, it may even be cheaper to just use movups when unaligned than to
				1669	fall back to lower-granularity chunks.
				1670
				1671	//===---------------------------------------------------------------------===//
Chris Lattner	d9b7715	2009-05-25 20:28:19 +0000	[diff] [blame]	1672
				1673	Implement processor-specific optimizations for parity with GCC on these
				1674	processors. GCC does two optimizations:
				1675
1. ix86_pad_returns inserts a noop before a ret instruction if it is immediately
preceded by a conditional branch or is the target of a jump.
2. ix86_avoid_jump_misspredicts inserts noops in cases where a 16-byte block of
code contains more than 3 branches.
				1680
The first one is done for all AMDs, Core 2, and "Generic".
The second one is done for Atom, Pentium Pro, all AMDs, Pentium 4, Nocona,
Core 2, and "Generic".
				1684
				1685	//===---------------------------------------------------------------------===//
Eli Friedman	7161cb1	2009-06-11 23:07:04 +0000	[diff] [blame]	1686
				1687	Testcase:
				1688	int a(int x) { return (x & 127) > 31; }
				1689
				1690	Current output:
				1691	movl 4(%esp), %eax
				1692	andl $127, %eax
				1693	cmpl $31, %eax
				1694	seta %al
				1695	movzbl %al, %eax
				1696	ret
				1697
				1698	Ideal output:
				1699	xorl %eax, %eax
				1700	testl $96, 4(%esp)
				1701	setne %al
				1702	ret
				1703
Chris Lattner	d23fffe	2009-06-16 06:11:35 +0000	[diff] [blame]	1704	This should definitely be done in instcombine, canonicalizing the range
				1705	condition into a != condition. We get this IR:
				1706
				1707	define i32 @a(i32 %x) nounwind readnone {
				1708	entry:
				1709	%0 = and i32 %x, 127 ; <i32> [#uses=1]
				1710	%1 = icmp ugt i32 %0, 31 ; <i1> [#uses=1]
				1711	%2 = zext i1 %1 to i32 ; <i32> [#uses=1]
				1712	ret i32 %2
				1713	}
				1714
Instcombine prefers to strength reduce relational comparisons to equality
comparisons when possible; this should be another case of that. This could
be handled pretty easily in InstCombiner::visitICmpInstWithInstAndIntCst, but it
looks like InstCombiner::visitICmpInstWithInstAndIntCst should really be
redesigned to use ComputeMaskedBits and friends.
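
A small C sketch of why the rewrite is sound: once the value is masked to 7
bits, it exceeds 31 exactly when bit 5 or bit 6 is set (96 == 0x60). The
helper names below are made up for illustration.

```c
#include <assert.h>

/* Form written in the testcase: relational compare on the masked value. */
int a_range(int x) { return (x & 127) > 31; }

/* Equality form: test bits 5 and 6 directly. */
int a_bits(int x) { return (x & 96) != 0; }

/* Counts inputs in a small window where the two forms disagree. */
int range_mismatches(void) {
    int bad = 0;
    for (int x = -1024; x <= 1024; ++x)
        bad += a_range(x) != a_bits(x);
    return bad;
}
```

Only the low 7 bits of x matter, so the window above already covers every
distinct case several times over.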
				1720
//===---------------------------------------------------------------------===//

Testcase:
				1724	int x(int a) { return (a&0xf0)>>4; }
				1725
				1726	Current output:
				1727	movl 4(%esp), %eax
				1728	shrl $4, %eax
				1729	andl $15, %eax
				1730	ret
				1731
				1732	Ideal output:
				1733	movzbl 4(%esp), %eax
				1734	shrl $4, %eax
				1735	ret
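
The ideal output relies on the fact that masking with 0xf0 before the shift
gives the same result as zero-extending the low byte and then shifting. A
sketch in C, with hypothetical helper names:

```c
#include <assert.h>

/* Form in the testcase: mask the high nibble of the low byte, then shift. */
int nibble_mask_then_shift(int a) { return (a & 0xf0) >> 4; }

/* Form matching movzbl + shrl: zero-extend the low byte, then shift. */
int nibble_byte_then_shift(int a) { return (unsigned char)a >> 4; }

/* Counts inputs in a small window where the two forms disagree. */
int nibble_mismatches(void) {
    int bad = 0;
    for (int a = -1024; a <= 1024; ++a)
        bad += nibble_mask_then_shift(a) != nibble_byte_then_shift(a);
    return bad;
}
```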
				1736
				1737	//===---------------------------------------------------------------------===//
				1738
				1739	Testcase:
				1740	int x(int a) { return (a & 0x80) ? 0x100 : 0; }
Chris Lattner	b42e20b	2009-06-16 06:15:56 +0000	[diff] [blame]	1741	int y(int a) { return (a & 0x80) *2; }
Eli Friedman	7161cb1	2009-06-11 23:07:04 +0000	[diff] [blame]	1742
Chris Lattner	b42e20b	2009-06-16 06:15:56 +0000	[diff] [blame]	1743	Current:
Eli Friedman	7161cb1	2009-06-11 23:07:04 +0000	[diff] [blame]	1744	testl $128, 4(%esp)
				1745	setne %al
				1746	movzbl %al, %eax
				1747	shll $8, %eax
				1748	ret
				1749
Chris Lattner	b42e20b	2009-06-16 06:15:56 +0000	[diff] [blame]	1750	Better:
Eli Friedman	7161cb1	2009-06-11 23:07:04 +0000	[diff] [blame]	1751	movl 4(%esp), %eax
				1752	addl %eax, %eax
				1753	andl $256, %eax
				1754	ret
				1755
Chris Lattner	b42e20b	2009-06-16 06:15:56 +0000	[diff] [blame]	1756	This is another general instcombine transformation that is profitable on all
				1757	targets. In LLVM IR, these functions look like this:
				1758
				1759	define i32 @x(i32 %a) nounwind readnone {
				1760	entry:
				1761	%0 = and i32 %a, 128
				1762	%1 = icmp eq i32 %0, 0
				1763	%iftmp.0.0 = select i1 %1, i32 0, i32 256
				1764	ret i32 %iftmp.0.0
				1765	}
				1766
				1767	define i32 @y(i32 %a) nounwind readnone {
				1768	entry:
				1769	%0 = shl i32 %a, 1
				1770	%1 = and i32 %0, 256
				1771	ret i32 %1
				1772	}
				1773
				1774	Replacing an icmp+select with a shift should always be considered profitable in
				1775	instcombine.
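
The equivalence behind this transformation can be checked with a short C
sketch. The third form mirrors the shl+and IR for @y; it shifts as unsigned
only to avoid signed-shift pitfalls in C. All helper names are hypothetical.

```c
#include <assert.h>

/* icmp+select form from x(). */
int sel_form(int a) { return (a & 0x80) ? 0x100 : 0; }

/* Multiply form from y(). */
int mul_form(int a) { return (a & 0x80) * 2; }

/* Shift-then-mask form, as in the IR for @y. */
int shl_form(int a) { return (int)(((unsigned)a << 1) & 0x100u); }

/* Counts inputs in a small window where any two forms disagree. */
int select_mismatches(void) {
    int bad = 0;
    for (int a = -1024; a <= 1024; ++a)
        bad += sel_form(a) != mul_form(a) || mul_form(a) != shl_form(a);
    return bad;
}
```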
Eli Friedman	7161cb1	2009-06-11 23:07:04 +0000	[diff] [blame]	1776
				1777	//===---------------------------------------------------------------------===//
Evan Cheng	7216920	2009-07-30 08:56:19 +0000	[diff] [blame]	1778
Re-implement atomic builtins __sync_add_and_fetch() and __sync_sub_and_fetch()
properly.
				1781
When the return value is not used (i.e. we only care about the value in
memory), x86 does not have to use xadd to implement these. Instead, it can use
add, sub, inc, dec instructions with the "lock" prefix.
				1785
This is currently implemented using an instruction selection trick. The
issue is that the target-independent pattern produces one output and a chain,
and we want to map it into one that just outputs a chain. The current trick is
to select it into a MERGE_VALUES with the first definition being an
implicit_def. The proper solution is to add new ISD opcodes for the no-output
variant. The DAG combiner can then transform the node before it gets to target
node selection.
				1792
Problem #2 is that we are adding a whole bunch of x86 atomic instructions when
in fact these instructions are identical to the non-lock versions. We need a
way to add target-specific information to target nodes and have this
information carried over to machine instructions. The asm printer (or JIT) can
then use this information to add the "lock" prefix.
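
For reference, here is a minimal C sketch of the two situations, using the
GCC/Clang __sync builtins named above. The counter and the function names are
hypothetical. In the first function the result is discarded, so only the
memory update matters; in the second the new value has to come back in a
register.

```c
#include <assert.h>

int counter; /* hypothetical shared counter */

/* Result unused: a locked add or inc on memory would be sufficient. */
void bump(void) { (void)__sync_add_and_fetch(&counter, 1); }

/* Result used: the updated value must be returned to the caller. */
int bump_and_get(void) { return __sync_add_and_fetch(&counter, 1); }
```

Comparing the assembly for bump() and bump_and_get() is a quick way to see
whether the no-result case is being selected to the cheaper form.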
Bill Wendling	1ff2c48	2009-10-27 22:34:43 +0000	[diff] [blame]	1798
				1799	//===---------------------------------------------------------------------===//
Eli Friedman	11d91e8	2010-02-10 21:26:04 +0000	[diff] [blame]	1800
_Bool bar(int *x) { return *x & 1; }
				1802
				1803	define zeroext i1 @bar(i32* nocapture %x) nounwind readonly {
				1804	entry:
				1805	%tmp1 = load i32* %x ; <i32> [#uses=1]
				1806	%and = and i32 %tmp1, 1 ; <i32> [#uses=1]
				1807	%tobool = icmp ne i32 %and, 0 ; <i1> [#uses=1]
				1808	ret i1 %tobool
				1809	}
				1810
				1811	bar: # @bar
				1812	# BB#0: # %entry
				1813	movl 4(%esp), %eax
				1814	movb (%eax), %al
				1815	andb $1, %al
				1816	movzbl %al, %eax
				1817	ret
				1818
				1819	Missed optimization: should be movl+andl.
				1820
				1821	//===---------------------------------------------------------------------===//
				1822
				1823	Consider the following two functions compiled with clang:
_Bool foo(int *x) { return !(*x & 4); }
unsigned bar(int *x) { return !(*x & 4); }
				1826
				1827	foo:
				1828	movl 4(%esp), %eax
				1829	testb $4, (%eax)
				1830	sete %al
				1831	movzbl %al, %eax
				1832	ret
				1833
				1834	bar:
				1835	movl 4(%esp), %eax
				1836	movl (%eax), %eax
				1837	shrl $2, %eax
				1838	andl $1, %eax
				1839	xorl $1, %eax
				1840	ret
				1841
The second function generates more code even though the two functions are
functionally identical.
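
The two instruction sequences compute the same bit test. A sketch of both in
C, operating on an already-loaded value and using made-up names:

```c
#include <assert.h>

/* Mirrors testb + sete: 1 when bit 2 is clear. */
unsigned bit2_clear_test(int x) { return !(x & 4); }

/* Mirrors shrl + andl + xorl: shift bit 2 down, isolate it, invert it. */
unsigned bit2_clear_shift(int x) { return (((unsigned)x >> 2) & 1u) ^ 1u; }

/* Counts inputs in a small window where the two forms disagree. */
int bit2_mismatches(void) {
    int bad = 0;
    for (int x = -1024; x <= 1024; ++x)
        bad += bit2_clear_test(x) != bit2_clear_shift(x);
    return bad;
}
```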
				1844
				1845	//===---------------------------------------------------------------------===//
				1846
				1847	Take the following C code:
				1848	int x(int y) { return (y & 63) << 14; }
				1849
				1850	Code produced by gcc:
				1851	andl $63, %edi
				1852	sall $14, %edi
				1853	movl %edi, %eax
				1854	ret
				1855
				1856	Code produced by clang:
				1857	shll $14, %edi
				1858	movl %edi, %eax
				1859	andl $1032192, %eax
				1860	ret
				1861
				1862	The code produced by gcc is 3 bytes shorter. This sort of construct often
				1863	shows up with bitfields.
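
Both orderings compute the same value; the difference is only that 63 fits in
a sign-extended 8-bit immediate while 1032192 (63 << 14) needs a 32-bit one.
A C sketch of the equivalence follows. It uses unsigned instead of the int in
the testcase to keep the shifts well defined, and the names are hypothetical.

```c
#include <assert.h>

/* gcc's ordering: small mask first, then shift. */
unsigned mask_then_shift(unsigned y) { return (y & 63u) << 14; }

/* clang's ordering: shift first, then the pre-shifted mask. */
unsigned shift_then_mask(unsigned y) { return (y << 14) & 1032192u; }

/* Samples the 32-bit range with a prime stride and counts disagreements. */
int shift_mask_mismatches(void) {
    int bad = 0;
    for (unsigned long long y = 0; y <= 0xFFFFFFFFull; y += 65521)
        bad += mask_then_shift((unsigned)y) != shift_then_mask((unsigned)y);
    return bad;
}
```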
				1864
				1865	//===---------------------------------------------------------------------===//
Eli Friedman	5033f64	2010-08-29 05:07:40 +0000	[diff] [blame]	1866
				1867	Take the following C code:
				1868	int f(int a, int b) { return (unsigned char)a == (unsigned char)b; }
				1869
				1870	We generate the following IR with clang:
				1871	define i32 @f(i32 %a, i32 %b) nounwind readnone {
				1872	entry:
				1873	%tmp = xor i32 %b, %a ; <i32> [#uses=1]
				1874	%tmp6 = and i32 %tmp, 255 ; <i32> [#uses=1]
				1875	%cmp = icmp eq i32 %tmp6, 0 ; <i1> [#uses=1]
				1876	%conv5 = zext i1 %cmp to i32 ; <i32> [#uses=1]
				1877	ret i32 %conv5
				1878	}
				1879
				1880	And the following x86 code:
				1881	xorl %esi, %edi
				1882	testb $-1, %dil
				1883	sete %al
				1884	movzbl %al, %eax
				1885	ret
				1886
				1887	A cmpb instead of the xorl+testb would be one instruction shorter.
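
The xor+mask form in the IR and a direct byte compare agree on all inputs,
which is what makes a single cmpb legal here. A C sketch with hypothetical
names:

```c
#include <assert.h>

/* Form in the testcase: compare the truncated bytes. */
int eq_bytes(int a, int b) { return (unsigned char)a == (unsigned char)b; }

/* Form in the IR: xor, mask the low byte, compare with zero. */
int eq_xor_mask(int a, int b) { return ((a ^ b) & 255) == 0; }

/* Counts input pairs in a small window where the two forms disagree. */
int byte_eq_mismatches(void) {
    int bad = 0;
    for (int a = -300; a <= 300; ++a)
        for (int b = -300; b <= 300; ++b)
            bad += eq_bytes(a, b) != eq_xor_mask(a, b);
    return bad;
}
```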
				1888
				1889	//===---------------------------------------------------------------------===//
				1890
				1891	Given the following C code:
				1892	int f(int a, int b) { return (signed char)a == (signed char)b; }
				1893
				1894	We generate the following IR with clang:
				1895	define i32 @f(i32 %a, i32 %b) nounwind readnone {
				1896	entry:
				1897	%sext = shl i32 %a, 24 ; <i32> [#uses=1]
				1898	%conv1 = ashr i32 %sext, 24 ; <i32> [#uses=1]
				1899	%sext6 = shl i32 %b, 24 ; <i32> [#uses=1]
				1900	%conv4 = ashr i32 %sext6, 24 ; <i32> [#uses=1]
				1901	%cmp = icmp eq i32 %conv1, %conv4 ; <i1> [#uses=1]
				1902	%conv5 = zext i1 %cmp to i32 ; <i32> [#uses=1]
				1903	ret i32 %conv5
				1904	}
				1905
				1906	And the following x86 code:
				1907	movsbl %sil, %eax
				1908	movsbl %dil, %ecx
				1909	cmpl %eax, %ecx
				1910	sete %al
				1911	movzbl %al, %eax
				1912	ret
				1913
				1914
				1915	It should be possible to eliminate the sign extensions.
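
The extensions are redundant because two sign-extended bytes are equal exactly
when the original low bytes are equal. A C sketch of that claim, with made-up
names; it assumes the usual wrapping int-to-signed-char conversion that GCC
and Clang implement.

```c
#include <assert.h>

/* Form in the testcase: compare after sign-extending each low byte. */
int eq_sext(int a, int b) { return (signed char)a == (signed char)b; }

/* Narrow form: compare the low bytes directly, with no extension. */
int eq_low_byte(int a, int b) { return (unsigned char)a == (unsigned char)b; }

/* Counts input pairs in a small window where the two forms disagree. */
int sext_eq_mismatches(void) {
    int bad = 0;
    for (int a = -300; a <= 300; ++a)
        for (int b = -300; b <= 300; ++b)
            bad += eq_sext(a, b) != eq_low_byte(a, b);
    return bad;
}
```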
				1916
				1917	//===---------------------------------------------------------------------===//
Dan Gohman	24bde5b	2010-09-02 21:18:42 +0000	[diff] [blame]	1918
				1919	LLVM misses a load+store narrowing opportunity in this code:
				1920
				1921	%struct.bf = type { i64, i16, i16, i32 }
				1922
				1923	@bfi = external global %struct.bf* ; <%struct.bf**> [#uses=2]
				1924
				1925	define void @t1() nounwind ssp {
				1926	entry:
				1927	%0 = load %struct.bf** @bfi, align 8 ; <%struct.bf*> [#uses=1]
				1928	%1 = getelementptr %struct.bf* %0, i64 0, i32 1 ; <i16*> [#uses=1]
				1929	%2 = bitcast i16* %1 to i32* ; <i32*> [#uses=2]
				1930	%3 = load i32* %2, align 1 ; <i32> [#uses=1]
				1931	%4 = and i32 %3, -65537 ; <i32> [#uses=1]
				1932	store i32 %4, i32* %2, align 1
				1933	%5 = load %struct.bf** @bfi, align 8 ; <%struct.bf*> [#uses=1]
				1934	%6 = getelementptr %struct.bf* %5, i64 0, i32 1 ; <i16*> [#uses=1]
				1935	%7 = bitcast i16* %6 to i32* ; <i32*> [#uses=2]
				1936	%8 = load i32* %7, align 1 ; <i32> [#uses=1]
				1937	%9 = and i32 %8, -131073 ; <i32> [#uses=1]
				1938	store i32 %9, i32* %7, align 1
				1939	ret void
				1940	}
				1941
				1942	LLVM currently emits this:
				1943
				1944	movq bfi(%rip), %rax
				1945	andl $-65537, 8(%rax)
				1946	movq bfi(%rip), %rax
				1947	andl $-131073, 8(%rax)
				1948	ret
				1949
				1950	It could narrow the loads and stores to emit this:
				1951
				1952	movq bfi(%rip), %rax
				1953	andb $-2, 10(%rax)
				1954	movq bfi(%rip), %rax
				1955	andb $-3, 10(%rax)
				1956	ret
				1957
				1958	The trouble is that there is a TokenFactor between the store and the
				1959	load, making it non-trivial to determine if there's anything between
				1960	the load and the store which would prohibit narrowing.
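
To see why the narrowed form is equivalent when nothing intervenes, the
following C sketch clears one bit through a 32-bit load/and/store and through
a single byte and, then compares the resulting memory. It assumes a
little-endian layout, as on x86, and all names are hypothetical. Bit 16 of
the 32-bit value corresponds to the mask -65537, and bit 17 to -131073.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Wide form: load 32 bits at p, clear the given bit, store 32 bits back. */
void clear_bit_wide(unsigned char *p, unsigned bit) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    v &= ~(UINT32_C(1) << bit);
    memcpy(p, &v, sizeof v);
}

/* Narrow form: touch only the byte holding the bit (little-endian only). */
void clear_bit_narrow(unsigned char *p, unsigned bit) {
    p[bit / 8] &= (unsigned char)~(1u << (bit % 8));
}

/* Returns 1 when both forms leave identical memory behind. */
int narrowing_matches(unsigned bit) {
    unsigned char a[4] = {0xff, 0xff, 0xff, 0xff};
    unsigned char b[4] = {0xff, 0xff, 0xff, 0xff};
    clear_bit_wide(a, bit);
    clear_bit_narrow(b, bit);
    return memcmp(a, b, sizeof a) == 0;
}
```

With bit 16 the narrow form is an and of byte 2 with 0xFE (-2), and with bit
17 it is an and with 0xFD (-3), matching the andb operands shown above.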
				1961
				1962	//===---------------------------------------------------------------------===//