//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

We should add support for the "movbe" instruction, which does a byte-swapping
copy (3-addr bswap + memory support?)  This is available on Atom processors.
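
A minimal C testcase (a sketch, not part of the original note) that such
support could target, assuming the GCC/Clang builtin __builtin_bswap32:

/* Byte-swapping store; with movbe support this could be one instruction. */
void store_be32(unsigned *p, unsigned x) {
  *p = __builtin_bswap32(x);
}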

//===---------------------------------------------------------------------===//

CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move.  The X86
backend knows how to three-addressify this shift, but it appears the register
allocator isn't even asking it to do so in this case.  We should investigate
why this isn't happening; it could have a significant impact on other important
cases for X86 as well.

//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
  return X/Y;
}

This can be done trivially with a custom legalizer.  What about overflow
though?  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224

//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.
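
For reference, C sources for those (a sketch, not from the original note):

unsigned long long all_ones_lshr(int x) { return ~0ULL >> x; }
unsigned long long all_ones_shl(int x)  { return ~0ULL << x; }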

One better solution for 1LL << x is:
        xorl %eax, %eax
        xorl %edx, %edx
        testb $32, %cl
        sete %al
        setne %dl
        sall %cl, %eax
        sall %cl, %edx

But that requires good 8-bit subreg support.

Also, this might be better.  It's an extra shift, but it's one instruction
shorter, and doesn't stress 8-bit subreg support.
(From http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html,
but without the unnecessary and.)
        movl %ecx, %eax
        shrl $5, %eax
        movl %eax, %edx
        xorl $1, %edx
        sall %cl, %eax
        sall %cl, %edx

64-bit shifts (in general) expand to really bad code.  Instead of using
cmovs, we should expand to a conditional branch like GCC produces.

//===---------------------------------------------------------------------===//

Compile this:
_Bool f(_Bool a) { return a!=1; }

into:
        movzbl %dil, %eax
        xorl $1, %eax
        ret

(Although note that this isn't a legal way to express the code that llvm-gcc
currently generates for that function.)

//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure.  E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?
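
For illustration (a hypothetical example, not from the original note), i16
arithmetic like the following is where the question arises:

unsigned short add16(unsigned short a, unsigned short b) {
  return a + b;   /* 16-bit ops write only %ax; the upper half of %eax
                     is left unmodified, which can stall */
}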

//===---------------------------------------------------------------------===//

Leave any_extend as pseudo instruction and hint to register
allocator.  Delay codegen until post register allocation.
Note: any_extend is now turned into an INSERT_SUBREG.  We still need to teach
the coalescer how to deal with it though.

//===---------------------------------------------------------------------===//

It appears icc uses push for parameter passing.  Need to investigate.

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor.  They are slower on the P4 due to only updating some processor
flags.

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare.  The
pattern is written as (cmp reg, (load p)).  Because the compare isn't
commutative, it is not matched with the load on both sides.  The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
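
A C testcase of the sort this describes (a sketch, not from the original note):

/* The load of *p naturally lands on the LHS of the compare; folding it on
   the RHS would require swapping the operands and inverting the condition. */
int cmp_load_lhs(int *p, int x) {
  return *p < x;
}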

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setl %al
        movzbl %al, %eax
        ret

on some processors (which ones?), it is more efficient to do this:

_test:
        movl 8(%esp), %ebx
        xor %eax, %eax
        cmpl %ebx, 4(%esp)
        setl %al
        ret

Doing this correctly is tricky though, as the xor clobbers the flags.
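
For reference, the C source that produces this pattern is presumably something
like (a guess, not from the original note):

int test(int a, int b) {
  return a < b;   /* materializes the i1 compare result as an int */
}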

//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important.  e.g., for:

void setbit(int *target, int bit) {
  *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
  *target &= ~(1 << bit);
}

//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

        movl $16843009, 4(%edx)
        movl $16843009, (%edx)
        movw $257, 8(%edx)

It might be better to generate

        movl $16843009, %eax
        movl %eax, 4(%edx)
        movl %eax, (%edx)
        movw %ax, 8(%edx)

when we can spare a register.  It reduces code size.
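
The C source in question is presumably along these lines (a sketch, not from
the original note):

#include <string.h>

void fill(char *p) {
  memset(p, 1, 10);   /* 10 bytes of 0x01: two 4-byte stores + one 2-byte store */
}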

//===---------------------------------------------------------------------===//

Evaluate what the best way to codegen sdiv X, (2^C) is.  For X/8, we currently
get this:

define i32 @test1(i32 %X) {
  %Y = sdiv i32 %X, 8
  ret i32 %Y
}

_test1:
        movl 4(%esp), %eax
        movl %eax, %ecx
        sarl $31, %ecx
        shrl $29, %ecx
        addl %ecx, %eax
        sarl $3, %eax
        ret

GCC knows several different ways to codegen it, one of which is this:

_test1:
        movl 4(%esp), %eax
        cmpl $-1, %eax
        leal 7(%eax), %ecx
        cmovle %ecx, %eax
        sarl $3, %eax
        ret

which is probably slower, but it's interesting at least :)

//===---------------------------------------------------------------------===//

We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl.  We should leave these as libcalls for everything over a much lower
threshold, since libc is hand-tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc.).
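
For instance (a sketch, not from the original note), a copy like this currently
becomes inline rep/movsl rather than a call to libc's memcpy:

#include <string.h>

void big_copy(void *dst, const void *src) {
  memcpy(dst, src, 1 << 20);   /* 1MB copy; buffers assumed large enough */
}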

//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
 x * copysign(1.0, y) * copysign(1.0, z)
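
As a C testcase (a sketch, not from the original note):

#include <math.h>

double flip_sign(double x, double y, double z) {
  return x * copysign(1.0, y) * copysign(1.0, z);
}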

//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.
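
That is (a sketch, not from the original note), for code such as:

#include <math.h>

double copy_sign_mem(double x, double *y) {
  return copysign(x, *y);   /* only the sign bit of *y is needed */
}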

//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.

//===---------------------------------------------------------------------===//

Teach the coalescer to coalesce vregs of different register classes. e.g. FR32 /
FR64 to VR128.

//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))
    return 3;
  else
    return 4;
}

_test:
        movl 8(%esp), %eax
        movaps (%eax), %xmm0
        movl 4(%esp), %eax
        movaps (%eax), %xmm1
        comiss %xmm0, %xmm1
        setae %al
        movzbl %al, %ecx
        movl $3, %eax
        movl $4, %edx
        cmpl $0, %ecx
        cmove %edx, %eax
        ret

Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae.
There are a number of issues.  1) We are introducing a setcc between the
result of the intrinsic call and select.  2) The intrinsic is expected to
produce an i32 value so an any_extend (which becomes a zero_extend) is added.

We probably need some kind of target DAG combine hook to fix this.

//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case we do worse on PPC.

//===---------------------------------------------------------------------===//

For this:

int test(int a)
{
  return a * 3;
}

We currently emit:
        imull $3, 4(%esp), %eax

Perhaps this is what we really should generate.  Is imull three or four
cycles?  Note: ICC generates this:
        movl 4(%esp), %eax
        leal (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity.  The former is
more "complex" because it folds a load so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority?  We
should always try to match LEA first since the LEA matching code does some
estimate to determine whether the match is profitable.

However, if we care more about code size, then imull is better.  It's two bytes
shorter than movl + leal.

On a Pentium M, both variants have the same characteristics with regard
to throughput; however, the multiplication has a latency of four cycles, as
opposed to two cycles for the movl+lea variant.

//===---------------------------------------------------------------------===//

__builtin_ffs codegen is messy.

int ffs_(unsigned X) { return __builtin_ffs(X); }

llvm produces:
ffs_:
        movl 4(%esp), %ecx
        bsfl %ecx, %eax
        movl $32, %edx
        cmove %edx, %eax
        incl %eax
        xorl %edx, %edx
        testl %ecx, %ecx
        cmove %edx, %eax
        ret

vs gcc:

_ffs_:
        movl $-1, %edx
        bsfl 4(%esp), %eax
        cmove %edx, %eax
        addl $1, %eax
        ret

Another example of __builtin_ffs (use predsimplify to eliminate a select):

int foo (unsigned long j) {
  if (j)
    return __builtin_ffs (j) - 1;
  else
    return 0;
}

//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.

//===---------------------------------------------------------------------===//

define i32 @foo(i32* %a, i32 %t) {
entry:
  br label %cond_true

cond_true:              ; preds = %cond_true, %entry
  %x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ]        ; <i32> [#uses=3]
  %t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ]  ; <i32> [#uses=1]
  %tmp2 = getelementptr i32* %a, i32 %x.0.0     ; <i32*> [#uses=1]
  %tmp3 = load i32* %tmp2                       ; <i32> [#uses=1]
  %tmp5 = add i32 %t_addr.0.0, %x.0.0           ; <i32> [#uses=1]
  %tmp7 = add i32 %tmp5, %tmp3                  ; <i32> [#uses=2]
  %tmp9 = add i32 %x.0.0, 1                     ; <i32> [#uses=2]
  %tmp = icmp sgt i32 %tmp9, 39                 ; <i1> [#uses=1]
  br i1 %tmp, label %bb12, label %cond_true

bb12:           ; preds = %cond_true
  ret i32 %tmp7
}
is pessimized by -loop-reduce and -indvars

//===---------------------------------------------------------------------===//

u32 to float conversion improvement:

float uint32_2_float( unsigned u ) {
  float fl = (int) (u & 0xffff);
  float fh = (int) (u >> 16);
  fh *= 0x1.0p16f;
  return fh + fl;
}

00000000        subl    $0x04,%esp
00000003        movl    0x08(%esp,1),%eax
00000007        movl    %eax,%ecx
00000009        shrl    $0x10,%ecx
0000000c        cvtsi2ss        %ecx,%xmm0
00000010        andl    $0x0000ffff,%eax
00000015        cvtsi2ss        %eax,%xmm1
00000019        mulss   0x00000078,%xmm0
00000021        addss   %xmm1,%xmm0
00000025        movss   %xmm0,(%esp,1)
0000002a        flds    (%esp,1)
0000002d        addl    $0x04,%esp
00000030        ret

//===---------------------------------------------------------------------===//

When using the fastcc ABI, align the stack slot of a double argument on an
8-byte boundary to improve performance.

//===---------------------------------------------------------------------===//

Codegen:

int f(int a, int b) {
  if (a == 4 || a == 6)
    b++;
  return b;
}

as:

or eax, 2
cmp eax, 6
jz label

//===---------------------------------------------------------------------===//

GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
simplifications for integer "x cmp y ? a : b".  For example, instead of:

int G;
void f(int X, int Y) {
  G = X < 0 ? 14 : 13;
}

compiling to:

_f:
        movl $14, %eax
        movl $13, %ecx
        movl 4(%esp), %edx
        testl %edx, %edx
        cmovl %eax, %ecx
        movl %ecx, _G
        ret

it could be:
_f:
        movl 4(%esp), %eax
        sarl $31, %eax
        notl %eax
        addl $14, %eax
        movl %eax, _G
        ret

etc.

Another is:
int usesbb(unsigned int a, unsigned int b) {
  return (a < b ? -1 : 0);
}
to:
_usesbb:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        sbbl %eax, %eax
        ret

instead of:
_usesbb:
        xorl %eax, %eax
        movl 8(%esp), %ecx
        cmpl %ecx, 4(%esp)
        movl $4294967295, %ecx
        cmovb %ecx, %eax
        ret

//===---------------------------------------------------------------------===//

Consider the expansion of:

define i32 @test3(i32 %X) {
  %tmp1 = urem i32 %X, 255
  ret i32 %tmp1
}

Currently it compiles to:

...
        movl $2155905153, %ecx
        movl 8(%esp), %esi
        movl %esi, %eax
        mull %ecx
...

This could be "reassociated" into:

        movl $2155905153, %eax
        movl 8(%esp), %ecx
        mull %ecx

to avoid the copy.  In fact, the existing two-address stuff would do this
except that mul isn't a commutative 2-addr instruction.  I guess this has
to be done at isel time based on the #uses to mul?

//===---------------------------------------------------------------------===//

Make sure the instruction which starts a loop does not cross a cacheline
boundary.  This requires knowing the exact length of each machine instruction.
That is somewhat complicated, but doable.  Example 256.bzip2:

In the new trace, the hot loop has an instruction which crosses a cacheline
boundary.  In addition to potential cache misses, this can't help decoding as I
imagine there has to be some kind of complicated decoder reset and realignment
to grab the bytes from the next cacheline.

532  532 0x3cfc movb     (1809(%esp, %esi), %bl   <<<--- spans 2 64 byte lines
942  942 0x3d03 movl     %dh, (1809(%esp, %esi)
937  937 0x3d0a incl     %esi
3    3   0x3d0b cmpb     %bl, %dl
27   27  0x3d0d jnz      0x000062db <main+11707>

//===---------------------------------------------------------------------===//

In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.

//===---------------------------------------------------------------------===//

This could be a single 16-bit load.

int f(char *p) {
  if ((p[0] == 1) & (p[1] == 2)) return 1;
  return 0;
}

//===---------------------------------------------------------------------===//

We should inline lrintf and probably other libc functions.
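
For example (a sketch, not from the original note), a direct call such as:

#include <math.h>

long round_to_long(float x) {
  return lrintf(x);   /* with SSE this could be inlined to a cvtss2si */
}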

//===---------------------------------------------------------------------===//

Use the FLAGS values from arithmetic instructions more.  For example, compile:

int add_zf(int *x, int y, int a, int b) {
  if ((*x += y) == 0)
    return a;
  else
    return b;
}

to:
        addl %esi, (%rdi)
        movl %edx, %eax
        cmovne %ecx, %eax
        ret
instead of:

_add_zf:
        addl (%rdi), %esi
        movl %esi, (%rdi)
        testl %esi, %esi
        cmove %edx, %ecx
        movl %ecx, %eax
        ret

As another example, compile function f2 in test/CodeGen/X86/cmp-test.ll
without a test instruction.

//===---------------------------------------------------------------------===//

These two functions have identical effects:

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}

We currently compile them to:

_f:
        movl 4(%esp), %eax
        movl %eax, %ecx
        incl %ecx
        movl 8(%esp), %edx
        cmpl %edx, %ecx
        jne LBB1_2      # UnifiedReturnBlock
LBB1_1: # cond_true
        addl $2, %eax
        ret
LBB1_2: # UnifiedReturnBlock
        movl %ecx, %eax
        ret
_f2:
        movl 4(%esp), %eax
        movl %eax, %ecx
        incl %ecx
        cmpl 8(%esp), %ecx
        sete %cl
        movzbl %cl, %ecx
        leal 1(%ecx,%eax), %eax
        ret

both of which are inferior to GCC's:

_f:
        movl 4(%esp), %edx
        leal 1(%edx), %eax
        addl $2, %edx
        cmpl 8(%esp), %eax
        cmove %edx, %eax
        ret
_f2:
        movl 4(%esp), %eax
        addl $1, %eax
        xorl %edx, %edx
        cmpl 8(%esp), %eax
        sete %dl
        addl %edx, %eax
        ret

//===---------------------------------------------------------------------===//

This code:

void test(int X) {
  if (X) abort();
}

is currently compiled to:

_test:
        subl $12, %esp
        cmpl $0, 16(%esp)
        jne LBB1_1
        addl $12, %esp
        ret
LBB1_1:
        call L_abort$stub

It would be better to produce:

_test:
        subl $12, %esp
        cmpl $0, 16(%esp)
        jne L_abort$stub
        addl $12, %esp
        ret

This can be applied to any no-return function call that takes no arguments etc.
Alternatively, the stack save/restore logic could be shrink-wrapped, producing
something like this:

_test:
        cmpl $0, 4(%esp)
        jne LBB1_1
        ret
LBB1_1:
        subl $12, %esp
        call L_abort$stub

Both are useful in different situations.  Finally, it could be shrink-wrapped
and tail called, like this:

_test:
        cmpl $0, 4(%esp)
        jne LBB1_1
        ret
LBB1_1:
        pop %eax   # realign stack.
        call L_abort$stub

Though this probably isn't worth it.

//===---------------------------------------------------------------------===//

Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
a neg instead of a sub instruction.  Consider:

int test(char X) { return 7-X; }

we currently produce:
_test:
        movl $7, %eax
        movsbl 4(%esp), %ecx
        subl %ecx, %eax
        ret

We would use one fewer register if codegen'd as:

        movsbl 4(%esp), %eax
        neg %eax
        add $7, %eax
        ret

Note that this isn't beneficial if the load can be folded into the sub.  In
this case, we want a sub:

int test(int X) { return 7-X; }
_test:
        movl $7, %eax
        subl 4(%esp), %eax
        ret

//===---------------------------------------------------------------------===//

Leaf functions that require one 4-byte spill slot have a prolog like this:

_foo:
        pushl %esi
        subl $4, %esp
...
and an epilog like this:
        addl $4, %esp
        popl %esi
        ret

It would be smaller, and potentially faster, to push eax on entry and to
pop into a dummy register instead of using addl/subl of esp.  Just don't pop
into any return registers :)

//===---------------------------------------------------------------------===//


The X86 backend should fold (branch (or (setcc, setcc))) into multiple
branches.  We generate really poor code for:

double testf(double a) {
  return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

For example, the entry BB is:

_testf:
        subl $20, %esp
        pxor %xmm0, %xmm0
        movsd 24(%esp), %xmm1
        ucomisd %xmm0, %xmm1
        setnp %al
        sete %cl
        testb %cl, %al
        jne LBB1_5  # UnifiedReturnBlock
LBB1_1: # cond_true


it would be better to replace the last four instructions with:

        jp LBB1_1
        je LBB1_5
LBB1_1:

We also codegen the inner ?: into a diamond:

        cvtss2sd LCPI1_0(%rip), %xmm2
        cvtss2sd LCPI1_1(%rip), %xmm3
        ucomisd %xmm1, %xmm0
        ja LBB1_3 # cond_true
LBB1_2: # cond_true
        movapd %xmm3, %xmm2
LBB1_3: # cond_true
        movapd %xmm2, %xmm0
        ret

We should sink the load into xmm3 into the LBB1_2 block.  This should
be pretty easy, and will nuke all the copies.

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

        _Z11no_overflowjj:
                addl %edi, %esi
                setae %al
                ret

FIXME: That code looks wrong; bool return is normally defined as zext.

on x86-64, not:

__Z11no_overflowjj:
        addl %edi, %esi
        cmpl %edi, %esi
        setae %al
        movzbl %al, %eax
        ret

//===---------------------------------------------------------------------===//

The following code:

bb114.preheader:                ; preds = %cond_next94
  %tmp231232 = sext i16 %tmp62 to i32           ; <i32> [#uses=1]
  %tmp233 = sub i32 32, %tmp231232              ; <i32> [#uses=1]
  %tmp245246 = sext i16 %tmp65 to i32           ; <i32> [#uses=1]
  %tmp252253 = sext i16 %tmp68 to i32           ; <i32> [#uses=1]
  %tmp254 = sub i32 32, %tmp252253              ; <i32> [#uses=1]
  %tmp553554 = bitcast i16* %tmp37 to i8*       ; <i8*> [#uses=2]
  %tmp583584 = sext i16 %tmp98 to i32           ; <i32> [#uses=1]
  %tmp585 = sub i32 32, %tmp583584              ; <i32> [#uses=1]
  %tmp614615 = sext i16 %tmp101 to i32          ; <i32> [#uses=1]
  %tmp621622 = sext i16 %tmp104 to i32          ; <i32> [#uses=1]
  %tmp623 = sub i32 32, %tmp621622              ; <i32> [#uses=1]
  br label %bb114

produces:

LBB3_5: # bb114.preheader
        movswl -68(%ebp), %eax
        movl $32, %ecx
        movl %ecx, -80(%ebp)
        subl %eax, -80(%ebp)
        movswl -52(%ebp), %eax
        movl %ecx, -84(%ebp)
        subl %eax, -84(%ebp)
        movswl -70(%ebp), %eax
        movl %ecx, -88(%ebp)
        subl %eax, -88(%ebp)
        movswl -50(%ebp), %eax
        subl %eax, %ecx
        movl %ecx, -76(%ebp)
        movswl -42(%ebp), %eax
        movl %eax, -92(%ebp)
        movswl -66(%ebp), %eax
        movl %eax, -96(%ebp)
        movw $0, -98(%ebp)

This appears to be bad because the RA is not folding the store to the stack
slot into the movl.  The above instructions could be:
        movl $32, -80(%ebp)
...
        movl $32, -84(%ebp)
...
This seems like a cross between remat and spill folding.

This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't
change, so we could simply subtract %eax from %ecx first and then use %ecx (or
vice-versa).

//===---------------------------------------------------------------------===//

This code:

        %tmp659 = icmp slt i16 %tmp654, 0       ; <i1> [#uses=1]
        br i1 %tmp659, label %cond_true662, label %cond_next715

produces this:

        testw %cx, %cx
        movswl %cx, %esi
        jns LBB4_109    # cond_next715

Shark tells us that using %cx in the testw instruction is sub-optimal. It
suggests using the 32-bit register (which is what ICC uses).

//===---------------------------------------------------------------------===//

We compile this:

void compare (long long foo) {
  if (foo < 4294967297LL)
    abort();
}

to:

compare:
        subl $4, %esp
        cmpl $0, 8(%esp)
        setne %al
        movzbw %al, %ax
        cmpl $1, 12(%esp)
        setg %cl
        movzbw %cl, %cx
        cmove %ax, %cx
        testb $1, %cl
        jne .LBB1_2 # UnifiedReturnBlock
.LBB1_1: # ifthen
        call abort
.LBB1_2: # UnifiedReturnBlock
        addl $4, %esp
        ret

(also really horrible code on ppc).  This is due to the expand code for 64-bit
compares.  GCC produces multiple branches, which is much nicer:

compare:
        subl $12, %esp
        movl 20(%esp), %edx
        movl 16(%esp), %eax
        decl %edx
        jle .L7
.L5:
        addl $12, %esp
        ret
        .p2align 4,,7
.L7:
        jl .L4
        cmpl $0, %eax
        .p2align 4,,8
        ja .L5
.L4:
        .p2align 4,,9
        call abort

//===---------------------------------------------------------------------===//

Tail call optimization improvements: Tail call optimization currently
pushes all arguments onto the top of the stack (their normal place for
non-tail-call-optimized calls) if they source from the caller's arguments
or from a virtual register (which may itself source from the caller's
arguments).  This is done to prevent overwriting parameters (see the
example below) that might be used later.

example:

int callee(int32, int64);
int caller(int32 arg1, int32 arg2) {
  int64 local = arg2 * 2;
  return callee(arg2, (int64)local);
}

[arg1]      [!arg2 no longer valid since we moved local onto it]
[arg2]  ->  [(int64)
[RETADDR]    local  ]

Moving arg1 onto the stack slot of the callee function would overwrite
arg2 of the caller.

Possible optimizations:

 - Analyse the actual parameters of the callee to see which would
   overwrite a caller parameter which is used by the callee, and push
   only those onto the top of the stack.

   int callee (int32 arg1, int32 arg2);
   int caller (int32 arg1, int32 arg2) {
       return callee(arg1,arg2);
   }

   Here we don't need to write any variables to the top of the stack
   since they don't overwrite each other.

   int callee (int32 arg1, int32 arg2);
   int caller (int32 arg1, int32 arg2) {
       return callee(arg2,arg1);
   }

   Here we need to push the arguments because they overwrite each
   other.

//===---------------------------------------------------------------------===//

main ()
{
  int i = 0;
  unsigned long int z = 0;

  do {
    z -= 0x00004000;
    i++;
    if (i > 0x00040000)
      abort ();
  } while (z > 0);
  exit (0);
}

gcc compiles this to:

_main:
        subl $28, %esp
        xorl %eax, %eax
        jmp L2
L3:
        cmpl $262144, %eax
        je L10
L2:
        addl $1, %eax
        cmpl $262145, %eax
        jne L3
        call L_abort$stub
L10:
        movl $0, (%esp)
        call L_exit$stub

llvm:

_main:
        subl $12, %esp
        movl $1, %eax
        movl $16384, %ecx
LBB1_1: # bb
        cmpl $262145, %eax
        jge LBB1_4      # cond_true
LBB1_2: # cond_next
        incl %eax
        addl $4294950912, %ecx
        cmpl $16384, %ecx
        jne LBB1_1      # bb
LBB1_3: # bb11
        xorl %eax, %eax
        addl $12, %esp
        ret
LBB1_4: # cond_true
        call L_abort$stub

1. LSR should rewrite the first cmp with induction variable %ecx.
2. DAG combiner should fold
        leal 1(%eax), %edx
        cmpl $262145, %edx
   =>
        cmpl $262144, %eax

//===---------------------------------------------------------------------===//

define i64 @test(double %X) {
  %Y = fptosi double %X to i64
  ret i64 %Y
}

compiles to:

_test:
        subl $20, %esp
        movsd 24(%esp), %xmm0
        movsd %xmm0, 8(%esp)
        fldl 8(%esp)
        fisttpll (%esp)
        movl 4(%esp), %edx
        movl (%esp), %eax
        addl $20, %esp
        #FP_REG_KILL
        ret

This should just fldl directly from the input stack slot.

//===---------------------------------------------------------------------===//

This code:
int foo (int x) { return (x & 65535) | 255; }

Should compile into:

_foo:
        movzwl 4(%esp), %eax
        orl $255, %eax
        ret

instead of:
_foo:
        movl $255, %eax
        orl 4(%esp), %eax
        andl $65535, %eax
        ret

//===---------------------------------------------------------------------===//

We're codegen'ing multiply of long longs inefficiently:

unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) {
  return arg1 * arg2;
}

We compile to (with -fomit-frame-pointer):

_LLM:
        pushl %esi
        movl 8(%esp), %ecx
        movl 16(%esp), %esi
        movl %esi, %eax
        mull %ecx
        imull 12(%esp), %esi
        addl %edx, %esi
        imull 20(%esp), %ecx
        movl %esi, %edx
        addl %ecx, %edx
        popl %esi
        ret

This looks like a scheduling deficiency and lack of remat of the load from
the argument area.  ICC apparently produces:

        movl 8(%esp), %ecx
        imull 12(%esp), %ecx
        movl 16(%esp), %eax
        imull 4(%esp), %eax
        addl %eax, %ecx
        movl 4(%esp), %eax
        mull 12(%esp)
        addl %ecx, %edx
        ret

Note that it remat'd loads from 4(esp) and 12(esp).  See this GCC PR:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236

//===---------------------------------------------------------------------===//

We can fold a store into "zeroing a reg".  Instead of:

xorl %eax, %eax
movl %eax, 124(%esp)

we should get:

movl $0, 124(%esp)

if the flags of the xor are dead.

Likewise, we isel "x<<1" into "add reg,reg".  If reg is spilled, this should
be folded into: shl [mem], 1

//===---------------------------------------------------------------------===//

This testcase misses a read/modify/write opportunity (from PR1425):

void vertical_decompose97iH1(int *b0, int *b1, int *b2, int width){
    int i;
    for(i=0; i<width; i++)
        b1[i] += (1*(b0[i] + b2[i])+0)>>0;
}

We compile it down to:

LBB1_2: # bb
        movl (%esi,%edi,4), %ebx
        addl (%ecx,%edi,4), %ebx
        addl (%edx,%edi,4), %ebx
        movl %ebx, (%ecx,%edi,4)
        incl %edi
        cmpl %eax, %edi
        jne LBB1_2      # bb

the inner loop should add to the memory location (%ecx,%edi,4), saving
a mov.  Something like:

        movl (%esi,%edi,4), %ebx
        addl (%edx,%edi,4), %ebx
        addl %ebx, (%ecx,%edi,4)

Here is another interesting example:

void vertical_compose97iH1(int *b0, int *b1, int *b2, int width){
    int i;
    for(i=0; i<width; i++)
        b1[i] -= (1*(b0[i] + b2[i])+0)>>0;
}

We miss the r/m/w opportunity here by using 2 subs instead of an add+sub[mem]:

LBB9_2: # bb
        movl (%ecx,%edi,4), %ebx
        subl (%esi,%edi,4), %ebx
        subl (%edx,%edi,4), %ebx
        movl %ebx, (%ecx,%edi,4)
        incl %edi
        cmpl %eax, %edi
        jne LBB9_2      # bb

Additionally, LSR should rewrite the exit condition of these loops to use
a stride-4 IV, which would allow all the scales in the loop to go away.
This would result in smaller code and more efficient microops.

//===---------------------------------------------------------------------===//

In SSE mode, we turn abs and neg into a load from the constant pool plus a xor
or and instruction, for example:

        xorpd LCPI1_0, %xmm2

However, if xmm2 gets spilled, we end up with really ugly code like this:

        movsd (%esp), %xmm0
        xorpd LCPI1_0, %xmm0
        movsd %xmm0, (%esp)

Since we 'know' that this is a 'neg', we can actually "fold" the spill into
the neg/abs instruction, turning it into an *integer* operation, like this:

        xorl 2147483648, [mem+4]     ## 2147483648 = (1 << 31)

you could also use xorb, but xorl is less likely to lead to a partial register
stall.  Here is a contrived testcase:

double a, b, c;
void test(double *P) {
  double X = *P;
  a = X;
  bar();
  X = -X;
  b = X;
  bar();
  c = X;
}

//===---------------------------------------------------------------------===//

handling llvm.memory.barrier on pre SSE2 cpus

should generate:
lock ; mov %esp, %esp

//===---------------------------------------------------------------------===//

The code generated on x86 for checking for signed overflow on a multiply done
the obvious way is much longer than it needs to be.

int x(int a, int b) {
  long long prod = (long long)a*b;
  return prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1);
}

See PR2053 for more details.

//===---------------------------------------------------------------------===//

We should investigate using cdq/cltd (effect: edx = sar eax, 31)
more aggressively; it should cost the same as a move+shift on any modern
processor, but it's a lot shorter. Downside is that it puts more
pressure on register allocation because it has fixed operands.

Example:
int abs(int x) {return x < 0 ? -x : x;}

gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.:
abs:
        movl 4(%esp), %eax
        cltd
        xorl %edx, %eax
        subl %edx, %eax
        ret

//===---------------------------------------------------------------------===//

Consider:
int test(unsigned long a, unsigned long b) { return -(a < b); }

We currently compile this to:

define i32 @test(i32 %a, i32 %b) nounwind {
  %tmp3 = icmp ult i32 %a, %b           ; <i1> [#uses=1]
  %tmp34 = zext i1 %tmp3 to i32         ; <i32> [#uses=1]
  %tmp5 = sub i32 0, %tmp34             ; <i32> [#uses=1]
  ret i32 %tmp5
}

and

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setb %al
        movzbl %al, %eax
        negl %eax
        ret

Several deficiencies here.  First, we should instcombine zext+neg into sext:

define i32 @test2(i32 %a, i32 %b) nounwind {
  %tmp3 = icmp ult i32 %a, %b           ; <i1> [#uses=1]
  %tmp34 = sext i1 %tmp3 to i32         ; <i32> [#uses=1]
  ret i32 %tmp34
}

However, before we can do that, we have to fix the bad codegen that we get for
sext from bool:

_test2:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setb %al
        movzbl %al, %eax
        shll $31, %eax
        sarl $31, %eax
        ret

This code should be at least as good as the code above.  Once this is fixed, we
can optimize this specific case even more to:

        movl 8(%esp), %eax
        xorl %ecx, %ecx
        cmpl %eax, 4(%esp)
        sbbl %ecx, %ecx

//===---------------------------------------------------------------------===//

Take the following code (from
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541):

extern unsigned char first_one[65536];
int FirstOnet(unsigned long long arg1)
{
  if (arg1 >> 48)
    return (first_one[arg1 >> 48]);
  return 0;
}


The following code is currently generated:
FirstOnet:
        movl 8(%esp), %eax
        cmpl $65536, %eax
        movl 4(%esp), %ecx
        jb .LBB1_2      # UnifiedReturnBlock
.LBB1_1: # ifthen
        shrl $16, %eax
        movzbl first_one(%eax), %eax
        ret
.LBB1_2: # UnifiedReturnBlock
        xorl %eax, %eax
        ret

There are a few possible improvements here:
1. We should be able to eliminate the dead load into %ecx
2. We could change the "movl 8(%esp), %eax" into
   "movzwl 10(%esp), %eax"; this lets us change the cmpl
   into a testl, which is shorter, and eliminate the shift.

We could also in theory eliminate the branch by using a conditional
for the address of the load, but that seems unlikely to be worthwhile
in general.

//===---------------------------------------------------------------------===//

We compile this function:

define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext %d) nounwind {
entry:
  %tmp2 = icmp eq i8 %d, 0              ; <i1> [#uses=1]
  br i1 %tmp2, label %bb7, label %bb

bb:             ; preds = %entry
  %tmp6 = add i32 %b, %a                ; <i32> [#uses=1]
  ret i32 %tmp6

bb7:            ; preds = %entry
  %tmp10 = sub i32 %a, %c               ; <i32> [#uses=1]
  ret i32 %tmp10
}

to:

_foo:
        cmpb $0, 16(%esp)
        movl 12(%esp), %ecx
        movl 8(%esp), %eax
        movl 4(%esp), %edx
        je LBB1_2       # bb7
LBB1_1: # bb
        addl %edx, %eax
        ret
LBB1_2: # bb7
        movl %edx, %eax
        subl %ecx, %eax
        ret

The coalescer could coalesce "edx" with "eax" to avoid the movl in LBB1_2
if it commuted the addl in LBB1_1.

//===---------------------------------------------------------------------===//

See rdar://4653682.

From flops:

LBB1_15:        # bb310
        cvtss2sd LCPI1_0, %xmm1
        addsd %xmm1, %xmm0
        movsd 176(%esp), %xmm2
        mulsd %xmm0, %xmm2
        movapd %xmm2, %xmm3
        mulsd %xmm3, %xmm3
        movapd %xmm3, %xmm4
        mulsd LCPI1_23, %xmm4
        addsd LCPI1_24, %xmm4
        mulsd %xmm3, %xmm4
        addsd LCPI1_25, %xmm4
        mulsd %xmm3, %xmm4
        addsd LCPI1_26, %xmm4
        mulsd %xmm3, %xmm4
        addsd LCPI1_27, %xmm4
        mulsd %xmm3, %xmm4
        addsd LCPI1_28, %xmm4
        mulsd %xmm3, %xmm4
        addsd %xmm1, %xmm4
        mulsd %xmm2, %xmm4
        movsd 152(%esp), %xmm1
        addsd %xmm4, %xmm1
        movsd %xmm1, 152(%esp)
        incl %eax
        cmpl %eax, %esi
        jge LBB1_15     # bb310
LBB1_16:        # bb358.loopexit
        movsd 152(%esp), %xmm0
        addsd %xmm0, %xmm0
        addsd LCPI1_22, %xmm0
        movsd %xmm0, 152(%esp)

Rather than spilling the result of the last addsd in the loop, we should have
inserted a copy to split the interval (one for the duration of the loop, one
extending to the fall through). The register pressure in the loop isn't high
enough to warrant the spill.

Also check why xmm7 is not used at all in the function.

//===---------------------------------------------------------------------===//

Legalize loses track of the fact that bools are always zero extended when in
memory.  This causes us to compile abort_gzip (from 164.gzip) from:

target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
target triple = "i386-apple-darwin8"
@in_exit.4870.b = internal global i1 false              ; <i1*> [#uses=2]
define fastcc void @abort_gzip() noreturn nounwind {
entry:
  %tmp.b.i = load i1* @in_exit.4870.b           ; <i1> [#uses=1]
  br i1 %tmp.b.i, label %bb.i, label %bb4.i
bb.i:           ; preds = %entry
  tail call void @exit( i32 1 ) noreturn nounwind
  unreachable
bb4.i:          ; preds = %entry
  store i1 true, i1* @in_exit.4870.b
  tail call void @exit( i32 1 ) noreturn nounwind
  unreachable
}
declare void @exit(i32) noreturn nounwind

into:

_abort_gzip:
        subl $12, %esp
        movb _in_exit.4870.b, %al
        notb %al
        testb $1, %al
        jne LBB1_2      ## bb4.i
LBB1_1: ## bb.i
        ...

//===---------------------------------------------------------------------===//

We compile:

int test(int x, int y) {
  return x-y-1;
}

into (-m64):

_test:
        decl %edi
        movl %edi, %eax
        subl %esi, %eax
        ret

it would be better to codegen as: x+~y  (notl+addl)

//===---------------------------------------------------------------------===//

This code:

int foo(const char *str,...)
{
  __builtin_va_list a; int x;
  __builtin_va_start(a,str); x = __builtin_va_arg(a,int); __builtin_va_end(a);
  return x;
}

gets compiled into this on x86-64:
        subq $200, %rsp
        movaps %xmm7, 160(%rsp)
        movaps %xmm6, 144(%rsp)
        movaps %xmm5, 128(%rsp)
        movaps %xmm4, 112(%rsp)
        movaps %xmm3, 96(%rsp)
        movaps %xmm2, 80(%rsp)
        movaps %xmm1, 64(%rsp)
        movaps %xmm0, 48(%rsp)
        movq %r9, 40(%rsp)
        movq %r8, 32(%rsp)
        movq %rcx, 24(%rsp)
        movq %rdx, 16(%rsp)
        movq %rsi, 8(%rsp)
        leaq (%rsp), %rax
        movq %rax, 192(%rsp)
        leaq 208(%rsp), %rax
        movq %rax, 184(%rsp)
        movl $48, 180(%rsp)
        movl $8, 176(%rsp)
        movl 176(%rsp), %eax
        cmpl $47, %eax
        jbe .LBB1_3     # bb
.LBB1_1:        # bb3
        movq 184(%rsp), %rcx
        leaq 8(%rcx), %rax
        movq %rax, 184(%rsp)
.LBB1_2:        # bb4
        movl (%rcx), %eax
        addq $200, %rsp
        ret
.LBB1_3:        # bb
        movl %eax, %ecx
        addl $8, %eax
        addq 192(%rsp), %rcx
        movl %eax, 176(%rsp)
        jmp .LBB1_2     # bb4

gcc 4.3 generates:
        subq $96, %rsp
.LCFI0:
        leaq 104(%rsp), %rax
        movq %rsi, -80(%rsp)
        movl $8, -120(%rsp)
        movq %rax, -112(%rsp)
        leaq -88(%rsp), %rax
        movq %rax, -104(%rsp)
        movl $8, %eax
        cmpl $48, %eax
        jb .L6
        movq -112(%rsp), %rdx
        movl (%rdx), %eax
        addq $96, %rsp
        ret
        .p2align 4,,10
        .p2align 3
.L6:
        mov %eax, %edx
        addq -104(%rsp), %rdx
        addl $8, %eax
        movl %eax, -120(%rsp)
        movl (%rdx), %eax
        addq $96, %rsp
        ret

and it gets compiled into this on x86:
        pushl %ebp
        movl %esp, %ebp
        subl $4, %esp
        leal 12(%ebp), %eax
        movl %eax, -4(%ebp)
        leal 16(%ebp), %eax
        movl %eax, -4(%ebp)
        movl 12(%ebp), %eax
        addl $4, %esp
        popl %ebp
        ret

gcc 4.3 generates:
        pushl %ebp
        movl %esp, %ebp
        movl 12(%ebp), %eax
        popl %ebp
        ret

//===---------------------------------------------------------------------===//

Teach tblgen not to check bitconvert source type in some cases. This allows us
to consolidate the following patterns in X86InstrMMX.td:

def : Pat<(v2i32 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
                                                  (iPTR 0))))),
          (v2i32 (MMX_MOVDQ2Qrr VR128:$src))>;
def : Pat<(v4i16 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
                                                  (iPTR 0))))),
          (v4i16 (MMX_MOVDQ2Qrr VR128:$src))>;
def : Pat<(v8i8 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
                                                 (iPTR 0))))),
          (v8i8 (MMX_MOVDQ2Qrr VR128:$src))>;

There are other cases in various td files.

//===---------------------------------------------------------------------===//

Take something like the following on x86-32:
unsigned a(unsigned long long x, unsigned y) {return x % y;}

We currently generate a libcall, but we really shouldn't: the expansion is
shorter and likely faster than the libcall.  The expected code is something
like the following:

        movl 12(%ebp), %eax
        movl 16(%ebp), %ecx
        xorl %edx, %edx
        divl %ecx
        movl 8(%ebp), %eax
        divl %ecx
        movl %edx, %eax
        ret

A similar code sequence works for division.

//===---------------------------------------------------------------------===//

These should compile to the same code, but the latter codegens to useless
instructions on X86. This may be a trivial dag combine (GCC PR7061):

struct s1 { unsigned char a, b; };
unsigned long f1(struct s1 x) {
    return x.a + x.b;
}
struct s2 { unsigned a: 8, b: 8; };
unsigned long f2(struct s2 x) {
    return x.a + x.b;
}

//===---------------------------------------------------------------------===//

We currently compile this:

define i32 @func1(i32 %v1, i32 %v2) nounwind {
entry:
  %t = call {i32, i1} @llvm.sadd.with.overflow.i32(i32 %v1, i32 %v2)
  %sum = extractvalue {i32, i1} %t, 0
  %obit = extractvalue {i32, i1} %t, 1
  br i1 %obit, label %overflow, label %normal
normal:
  ret i32 %sum
overflow:
  call void @llvm.trap()
  unreachable
}
declare {i32, i1} @llvm.sadd.with.overflow.i32(i32, i32)
declare void @llvm.trap()

to:

_func1:
        movl 4(%esp), %eax
        addl 8(%esp), %eax
        jo LBB1_2       ## overflow
LBB1_1: ## normal
        ret
LBB1_2: ## overflow
        ud2

it would be nice to produce "into" someday.

//===---------------------------------------------------------------------===//

This code:

void vec_mpys1(int y[], const int x[], int scaler) {
  int i;
  for (i = 0; i < 150; i++)
    y[i] += (((long long)scaler * (long long)x[i]) >> 31);
}

Compiles to this loop with GCC 3.x:

.L5:
        movl %ebx, %eax
        imull (%edi,%ecx,4)
        shrdl $31, %edx, %eax
        addl %eax, (%esi,%ecx,4)
        incl %ecx
        cmpl $149, %ecx
        jle .L5

llvm-gcc compiles it to the much uglier:

LBB1_1: ## bb1
        movl 24(%esp), %eax
        movl (%eax,%edi,4), %ebx
        movl %ebx, %ebp
        imull %esi, %ebp
        movl %ebx, %eax
        mull %ecx
        addl %ebp, %edx
        sarl $31, %ebx
        imull %ecx, %ebx
        addl %edx, %ebx
        shldl $1, %eax, %ebx
        movl 20(%esp), %eax
        addl %ebx, (%eax,%edi,4)
        incl %edi
        cmpl $150, %edi
        jne LBB1_1      ## bb1

The issue is that we hoist the cast of "scaler" to long long outside of the
loop, the value comes into the loop as two values, and
RegsForValue::getCopyFromRegs doesn't know how to put an AssertSext on the
constructed BUILD_PAIR which represents the cast value.

//===---------------------------------------------------------------------===//

Test instructions can be eliminated by using EFLAGS values from arithmetic
instructions. This is currently not done for mul, and, or, xor, neg, shl,
sra, srl, shld, shrd, atomic ops, and others. It is also currently not done
for read-modify-write instructions. It is also currently not done if the
OF or CF flags are needed.

The shift operators have the complication that when the shift count is
zero, EFLAGS is not set, so they can only subsume a test instruction if
the shift count is known to be non-zero. Also, using the EFLAGS value
from a shift is apparently very slow on some x86 implementations.

In read-modify-write instructions, the root node in the isel match is
the store, and isel has no way for the use of the EFLAGS result of the
arithmetic to be remapped to the new node.

Add and subtract instructions set OF on signed overflow and CF on unsigned
overflow, while test instructions always clear OF and CF. In order to
replace a test with an add or subtract in a situation where OF or CF is
needed, codegen must be able to prove that the operation cannot see
signed or unsigned overflow, respectively.

//===---------------------------------------------------------------------===//

memcpy/memmove do not lower to SSE copies when possible. A silly example is:
define <16 x float> @foo(<16 x float> %A) nounwind {
  %tmp = alloca <16 x float>, align 16
  %tmp2 = alloca <16 x float>, align 16
  store <16 x float> %A, <16 x float>* %tmp
  %s = bitcast <16 x float>* %tmp to i8*
  %s2 = bitcast <16 x float>* %tmp2 to i8*
  call void @llvm.memcpy.i64(i8* %s, i8* %s2, i64 64, i32 16)
  %R = load <16 x float>* %tmp2
  ret <16 x float> %R
}

declare void @llvm.memcpy.i64(i8* nocapture, i8* nocapture, i64, i32) nounwind

which compiles to:

_foo:
        subl $140, %esp
        movaps %xmm3, 112(%esp)
        movaps %xmm2, 96(%esp)
        movaps %xmm1, 80(%esp)
        movaps %xmm0, 64(%esp)
        movl 60(%esp), %eax
        movl %eax, 124(%esp)
        movl 56(%esp), %eax
        movl %eax, 120(%esp)
        movl 52(%esp), %eax
        <many many more 32-bit copies>
        movaps (%esp), %xmm0
        movaps 16(%esp), %xmm1
        movaps 32(%esp), %xmm2
        movaps 48(%esp), %xmm3
        addl $140, %esp
        ret

On Nehalem, it may even be cheaper to just use movups when unaligned than to
fall back to lower-granularity chunks.

//===---------------------------------------------------------------------===//

Implement processor-specific optimizations for parity with GCC on these
processors.  GCC does two optimizations:

1. ix86_pad_returns inserts a noop before ret instructions if immediately
   preceded by a conditional branch or is the target of a jump.
2. ix86_avoid_jump_misspredicts inserts noops in cases where a 16-byte block of
   code contains more than 3 branches.

The first one is done for all AMDs, Core2, and "Generic".
The second one is done for: Atom, Pentium Pro, all AMDs, Pentium 4, Nocona,
Core 2, and "Generic".

//===---------------------------------------------------------------------===//

Testcase:
int a(int x) { return (x & 127) > 31; }

Current output:
        movl 4(%esp), %eax
        andl $127, %eax
        cmpl $31, %eax
        seta %al
        movzbl %al, %eax
        ret

Ideal output:
        xorl %eax, %eax
        testl $96, 4(%esp)
        setne %al
        ret

This should definitely be done in instcombine, canonicalizing the range
condition into a != condition.  We get this IR:

define i32 @a(i32 %x) nounwind readnone {
entry:
  %0 = and i32 %x, 127          ; <i32> [#uses=1]
  %1 = icmp ugt i32 %0, 31      ; <i1> [#uses=1]
  %2 = zext i1 %1 to i32        ; <i32> [#uses=1]
  ret i32 %2
}

Instcombine prefers to strength reduce relational comparisons to equality
comparisons when possible; this should be another case of that.  This could
be handled pretty easily in InstCombiner::visitICmpInstWithInstAndIntCst, but it
looks like InstCombiner::visitICmpInstWithInstAndIntCst should really already
be redesigned to use ComputeMaskedBits and friends.

//===---------------------------------------------------------------------===//

Testcase:
int x(int a) { return (a&0xf0)>>4; }

Current output:
        movl 4(%esp), %eax
        shrl $4, %eax
        andl $15, %eax
        ret

Ideal output:
        movzbl 4(%esp), %eax
        shrl $4, %eax
        ret

//===---------------------------------------------------------------------===//

Testcase:
int x(int a) { return (a & 0x80) ? 0x100 : 0; }
int y(int a) { return (a & 0x80) * 2; }

Current:
        testl $128, 4(%esp)
        setne %al
        movzbl %al, %eax
        shll $8, %eax
        ret

Better:
        movl 4(%esp), %eax
        addl %eax, %eax
        andl $256, %eax
        ret

This is another general instcombine transformation that is profitable on all
targets.  In LLVM IR, these functions look like this:

define i32 @x(i32 %a) nounwind readnone {
entry:
  %0 = and i32 %a, 128
  %1 = icmp eq i32 %0, 0
  %iftmp.0.0 = select i1 %1, i32 0, i32 256
  ret i32 %iftmp.0.0
}

define i32 @y(i32 %a) nounwind readnone {
entry:
  %0 = shl i32 %a, 1
  %1 = and i32 %0, 256
  ret i32 %1
}

Replacing an icmp+select with a shift should always be considered profitable in
instcombine.

//===---------------------------------------------------------------------===//

Re-implement atomic builtins __sync_add_and_fetch() and __sync_sub_and_fetch
properly.

When the return value is not used (i.e. we only care about the value in
memory), x86 does not have to use add to implement these. Instead, it can use
add, sub, inc, dec instructions with the "lock" prefix.

This is currently implemented using a bit of an instruction selection trick.
The issue is that the target independent pattern produces one output and a
chain, and we want to map it into one that just outputs a chain. The current
trick is to select it into a MERGE_VALUES with the first definition being an
implicit_def. The proper solution is to add new ISD opcodes for the no-output
variant. The DAG combiner can then transform the node before it gets to target
node selection.

Problem #2 is that we are adding a whole bunch of x86 atomic instructions when
in fact these instructions are identical to the non-lock versions. We need a
way to add target-specific information to target nodes and have this
information carried over to machine instructions. The asm printer (or JIT) can
use this information to add the "lock" prefix.
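
For example (a sketch, not from the original note), a call whose result is
ignored, which could become a single "lock incl" or "lock addl":

void atomic_bump(volatile int *p) {
  __sync_add_and_fetch(p, 1);   /* return value unused */
}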

//===---------------------------------------------------------------------===//

_Bool bar(int *x) { return *x & 1; }

define zeroext i1 @bar(i32* nocapture %x) nounwind readonly {
entry:
  %tmp1 = load i32* %x                  ; <i32> [#uses=1]
  %and = and i32 %tmp1, 1               ; <i32> [#uses=1]
  %tobool = icmp ne i32 %and, 0         ; <i1> [#uses=1]
  ret i1 %tobool
}

bar:                                    # @bar
# BB#0:                                 # %entry
        movl 4(%esp), %eax
        movb (%eax), %al
        andb $1, %al
        movzbl %al, %eax
        ret

Missed optimization: should be movl+andl.

//===---------------------------------------------------------------------===//

Consider the following two functions compiled with clang:
_Bool foo(int *x) { return !(*x & 4); }
unsigned bar(int *x) { return !(*x & 4); }

foo:
        movl 4(%esp), %eax
        testb $4, (%eax)
        sete %al
        movzbl %al, %eax
        ret

bar:
        movl 4(%esp), %eax
        movl (%eax), %eax
        shrl $2, %eax
        andl $1, %eax
        xorl $1, %eax
        ret

The second function generates more code even though the two functions are
functionally identical.

//===---------------------------------------------------------------------===//

Take the following C code:
int x(int y) { return (y & 63) << 14; }

Code produced by gcc:
        andl $63, %edi
        sall $14, %edi
        movl %edi, %eax
        ret

Code produced by clang:
        shll $14, %edi
        movl %edi, %eax
        andl $1032192, %eax
        ret

The code produced by gcc is 3 bytes shorter.  This sort of construct often
shows up with bitfields.

//===---------------------------------------------------------------------===//