//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

Add MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (combination of MUL and MULH[SU] into one node). Add this to
X86, and make the DAG combiner produce it when needed. This will eliminate one
imul from the code generated for:

long long test(long long X, long long Y) { return X*Y; }

by using the EAX result from the mul. We should add a similar node for
DIVREM.

Another case is:

long long test(int X, int Y) { return (long long)X*Y; }

... which should only be one imul instruction.

//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
        return X/Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224

//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
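
As a point of reference, here is a minimal illustration (made up, not taken
from the patch above) of the kind of decomposition involved, for a multiply by
the constant 9:

unsigned mul9(unsigned x) {
  /* x*9 == x*8 + x: one shift and one add instead of an imul */
  return (x << 3) + x;
}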

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.

//===---------------------------------------------------------------------===//

Compile this:
_Bool f(_Bool a) { return a!=1; }

into:
        movzbl  %dil, %eax
        xorl    $1, %eax
        ret

//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure. E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?
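
For illustration only (assumed source, not from a test case), 16-bit arithmetic
like the following writes just the low half of a register, which can stall a
later full-width read on some processors:

short add16(short a, short b) {
  return a + b;   /* the 16-bit add may be emitted as an update of %ax only */
}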

//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and hint to the register
allocator. Delay codegen until post-register allocation.

//===---------------------------------------------------------------------===//

Count leading zeros and count trailing zeros:

int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }

$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
clz:
        bsr %eax, DWORD PTR [%esp+4]
        xor %eax, 31
        ret
ctz:
        bsf %eax, DWORD PTR [%esp+4]
        ret

However, check that these are defined for 0 and 32. Our intrinsics are; GCC's
aren't.

//===---------------------------------------------------------------------===//

Use push/pop instructions in prolog/epilog sequences instead of stores off
ESP (certain code size win, perf win on some [which?] processors).
Also, it appears icc uses push for parameter passing. Need to investigate.

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 because they update only some of the
processor flags.

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The DAG combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
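
A hypothetical example (assumed, not from a test case) of source that produces
the (cmp reg, (load p)) pattern; whether the load folds currently depends on
which side of the compare it ends up on:

int cmp_load(int x, int *p) {
  return x > *p;   /* compare of a register against a loaded value */
}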

//===---------------------------------------------------------------------===//

How about intrinsics? An example is:
  *res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));

compiles to:
        pmuludq (%eax), %xmm0
        movl 8(%esp), %eax
        movdqa (%eax), %xmm1
        pmulhuw %xmm0, %xmm1

The transformation probably requires an X86-specific pass or a DAG combiner
target-specific hook.

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setl %al
        movzbl %al, %eax
        ret

On some processors (which ones?), it is more efficient to do this:

_test:
        movl 8(%esp), %ebx
        xor %eax, %eax
        cmpl %ebx, 4(%esp)
        setl %al
        ret

Doing this correctly is tricky though, as the xor clobbers the flags.

//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:

void setbit(int *target, int bit) {
    *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
    *target &= ~(1 << bit);
}

//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

        movl $16843009, 4(%edx)
        movl $16843009, (%edx)
        movw $257, 8(%edx)

It might be better to generate:

        movl $16843009, %eax
        movl %eax, 4(%edx)
        movl %eax, (%edx)
        movw %ax, 8(%edx)

when we can spare a register. It reduces code size.
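
For reference, a sketch of source (assumed, not from a test case) that produces
the ten-byte memset above; 16843009 is 0x01010101, the dword splat of the byte
0x01:

#include <string.h>

void init10(char *p) {
  memset(p, 1, 10);   /* two dword stores plus one word store of the pattern */
}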

//===---------------------------------------------------------------------===//

Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
get this:

int %test1(int %X) {
        %Y = div int %X, 8
        ret int %Y
}

_test1:
        movl 4(%esp), %eax
        movl %eax, %ecx
        sarl $31, %ecx
        shrl $29, %ecx
        addl %ecx, %eax
        sarl $3, %eax
        ret

GCC knows several different ways to codegen it, one of which is this:

_test1:
        movl    4(%esp), %eax
        cmpl    $-1, %eax
        leal    7(%eax), %ecx
        cmovle  %ecx, %eax
        sarl    $3, %eax
        ret

which is probably slower, but it's interesting at least :)

//===---------------------------------------------------------------------===//

Should generate min/max for stuff like:

void minf(float a, float b, float *X) {
  *X = a <= b ? a : b;
}

Make use of floating point min / max instructions. Perhaps introduce ISD::FMIN
and ISD::FMAX node types?

//===---------------------------------------------------------------------===//

The first BB of this code:

declare bool %foo()
int %bar() {
        %V = call bool %foo()
        br bool %V, label %T, label %F
T:
        ret int 1
F:
        call bool %foo()
        ret int 12
}

compiles to:

_bar:
        subl $12, %esp
        call L_foo$stub
        xorb $1, %al
        testb %al, %al
        jne LBB_bar_2   # F

It would be better to emit "cmp %al, 1" than an xor and test.

//===---------------------------------------------------------------------===//

Enable X86InstrInfo::convertToThreeAddress().

//===---------------------------------------------------------------------===//

We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl. We should leave these as libcalls for everything over a much lower
threshold, since libc is hand tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc.).
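
As a hypothetical example, a call like the following should stay a libcall
rather than being expanded inline:

#include <string.h>

void big_copy(char *dst, const char *src) {
  memcpy(dst, src, 2 * 1024 * 1024);   /* large copy: libc's tuned copy wins */
}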

//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
 x * copysign(1.0, y) * copysign(1.0, z)

//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.
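
A sketch of the idea in C, assuming IEEE single precision and using memcpy for
the bit reinterpretation (this illustrates the transformation, not the intended
final codegen):

#include <string.h>

float copysign_via_int(float x, const float *y) {
  unsigned xi, yi;
  memcpy(&xi, &x, 4);   /* reinterpret the bits of x */
  memcpy(&yi, y, 4);    /* integer load of *y; only its sign bit is needed */
  xi = (xi & 0x7fffffffU) | (yi & 0x80000000U);
  memcpy(&x, &xi, 4);
  return x;
}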

//===---------------------------------------------------------------------===//

%X = weak global int 0

void %foo(int %N) {
        %N = cast int %N to uint
        %tmp.24 = setgt int %N, 0
        br bool %tmp.24, label %no_exit, label %return

no_exit:
        %indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
        %i.0.0 = cast uint %indvar to int
        volatile store int %i.0.0, int* %X
        %indvar.next = add uint %indvar, 1
        %exitcond = seteq uint %indvar.next, %N
        br bool %exitcond, label %return, label %no_exit

return:
        ret void
}

compiles into:

        .text
        .align  4
        .globl  _foo
_foo:
        movl 4(%esp), %eax
        cmpl $1, %eax
        jl LBB_foo_4    # return
LBB_foo_1:      # no_exit.preheader
        xorl %ecx, %ecx
LBB_foo_2:      # no_exit
        movl L_X$non_lazy_ptr, %edx
        movl %ecx, (%edx)
        incl %ecx
        cmpl %eax, %ecx
        jne LBB_foo_2   # no_exit
LBB_foo_3:      # return.loopexit
LBB_foo_4:      # return
        ret

We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
rematerialization is implemented. This can be accomplished with 1) a
target-dependent LICM pass or 2) making the SelectionDAG represent the whole
function.

//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.

//===---------------------------------------------------------------------===//

Teach the coalescer to coalesce vregs of different register classes, e.g. FR32 /
FR64 to VR128.

//===---------------------------------------------------------------------===//

mov $reg, 48(%esp)
...
leal 48(%esp), %eax
mov %eax, (%esp)
call _foo

Obviously it would have been better for the first mov (or any op) to store
directly to (%esp) if there are no other uses.

//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))
    return 3;
  else
    return 4;
}

_test:
        movl 8(%esp), %eax
        movaps (%eax), %xmm0
        movl 4(%esp), %eax
        movaps (%eax), %xmm1
        comiss %xmm0, %xmm1
        setae %al
        movzbl %al, %ecx
        movl $3, %eax
        movl $4, %edx
        cmpl $0, %ecx
        cmove %edx, %eax
        ret

Note that the setae, movzbl, cmpl, cmove can be replaced with a single cmovae.
There are a number of issues. 1) We are introducing a setcc between the result
of the intrinsic call and the select. 2) The intrinsic is expected to produce
an i32 value so an any_extend (which becomes a zero_extend) is added.

We probably need some kind of target DAG combine hook to fix this.

//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case where we do worse on PPC.

//===---------------------------------------------------------------------===//

If shorter, we should use things like:
movzwl %ax, %eax
instead of:
andl $65535, %eax

The former can also be used when the two-address nature of the 'and' would
require a copy to be inserted (in X86InstrInfo::convertToThreeAddress).

//===---------------------------------------------------------------------===//

Bad codegen:

char foo(int x) { return x; }

_foo:
        movl 4(%esp), %eax
        shll $24, %eax
        sarl $24, %eax
        ret

SIGN_EXTEND_INREG can be implemented as (sext (trunc)) to take advantage of
sub-registers.

//===---------------------------------------------------------------------===//

Consider this:

typedef struct pair { float A, B; } pair;
void pairtest(pair P, float *FP) {
        *FP = P.A+P.B;
}

We currently generate this code with llvmgcc4:

_pairtest:
        subl $12, %esp
        movl 20(%esp), %eax
        movl %eax, 4(%esp)
        movl 16(%esp), %eax
        movl %eax, (%esp)
        movss (%esp), %xmm0
        addss 4(%esp), %xmm0
        movl 24(%esp), %eax
        movss %xmm0, (%eax)
        addl $12, %esp
        ret

We should be able to generate:
_pairtest:
        movss 4(%esp), %xmm0
        movl 12(%esp), %eax
        addss 8(%esp), %xmm0
        movss %xmm0, (%eax)
        ret

The issue is that llvmgcc4 is forcing the struct to memory, then passing it as
integer chunks. It does this so that structs like {short,short} are passed in
a single 32-bit integer stack slot. We should handle the safe cases above much
more nicely, while still handling the hard cases.

//===---------------------------------------------------------------------===//

Another instruction selector deficiency:

void %bar() {
        %tmp = load int (int)** %foo
        %tmp = tail call int %tmp( int 3 )
        ret void
}

_bar:
        subl $12, %esp
        movl L_foo$non_lazy_ptr, %eax
        movl (%eax), %eax
        call *%eax
        addl $12, %esp
        ret

The current isel scheme will not allow the load to be folded into the call
since the load's chain result is read by the callseq_start.

//===---------------------------------------------------------------------===//

Don't forget to find a way to squash noop truncates in the JIT environment.

//===---------------------------------------------------------------------===//

Implement anyext in the same manner as truncate so that they can be
eliminated.

//===---------------------------------------------------------------------===//

How about implementing truncate / anyext as a property of a machine instruction
operand? i.e. print as a 32-bit super-class register / 16-bit sub-class
register. Do this for the cases where a truncate / anyext is guaranteed to be
eliminated. For IA32 that is truncate from 32 to 16 and anyext from 16 to 32.

//===---------------------------------------------------------------------===//

For this:

int test(int a)
{
  return a * 3;
}

We currently emit:
        imull $3, 4(%esp), %eax

Perhaps this is what we really should generate? Is imull three or four
cycles? Note: ICC generates this:
        movl 4(%esp), %eax
        leal (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity. The former is
more "complex" because it folds a load, so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
should always try to match LEA first since the LEA matching code does some
estimate to determine whether the match is profitable.

However, if we care more about code size, then imull is better. It's two bytes
shorter than movl + leal.

//===---------------------------------------------------------------------===//

Implement CTTZ, CTLZ with bsf and bsr.

//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.

//===---------------------------------------------------------------------===//

We should handle __attribute__ ((__visibility__ ("hidden"))).
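
For reference, the attribute as it appears in source (standard GCC syntax; the
declarations are made up for illustration):

int hidden_counter __attribute__ ((__visibility__ ("hidden")));
void hidden_helper(void) __attribute__ ((__visibility__ ("hidden")));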

//===---------------------------------------------------------------------===//

int %foo(int* %a, int %t) {
entry:
        br label %cond_true

cond_true:              ; preds = %cond_true, %entry
        %x.0.0 = phi int [ 0, %entry ], [ %tmp9, %cond_true ]          ; <int> [#uses=3]
        %t_addr.0.0 = phi int [ %t, %entry ], [ %tmp7, %cond_true ]    ; <int> [#uses=1]
        %tmp2 = getelementptr int* %a, int %x.0.0               ; <int*> [#uses=1]
        %tmp3 = load int* %tmp2         ; <int> [#uses=1]
        %tmp5 = add int %t_addr.0.0, %x.0.0             ; <int> [#uses=1]
        %tmp7 = add int %tmp5, %tmp3            ; <int> [#uses=2]
        %tmp9 = add int %x.0.0, 1               ; <int> [#uses=2]
        %tmp = setgt int %tmp9, 39              ; <bool> [#uses=1]
        br bool %tmp, label %bb12, label %cond_true

bb12:           ; preds = %cond_true
        ret int %tmp7
}

is pessimized by -loop-reduce and -indvars.

//===---------------------------------------------------------------------===//

Use cpuid to auto-detect CPU features such as SSE, SSE2, and SSE3.
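
A minimal sketch of the detection (assumed code, GCC-style inline asm), using
the standard CPUID leaf-1 feature bits: EDX bit 25 = SSE, EDX bit 26 = SSE2,
ECX bit 0 = SSE3. A real implementation would also need to preserve %ebx when
compiling PIC:

static void detect_sse(int *sse, int *sse2, int *sse3) {
  unsigned eax, ebx, ecx, edx;
  __asm__ __volatile__("cpuid"
                       : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                       : "a"(1));
  *sse  = (edx >> 25) & 1;   /* SSE  */
  *sse2 = (edx >> 26) & 1;   /* SSE2 */
  *sse3 = ecx & 1;           /* SSE3 */
}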

//===---------------------------------------------------------------------===//

u32 to float conversion improvement:

float uint32_2_float( unsigned u ) {
  float fl = (int) (u & 0xffff);
  float fh = (int) (u >> 16);
  fh *= 0x1.0p16f;
  return fh + fl;
}

00000000        subl    $0x04,%esp
00000003        movl    0x08(%esp,1),%eax
00000007        movl    %eax,%ecx
00000009        shrl    $0x10,%ecx
0000000c        cvtsi2ss        %ecx,%xmm0
00000010        andl    $0x0000ffff,%eax
00000015        cvtsi2ss        %eax,%xmm1
00000019        mulss   0x00000078,%xmm0
00000021        addss   %xmm1,%xmm0
00000025        movss   %xmm0,(%esp,1)
0000002a        flds    (%esp,1)
0000002d        addl    $0x04,%esp
00000030        ret

//===---------------------------------------------------------------------===//

When using the fastcc ABI, align the stack slot of a double argument on an
8-byte boundary to improve performance.

//===---------------------------------------------------------------------===//

Codegen:

int f(int a, int b) {
  if (a == 4 || a == 6)
    b++;
  return b;
}

as:

or eax, 2
cmp eax, 6
jz label