//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

Add MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (combination of MUL and MULH[SU] into one node). Add this to
X86, and make the dag combiner produce it when needed. This will eliminate one
imul from the code generated for:

long long test(long long X, long long Y) { return X*Y; }

by using the EAX result from the mul. We should add a similar node for
DIVREM.

Another case is:

long long test(int X, int Y) { return (long long)X*Y; }

... which should only be one imul instruction.
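
A sketch of the code we would like for the widening case (assuming the usual
32-bit cdecl stack arguments, with the result returned in EDX:EAX):

        movl 4(%esp), %eax
        imull 8(%esp)            # one-operand imul: EDX:EAX = EAX * [mem32]
        ret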

//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
        return X/Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
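
A sketch of the desired single-div output; this is exactly where the overflow
issue noted above bites, since divl faults if the quotient does not fit in 32
bits:

        movl 4(%esp), %eax       # lo half of X
        movl 8(%esp), %edx       # hi half of X
        divl 12(%esp)            # EAX = EDX:EAX / Y, EDX = remainder
        ret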

//===---------------------------------------------------------------------===//

Some targets (e.g. Athlons) prefer ffreep to fstp ST(0):
http://gcc.gnu.org/ml/gcc-patches/2004-04/msg00659.html

//===---------------------------------------------------------------------===//

This should use fiadd on chips where it is profitable:
double foo(double P, int *I) { return P+*I; }

We have fiadd patterns now, but the following two have the same cost and
complexity. We need a way to specify that the latter is more profitable.

def FpADD32m  : FpI<(ops RFP:$dst, RFP:$src1, f32mem:$src2), OneArgFPRW,
                    [(set RFP:$dst, (fadd RFP:$src1,
                                     (extloadf64f32 addr:$src2)))]>;
                // ST(0) = ST(0) + [mem32]

def FpIADD32m : FpI<(ops RFP:$dst, RFP:$src1, i32mem:$src2), OneArgFPRW,
                    [(set RFP:$dst, (fadd RFP:$src1,
                                     (X86fild addr:$src2, i32)))]>;
                // ST(0) = ST(0) + [mem32int]

//===---------------------------------------------------------------------===//

The FP stackifier needs to be global. Also, it should handle simple
permutations to reduce the number of shuffle instructions, e.g. turning:

fld P    ->   fld Q
fld Q         fld P
fxch

or:

fxch     ->   fucomi
fucomi        jl X
jg X

Ideas:
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02410.html

//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.

//===---------------------------------------------------------------------===//

Compile this:
_Bool f(_Bool a) { return a!=1; }

into:
        movzbl %dil, %eax
        xorl $1, %eax
        ret

//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure. E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?
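
For example (a hypothetical illustration), code like:

        movw 4(%esp), %ax        # writes only the low 16 bits of %eax
        addw 8(%esp), %ax

leaves a partial write of %eax; a later read of the full register stalls on
some processors while the partial result is merged. Promoting to i32
(movzwl + addl) avoids the stall.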

//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and hint to the register
allocator. Delay codegen until post register allocation.

//===---------------------------------------------------------------------===//

Add a target specific hook to DAG combiner to handle SINT_TO_FP and
FP_TO_SINT when the source operand is already in memory.

//===---------------------------------------------------------------------===//

Model X86 EFLAGS as a real register to avoid redundant cmp / test. e.g.

        cmpl $1, %eax
        setg %al
        testb %al, %al           # unnecessary
        jne .BB7
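
With EFLAGS modeled, the setg/testb pair disappears and we branch on the
original compare directly:

        cmpl $1, %eax
        jg .BB7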

//===---------------------------------------------------------------------===//

Count leading zeros and count trailing zeros:

int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }

$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
clz:
        bsr %eax, DWORD PTR [%esp+4]
        xor %eax, 31
        ret
ctz:
        bsf %eax, DWORD PTR [%esp+4]
        ret

However, check that these are defined for 0 and 32. Our intrinsics are,
GCC's aren't.

//===---------------------------------------------------------------------===//

Use push/pop instructions in prolog/epilog sequences instead of stores off
ESP (certain code size win, perf win on some [which?] processors).
Also, it appears icc uses push for parameter passing. Need to investigate.
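
e.g. spilling a callee-saved register in the prologue (a sketch):

        subl $4, %esp            # current: two instructions, six bytes
        movl %esi, (%esp)

        pushl %esi               # preferred: one single-byte instruction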

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 due to only updating some processor
flags.

//===---------------------------------------------------------------------===//

Open code rint, floor, ceil, trunc:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02006.html
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02011.html

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers. See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.

How about intrinsics? An example is:
  *res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));

compiles to
        pmuludq (%eax), %xmm0
        movl 8(%esp), %eax
        movdqa (%eax), %xmm1
        pmulhuw %xmm0, %xmm1

The transformation probably requires an X86-specific pass or a DAG combiner
target specific hook.

//===---------------------------------------------------------------------===//

LSR should be turned on for the X86 backend and tuned to take advantage of its
addressing modes.

//===---------------------------------------------------------------------===//

When compiled with unsafe math enabled, "main" should enable SSE DAZ mode and
other fast SSE modes.

//===---------------------------------------------------------------------===//

Think about doing i64 math in SSE regs.
| 219 | |
Chris Lattner | 8e38ae6 | 2006-01-31 02:10:06 +0000 | [diff] [blame] | 220 | //===---------------------------------------------------------------------===// |
| 221 | |
| 222 | The DAG Isel doesn't fold the loads into the adds in this testcase. The |
| 223 | pattern selector does. This is because the chain value of the load gets |
| 224 | selected first, and the loads aren't checking to see if they are only used by |
| 225 | and add. |
| 226 | |
| 227 | .ll: |
| 228 | |
| 229 | int %test(int* %x, int* %y, int* %z) { |
| 230 | %X = load int* %x |
| 231 | %Y = load int* %y |
| 232 | %Z = load int* %z |
| 233 | %a = add int %X, %Y |
| 234 | %b = add int %a, %Z |
| 235 | ret int %b |
| 236 | } |
| 237 | |
| 238 | dag isel: |
| 239 | |
| 240 | _test: |
| 241 | movl 4(%esp), %eax |
| 242 | movl (%eax), %eax |
| 243 | movl 8(%esp), %ecx |
| 244 | movl (%ecx), %ecx |
| 245 | addl %ecx, %eax |
| 246 | movl 12(%esp), %ecx |
| 247 | movl (%ecx), %ecx |
| 248 | addl %ecx, %eax |
| 249 | ret |
| 250 | |
| 251 | pattern isel: |
| 252 | |
| 253 | _test: |
| 254 | movl 12(%esp), %ecx |
| 255 | movl 4(%esp), %edx |
| 256 | movl 8(%esp), %eax |
| 257 | movl (%eax), %eax |
| 258 | addl (%edx), %eax |
| 259 | addl (%ecx), %eax |
| 260 | ret |
| 261 | |
| 262 | This is bad for register pressure, though the dag isel is producing a |
| 263 | better schedule. :) |

//===---------------------------------------------------------------------===//

This testcase should have no SSE instructions in it, and only one load from
a constant pool:

double %test3(bool %B) {
        %C = select bool %B, double 123.412, double 523.01123123
        ret double %C
}

Currently, the select is being lowered, which prevents the dag combiner from
turning 'select (load CPI1), (load CPI2)' -> 'load (select CPI1, CPI2)'

The pattern isel got this one right.

//===---------------------------------------------------------------------===//

We need to lower switch statements to tablejumps when appropriate instead of
always lowering them into binary branch trees.
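
e.g. for a dense switch over 0..3, something like (label names made up):

        cmpl $3, %eax
        ja LBB_default
        jmpl *LJTI_0(,%eax,4)    # indirect jump through a 4-entry label table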

//===---------------------------------------------------------------------===//

SSE doesn't have [mem] op= reg instructions. If we have an SSE instruction
like this:

  X += y

and the register allocator decides to spill X, it is cheaper to emit this as:

  Y += [xslot]
  store Y -> [xslot]

than as:

  tmp = [xslot]
  tmp += y
  store tmp -> [xslot]

...and this uses one fewer register (so this should be done at load folding
time, not at spiller time). *Note* however that this can only be done
if Y is dead. Here's a testcase:

%.str_3 = external global [15 x sbyte]          ; <[15 x sbyte]*> [#uses=0]
implementation   ; Functions:
declare void %printf(int, ...)
void %main() {
build_tree.exit:
        br label %no_exit.i7
no_exit.i7:             ; preds = %no_exit.i7, %build_tree.exit
        %tmp.0.1.0.i9 = phi double [ 0.000000e+00, %build_tree.exit ], [ %tmp.34.i18, %no_exit.i7 ]            ; <double> [#uses=1]
        %tmp.0.0.0.i10 = phi double [ 0.000000e+00, %build_tree.exit ], [ %tmp.28.i16, %no_exit.i7 ]           ; <double> [#uses=1]
        %tmp.28.i16 = add double %tmp.0.0.0.i10, 0.000000e+00
        %tmp.34.i18 = add double %tmp.0.1.0.i9, 0.000000e+00
        br bool false, label %Compute_Tree.exit23, label %no_exit.i7
Compute_Tree.exit23:            ; preds = %no_exit.i7
        tail call void (int, ...)* %printf( int 0 )
        store double %tmp.34.i18, double* null
        ret void
}

We currently emit:

.BBmain_1:
        xorpd %XMM1, %XMM1
        addsd %XMM0, %XMM1
***     movsd %XMM2, QWORD PTR [%ESP + 8]
***     addsd %XMM2, %XMM1
***     movsd QWORD PTR [%ESP + 8], %XMM2
        jmp .BBmain_1   # no_exit.i7

This is a bugpoint-reduced testcase, which is why it doesn't make much sense
(e.g. it's an infinite loop). :)

//===---------------------------------------------------------------------===//

None of the FPStack instructions are handled in
X86RegisterInfo::foldMemoryOperand, which prevents the spiller from
folding spill code into the instructions.

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setl %al
        movzbl %al, %eax
        ret

On some processors (which ones?), it is more efficient to do this:

_test:
        movl 8(%esp), %ebx
        xor %eax, %eax
        cmpl %ebx, 4(%esp)
        setl %al
        ret

Doing this correctly is tricky though, as the xor clobbers the flags.

//===---------------------------------------------------------------------===//

We should generate 'test' instead of 'cmp' in various cases, e.g.:

bool %test(int %X) {
        %Y = shl int %X, ubyte 1
        %C = seteq int %Y, 0
        ret bool %C
}
bool %test(int %X) {
        %Y = and int %X, 8
        %C = seteq int %Y, 0
        ret bool %C
}

This may just be a matter of using 'test' to write bigger patterns for X86cmp.
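
For the second example the desired output is roughly:

        testl $8, 4(%esp)        # test the bit directly; no and needed
        sete %al
        ret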

//===---------------------------------------------------------------------===//

SSE should implement 'select_cc' using 'emulated conditional moves' that use
pcmp/pand/pandn/por to do a selection instead of a conditional branch:

double %X(double %Y, double %Z, double %A, double %B) {
        %C = setlt double %A, %B
        %z = add double %Z, 0.0    ;; select operand is not a load
        %D = select bool %C, double %Y, double %z
        ret double %D
}

We currently emit:

_X:
        subl $12, %esp
        xorpd %xmm0, %xmm0
        addsd 24(%esp), %xmm0
        movsd 32(%esp), %xmm1
        movsd 16(%esp), %xmm2
        ucomisd 40(%esp), %xmm1
        jb LBB_X_2
LBB_X_1:
        movsd %xmm0, %xmm2
LBB_X_2:
        movsd %xmm2, (%esp)
        fldl (%esp)
        addl $12, %esp
        ret

//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:

void setbit(int *target, int bit) {
   *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
   *target &= ~(1 << bit);
}
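
For setbit, something like (a sketch; bts's bit-string semantics for offsets
past bit 31 are acceptable here because the C shift is undefined there anyway):

        movl 8(%esp), %ecx       # bit
        movl 4(%esp), %eax       # target
        btsl %ecx, (%eax)        # bit test-and-set in memory
        ret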

//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

        movl $16843009, 4(%edx)
        movl $16843009, (%edx)
        movw $257, 8(%edx)

It might be better to generate

        movl $16843009, %eax
        movl %eax, 4(%edx)
        movl %eax, (%edx)
        movw %ax, 8(%edx)

when we can spare a register. It reduces code size.

//===---------------------------------------------------------------------===//

It's not clear whether we should use pxor or xorps / xorpd to clear XMM
registers. The choice may depend on subtarget information. We should do some
more experiments on different x86 machines.

//===---------------------------------------------------------------------===//

Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
get this:

int %test1(int %X) {
        %Y = div int %X, 8
        ret int %Y
}

_test1:
        movl 4(%esp), %eax
        movl %eax, %ecx
        sarl $31, %ecx
        shrl $29, %ecx
        addl %ecx, %eax
        sarl $3, %eax
        ret

GCC knows several different ways to codegen it, one of which is this:

_test1:
        movl 4(%esp), %eax
        cmpl $-1, %eax
        leal 7(%eax), %ecx
        cmovle %ecx, %eax
        sarl $3, %eax
        ret

which is probably slower, but it's interesting at least :)

//===---------------------------------------------------------------------===//

Currently the x86 codegen isn't very good at mixing SSE and FPStack
code:

unsigned int foo(double x) { return x; }

foo:
        subl $20, %esp
        movsd 24(%esp), %xmm0
        movsd %xmm0, 8(%esp)
        fldl 8(%esp)
        fisttpll (%esp)
        movl (%esp), %eax
        addl $20, %esp
        ret

This will be solved when we go to a dynamic programming based isel.

//===---------------------------------------------------------------------===//

Should generate min/max for stuff like:

void minf(float a, float b, float *X) {
  *X = a <= b ? a : b;
}

Make use of floating point min / max instructions. Perhaps introduce ISD::FMIN
and ISD::FMAX node types?
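
A sketch of what minf could become (minss returns its second operand on
unordered input, which matches this select, modulo the usual signed-zero
caveat):

_minf:
        movss 4(%esp), %xmm0
        minss 8(%esp), %xmm0
        movl 12(%esp), %eax
        movss %xmm0, (%eax)
        ret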

//===---------------------------------------------------------------------===//

The first BB of this code:

declare bool %foo()
int %bar() {
        %V = call bool %foo()
        br bool %V, label %T, label %F
T:
        ret int 1
F:
        call bool %foo()
        ret int 12
}

compiles to:

_bar:
        subl $12, %esp
        call L_foo$stub
        xorb $1, %al
        testb %al, %al
        jne LBB_bar_2   # F

It would be better to emit "cmp %al, 1" than an xor and test.
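
i.e.:

        call L_foo$stub
        cmpb $1, %al
        jne LBB_bar_2   # F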

//===---------------------------------------------------------------------===//

Enable X86InstrInfo::convertToThreeAddress().

//===---------------------------------------------------------------------===//

Investigate whether it is better to codegen the following:

        %tmp.1 = mul int %x, 9
to

        movl 4(%esp), %eax
        leal (%eax,%eax,8), %eax

as opposed to what llc is currently generating:

        imull $9, 4(%esp), %eax

Currently the load folding imull has a higher complexity than the LEA32
pattern.

//===---------------------------------------------------------------------===//

We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl. We should leave these as libcalls for everything over a much lower
threshold, since libc is hand tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc).

//===---------------------------------------------------------------------===//

Lower memcpy / memset to a series of SSE 128 bit move instructions when it's
feasible.
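
e.g. a 64-byte copy could become (a sketch, assuming both pointers are known
16-byte aligned):

        movaps (%esi), %xmm0
        movaps %xmm0, (%edi)
        movaps 16(%esi), %xmm0
        movaps %xmm0, 16(%edi)
        movaps 32(%esi), %xmm0
        movaps %xmm0, 32(%edi)
        movaps 48(%esi), %xmm0
        movaps %xmm0, 48(%edi)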

//===---------------------------------------------------------------------===//

Teach the coalescer to commute 2-addr instructions, allowing us to eliminate
the reg-reg copy in this example:

float foo(int *x, float *y, unsigned c) {
  float res = 0.0;
  unsigned i;
  for (i = 0; i < c; i++) {
    float xx = (float)x[i];
    xx = xx * y[i];
    xx += res;
    res = xx;
  }
  return res;
}

LBB_foo_3:      # no_exit
        cvtsi2ss %XMM0, DWORD PTR [%EDX + 4*%ESI]
        mulss %XMM0, DWORD PTR [%EAX + 4*%ESI]
        addss %XMM0, %XMM1
        inc %ESI
        cmp %ESI, %ECX
****    movaps %XMM1, %XMM0
        jb LBB_foo_3    # no_exit

//===---------------------------------------------------------------------===//

Codegen:
   if (copysign(1.0, x) == copysign(1.0, y))
into:
   if ((x^y) & sign-mask)
when using SSE.
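
i.e. the signs are equal iff the sign bits match. A scalar C sketch of the
idea (function name hypothetical):

#include <string.h>

int same_sign(float x, float y) {
  unsigned ix, iy;
  memcpy(&ix, &x, sizeof ix);   /* bit-cast without aliasing trouble */
  memcpy(&iy, &y, sizeof iy);
  return ((ix ^ iy) & 0x80000000u) == 0;
}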

//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
 x * copysign(1.0, y) * copysign(1.0, z)

//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.

//===---------------------------------------------------------------------===//

%X = weak global int 0

void %foo(int %N) {
        %N = cast int %N to uint
        %tmp.24 = setgt int %N, 0
        br bool %tmp.24, label %no_exit, label %return

no_exit:
        %indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
        %i.0.0 = cast uint %indvar to int
        volatile store int %i.0.0, int* %X
        %indvar.next = add uint %indvar, 1
        %exitcond = seteq uint %indvar.next, %N
        br bool %exitcond, label %return, label %no_exit

return:
        ret void
}

compiles into:

        .text
        .align 4
        .globl _foo
_foo:
        movl 4(%esp), %eax
        cmpl $1, %eax
        jl LBB_foo_4    # return
LBB_foo_1:      # no_exit.preheader
        xorl %ecx, %ecx
LBB_foo_2:      # no_exit
        movl L_X$non_lazy_ptr, %edx
        movl %ecx, (%edx)
        incl %ecx
        cmpl %eax, %ecx
        jne LBB_foo_2   # no_exit
LBB_foo_3:      # return.loopexit
LBB_foo_4:      # return
        ret

We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
rematerialization is implemented. This can be accomplished with 1) a target
dependent LICM pass or 2) making the SelectionDAG represent the whole function.

//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.

//===---------------------------------------------------------------------===//

Teach the coalescer to coalesce vregs of different register classes, e.g.
FR32 / FR64 to VR128.

//===---------------------------------------------------------------------===//

        mov $reg, 48(%esp)
        ...
        leal 48(%esp), %eax
        mov %eax, (%esp)
        call _foo

Obviously it would have been better for the first mov (or any op) to store
directly to %esp[0] if there are no other uses.

//===---------------------------------------------------------------------===//

Use movhps to update the upper 64 bits of a v4sf value. Also movlps on the
lower half of a v4sf value.

//===---------------------------------------------------------------------===//

Better codegen for vector_shuffles like this { x, 0, 0, 0 } or { x, 0, x, 0}.
Perhaps use pxor / xorp* to clear an XMM register first?
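
For { x, 0, 0, 0 } with x already in the low element of %xmm0, a sketch:

        xorps %xmm1, %xmm1       # zero all four elements
        movss %xmm0, %xmm1       # insert x in the low element, keeping the zeros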

//===---------------------------------------------------------------------===//

Better codegen for:

void f(float a, float b, vector float * out) { *out = (vector float){ a, 0.0, 0.0, b}; }
void f(float a, float b, vector float * out) { *out = (vector float){ a, b, 0.0, 0}; }

For the latter we generate:

_f:
        pxor %xmm0, %xmm0
        movss 8(%esp), %xmm1
        movaps %xmm0, %xmm2
        unpcklps %xmm1, %xmm2
        movss 4(%esp), %xmm1
        unpcklps %xmm0, %xmm1
        unpcklps %xmm2, %xmm1
        movl 12(%esp), %eax
        movaps %xmm1, (%eax)
        ret

This seems like it should use shufps, one for each of a & b.

//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))
    return 3;
  else
    return 4;
}

_test:
        movl 8(%esp), %eax
        movaps (%eax), %xmm0
        movl 4(%esp), %eax
        movaps (%eax), %xmm1
        comiss %xmm0, %xmm1
        setae %al
        movzbl %al, %ecx
        movl $3, %eax
        movl $4, %edx
        cmpl $0, %ecx
        cmove %edx, %eax
        ret

Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae.
There are a number of issues. 1) We are introducing a setcc between the
result of the intrinsic call and the select. 2) The intrinsic is expected to
produce an i32 value so an any_extend (which becomes a zero_extend) is added.

We probably need some kind of target DAG combine hook to fix this.

//===---------------------------------------------------------------------===//

How to decide when to use the "floating point version" of logical ops? Here
are some code fragments:

        movaps LCPI5_5, %xmm2
        divps %xmm1, %xmm2
        mulps %xmm2, %xmm3
        mulps 8656(%ecx), %xmm3
        addps 8672(%ecx), %xmm3
        andps LCPI5_6, %xmm2
        andps LCPI5_1, %xmm3
        por %xmm2, %xmm3
        movdqa %xmm3, (%edi)

        movaps LCPI5_5, %xmm1
        divps %xmm0, %xmm1
        mulps %xmm1, %xmm3
        mulps 8656(%ecx), %xmm3
        addps 8672(%ecx), %xmm3
        andps LCPI5_6, %xmm1
        andps LCPI5_1, %xmm3
        orps %xmm1, %xmm3
        movaps %xmm3, 112(%esp)
        movaps %xmm3, (%ebx)

Due to some minor source change, the latter case ended up using orps and
movaps instead of por and movdqa. Does it matter?

//===---------------------------------------------------------------------===//

Use movddup to splat a v2f64 directly from a memory source. e.g.

#include <emmintrin.h>

void test(__m128d *r, double A) {
  *r = _mm_set1_pd(A);
}

llc:

_test:
        movsd 8(%esp), %xmm0
        unpcklpd %xmm0, %xmm0
        movl 4(%esp), %eax
        movapd %xmm0, (%eax)
        ret

icc:

_test:
        movl 4(%esp), %eax
        movddup 8(%esp), %xmm0
        movapd %xmm0, (%eax)
        ret

//===---------------------------------------------------------------------===//

A Mac OS X IA-32 specific ABI bug wrt returning value > 8 bytes:
http://llvm.org/bugs/show_bug.cgi?id=729

//===---------------------------------------------------------------------===//

X86RegisterInfo::copyRegToReg() returns X86::MOVAPSrr for VR128. Is it possible
to choose between movaps, movapd, and movdqa based on types of source and
destination?

How about andps, andpd, and pand? Do we really care about the type of the
packed elements? If not, why not always use the "ps" variants, which are one
byte shorter (no 0x66 prefix)?

//===---------------------------------------------------------------------===//

We are emitting bad code for this:

float %test(float* %V, int %I, int %D, float %V) {
entry:
        %tmp = seteq int %D, 0
        br bool %tmp, label %cond_true, label %cond_false23

cond_true:
        %tmp3 = getelementptr float* %V, int %I
        %tmp = load float* %tmp3
        %tmp5 = setgt float %tmp, %V
        %tmp6 = tail call bool %llvm.isunordered.f32( float %tmp, float %V )
        %tmp7 = or bool %tmp5, %tmp6
        br bool %tmp7, label %UnifiedReturnBlock, label %cond_next

cond_next:
        %tmp10 = add int %I, 1
        %tmp12 = getelementptr float* %V, int %tmp10
        %tmp13 = load float* %tmp12
        %tmp15 = setle float %tmp13, %V
        %tmp16 = tail call bool %llvm.isunordered.f32( float %tmp13, float %V )
        %tmp17 = or bool %tmp15, %tmp16
        %retval = select bool %tmp17, float 0.000000e+00, float 1.000000e+00
        ret float %retval

cond_false23:
        %tmp28 = tail call float %foo( float* %V, int %I, int %D, float %V )
        ret float %tmp28

UnifiedReturnBlock:             ; preds = %cond_true
        ret float 0.000000e+00
}

declare bool %llvm.isunordered.f32(float, float)

declare float %foo(float*, int, int, float)

It exposes a known load folding problem:

        movss (%edx,%ecx,4), %xmm1
        ucomiss %xmm1, %xmm0

As well as this:

LBB_test_2:     # cond_next
        movss LCPI1_0, %xmm2
        pxor %xmm3, %xmm3
        ucomiss %xmm0, %xmm1
        jbe LBB_test_6  # cond_next
LBB_test_5:     # cond_next
        movaps %xmm2, %xmm3
LBB_test_6:     # cond_next
        movss %xmm3, 40(%esp)
        flds 40(%esp)
        addl $44, %esp
        ret

Clearly it's unnecessary to clear %xmm3. It's also not clear why we are
emitting three moves (movss, movaps, movss).

//===---------------------------------------------------------------------===//

External test Nurbs exposed some problems. Look for
__ZN15Nurbs_SSE_Cubic17TessellateSurfaceE, bb cond_next140. This is what icc
emits:

        movaps (%edx), %xmm2                            #59.21
        movaps (%edx), %xmm5                            #60.21
        movaps (%edx), %xmm4                            #61.21
        movaps (%edx), %xmm3                            #62.21
        movl 40(%ecx), %ebp                             #69.49
        shufps $0, %xmm2, %xmm5                         #60.21
        movl 100(%esp), %ebx                            #69.20
        movl (%ebx), %edi                               #69.20
        imull %ebp, %edi                                #69.49
        addl (%eax), %edi                               #70.33
        shufps $85, %xmm2, %xmm4                        #61.21
        shufps $170, %xmm2, %xmm3                       #62.21
        shufps $255, %xmm2, %xmm2                       #63.21
        lea (%ebp,%ebp,2), %ebx                         #69.49
        negl %ebx                                       #69.49
        lea -3(%edi,%ebx), %ebx                         #70.33
        shll $4, %ebx                                   #68.37
        addl 32(%ecx), %ebx                             #68.37
        testb $15, %bl                                  #91.13
        jne L_B1.24     # Prob 5%                       #91.13

This is the llvm code after instruction scheduling:

cond_next140 (0xa910740, LLVM BB @0xa90beb0):
        %reg1078 = MOV32ri -3
        %reg1079 = ADD32rm %reg1078, %reg1068, 1, %NOREG, 0
        %reg1037 = MOV32rm %reg1024, 1, %NOREG, 40
        %reg1080 = IMUL32rr %reg1079, %reg1037
        %reg1081 = MOV32rm %reg1058, 1, %NOREG, 0
        %reg1038 = LEA32r %reg1081, 1, %reg1080, -3
        %reg1036 = MOV32rm %reg1024, 1, %NOREG, 32
        %reg1082 = SHL32ri %reg1038, 4
        %reg1039 = ADD32rr %reg1036, %reg1082
        %reg1083 = MOVAPSrm %reg1059, 1, %NOREG, 0
        %reg1034 = SHUFPSrr %reg1083, %reg1083, 170
        %reg1032 = SHUFPSrr %reg1083, %reg1083, 0
        %reg1035 = SHUFPSrr %reg1083, %reg1083, 255
        %reg1033 = SHUFPSrr %reg1083, %reg1083, 85
        %reg1040 = MOV32rr %reg1039
        %reg1084 = AND32ri8 %reg1039, 15
        CMP32ri8 %reg1084, 0
        JE mbb<cond_next204,0xa914d30>

Still ok. After register allocation:

cond_next140 (0xa910740, LLVM BB @0xa90beb0):
        %EAX = MOV32ri -3
        %EDX = MOV32rm <fi#3>, 1, %NOREG, 0
        ADD32rm %EAX<def&use>, %EDX, 1, %NOREG, 0
        %EDX = MOV32rm <fi#7>, 1, %NOREG, 0
        %EDX = MOV32rm %EDX, 1, %NOREG, 40
        IMUL32rr %EAX<def&use>, %EDX
        %ESI = MOV32rm <fi#5>, 1, %NOREG, 0
        %ESI = MOV32rm %ESI, 1, %NOREG, 0
        MOV32mr <fi#4>, 1, %NOREG, 0, %ESI
        %EAX = LEA32r %ESI, 1, %EAX, -3
        %ESI = MOV32rm <fi#7>, 1, %NOREG, 0
        %ESI = MOV32rm %ESI, 1, %NOREG, 32
        %EDI = MOV32rr %EAX
        SHL32ri %EDI<def&use>, 4
        ADD32rr %EDI<def&use>, %ESI
        %XMM0 = MOVAPSrm %ECX, 1, %NOREG, 0
        %XMM1 = MOVAPSrr %XMM0
        SHUFPSrr %XMM1<def&use>, %XMM1, 170
        %XMM2 = MOVAPSrr %XMM0
        SHUFPSrr %XMM2<def&use>, %XMM2, 0
        %XMM3 = MOVAPSrr %XMM0
        SHUFPSrr %XMM3<def&use>, %XMM3, 255
        SHUFPSrr %XMM0<def&use>, %XMM0, 85
        %EBX = MOV32rr %EDI
        AND32ri8 %EBX<def&use>, 15
        CMP32ri8 %EBX, 0
        JE mbb<cond_next204,0xa914d30>

This looks really bad. The problem is that shufps is a destructive opcode:
because the same source register appears as operand two in more than one
shufps op, a number of copies result. Note icc also suffers from the same
problem. Either the instruction selector should select pshufd, or the
register allocator should perform the two-address to three-address
transformation.

It also exposes some other problems. See MOV32ri -3 and the spills.

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25500

LLVM is producing bad code.

LBB_main_4:     # cond_true44
        addps %xmm1, %xmm2
        subps %xmm3, %xmm2
        movaps (%ecx), %xmm4
        movaps %xmm2, %xmm1
        addps %xmm4, %xmm1
        addl $16, %ecx
        incl %edx
        cmpl $262144, %edx
        movaps %xmm3, %xmm2
        movaps %xmm4, %xmm3
        jne LBB_main_4  # cond_true44

There are two problems. 1) No need for two loop induction variables; we can
compare against 262144 * 16. 2) A known register coalescer issue. We should
be able to eliminate one of the movaps:

        addps %xmm2, %xmm1    <=== Commute!
        subps %xmm3, %xmm1
        movaps (%ecx), %xmm4
        movaps %xmm1, %xmm1   <=== Eliminate!
        addps %xmm4, %xmm1
        addl $16, %ecx
        incl %edx
        cmpl $262144, %edx
        movaps %xmm3, %xmm2
        movaps %xmm4, %xmm3
        jne LBB_main_4  # cond_true44

//===---------------------------------------------------------------------===//

Consider:

__m128 test(float a) {
  return _mm_set_ps(0.0, 0.0, 0.0, a*a);
}

This compiles into:

        movss 4(%esp), %xmm1
        mulss %xmm1, %xmm1
        xorps %xmm0, %xmm0
        movss %xmm1, %xmm0
        ret

Because mulss doesn't modify the top 3 elements, the top elements of
xmm1 are already zero'd. We could compile this to:

        movss 4(%esp), %xmm0
        mulss %xmm0, %xmm0
        ret

//===---------------------------------------------------------------------===//

Here's a sick and twisted idea. Consider code like this:

__m128 test(__m128 a) {
  float b = *(float*)&a;
  ...
  return _mm_set_ps(0.0, 0.0, 0.0, b);
}

This might compile to this code:

        movaps c(%esp), %xmm1
        xorps %xmm0, %xmm0
        movss %xmm1, %xmm0
        ret

Now consider if the ... code caused xmm1 to get spilled. This might produce
this code:

        movaps c(%esp), %xmm1
        movaps %xmm1, c2(%esp)
        ...

        xorps %xmm0, %xmm0
        movaps c2(%esp), %xmm1
        movss %xmm1, %xmm0
        ret

However, since the reload is only used by these instructions, we could
"fold" it into the uses, producing something like this:

        movaps c(%esp), %xmm1
        movaps %xmm1, c2(%esp)
        ...

        movss c2(%esp), %xmm0
        ret

... saving two instructions.

The basic idea is that a reload from a spill slot can, if only one 4-byte
chunk is used, bring in 3 zeros to the one element instead of 4 elements.
This can be used to simplify a variety of shuffle operations, where the
elements are fixed zeros.

//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case we do worse on PPC.

//===---------------------------------------------------------------------===//

For this:

#include <emmintrin.h>
void test(__m128d *r, __m128d *A, double B) {
  *r = _mm_loadl_pd(*A, &B);
}

We generate:

        subl $12, %esp
        movsd 24(%esp), %xmm0
        movsd %xmm0, (%esp)
        movl 20(%esp), %eax
        movapd (%eax), %xmm0
        movlpd (%esp), %xmm0
        movl 16(%esp), %eax
        movapd %xmm0, (%eax)
        addl $12, %esp
        ret

icc generates:

        movl 4(%esp), %edx                              #3.6
        movl 8(%esp), %eax                              #3.6
        movapd (%eax), %xmm0                            #4.22
        movlpd 12(%esp), %xmm0                          #4.8
        movapd %xmm0, (%edx)                            #4.3
        ret                                             #5.1

So icc is smart enough to know that B is in memory, so it doesn't load it and
store it back to the stack.