Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling conventions (e.g. x86) which have different
   clobber sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent.

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)  This
isn't safe in general, even on darwin.  See the libm implementation of hypot
for examples (which special case when x/y are exactly zero to get signed zeros
etc right).

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is that the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack.  But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
  x = 1ULL << i;

into:

long long tmp = 1;
for (i = ...; ++i, tmp += tmp)
  x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn X*X*X*X -> t=(X*X); t*t to eliminate a multiply.

//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

Reassociate should handle the example in GCC PR16157.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) { return memcmp(j, l, 4); }
int h(int *j, int *l) { return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1, 2, 4, and 8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

and teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works, but can be overly conservative, as the alignment
of specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3]};
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2      # cond_next
LBB16_1:        # cond_true
        incl _foo
LBB16_2:        # cond_next

emit:
        movl _foo, %eax
        cmpl $1, %edi
        sbbl $-1, %eax
        movl %eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs.  We could even make an intrinsic for this
if anyone cared enough about sincos.

//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

define void @test(i32* %P) {
  %tmp = load i32* %P
  %tmp14 = or i32 %tmp, 3305111552
  %tmp15 = and i32 %tmp14, 3321888767
  store i32 %tmp15, i32* %P
  ret void
}

//===---------------------------------------------------------------------===//

dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.

Compile:

int bar(int x)
{
  int t = __builtin_clz(x);
  return -(t>>5);
}

to:

_bar:   addic r3,r3,-1
        subfe r3,r3,r3
        blr

//===---------------------------------------------------------------------===//

Legalize should lower cttz like this:
  cttz(x) = popcnt((x-1) & ~x)

on targets that have popcnt but not cttz.  itanium, what else?

//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
        {
          /* Flip the target bit of each basis state */
          reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }

Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be
just so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias
reg->node[i], but alas...

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

typedef unsigned long long BITBOARD;
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X / C1 ), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

//===---------------------------------------------------------------------===//

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level.  We need a "loops to memcpy"
pass.

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

//===---------------------------------------------------------------------===//

Promote for i32 bswap can use i64 bswap + shr.  Useful on targets with 64-bit
regs and bswap, like itanium.

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two identical IV's (after promotion) on PPC/ARM:

LBB1_1: @bb.preheader
        mov r3, #0
        mov r2, r3
        mov r1, r3
LBB1_2: @bb
        ldr r12, LCPI1_0
        ldr r12, [r12]
        strh r2, [r12]
        ldr r12, LCPI1_1
        ldr r12, [r12]
        strh r3, [r12]
        add r1, r1, #1  <- [0,+,1]
        add r3, r3, #4
        add r2, r2, #1  <- [0,+,1]
        cmp r1, r0
        bne LBB1_2      @bb


//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
        %tmp.1 = and i32 %a, 1          ; <i32> [#uses=1]
        %tmp.2 = icmp ne i32 %tmp.1, 0          ; <i1> [#uses=1]
        br i1 %tmp.2, label %then.0, label %else.0

then.0:         ; preds = %entry
        %tmp.5 = add i32 %a, -1         ; <i32> [#uses=1]
        %tmp.3 = call i32 @t4( i32 %tmp.5 )             ; <i32> [#uses=1]
        br label %return

else.0:         ; preds = %entry
        %tmp.7 = icmp ne i32 %a, 0              ; <i1> [#uses=1]
        br i1 %tmp.7, label %then.1, label %return

then.1:         ; preds = %else.0
        %tmp.11 = add i32 %a, -2                ; <i32> [#uses=1]
        %tmp.9 = call i32 @t4( i32 %tmp.11 )            ; <i32> [#uses=1]
        br label %return

return:         ; preds = %then.1, %else.0, %then.0
        %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                            [ %tmp.9, %then.1 ]
        ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination is not transforming this function, because it is
returning n, which fails the isDynamicConstant check in the accumulator
recursion checks.

long long fib(const long long n) {
  switch(n) {
    case 0:
    case 1:
      return n;
    default:
      return fib(n-1) + fib(n-2);
  }
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative.  "return foo() << 1" can be tail recursion eliminated.

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
        %tmp = load i32* %x             ; <i32> [#uses=0]
        %tmp.foo = call i32 @foo( i32* %x )             ; <i32> [#uses=1]
        ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
        %tmp3 = call i32 @foo( i32* %x )                ; <i32> [#uses=1]
        ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

"basicaa" should know how to look through "or" instructions that act like add
instructions.  For example in this code, the x*4+1 is turned into x*4 | 1, and
basicaa can't analyze the array subscript, leading to duplicated loads in the
generated code:

void test(int X, int Y, int a[]) {
  int i;
  for (i = 2; i < 1000; i += 4) {
    a[i+0] = a[i-1+0]*a[i-2+0];
    a[i+1] = a[i-1+1]*a[i-2+1];
    a[i+2] = a[i-1+2]*a[i-2+2];
    a[i+3] = a[i-1+3]*a[i-2+3];
  }
}

BasicAA also doesn't do this for add.  It needs to know that &A[i+1] != &A[i].

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
        subl $28, %esp
        call "L1$pb"
"L1$pb":
        popl %eax
        cmpl $0, 32(%esp)
        je LBB1_2       # cond_true
LBB1_1: # return
        # ...
        addl $28, %esp
        ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations.  On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar.  On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable.  For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:
        pushl %esi
        movl 8(%esp), %eax
        movb _m_HotKey+3, %cl
        movb _m_HotKey+4, %dl
        movb _m_HotKey+2, %ch
        movw _m_HotKey, %si
        movw %si, (%eax)
        movb %ch, 2(%eax)
        movb %cl, 3(%eax)
        movb %dl, 4(%eax)
        popl %esi
        ret $4

GCC produces:

__Z9GetHotKeyv:
        movl _m_HotKey, %edx
        movl 4(%esp), %eax
        movl %edx, (%eax)
        movzwl _m_HotKey+4, %edx
        movw %dx, 4(%eax)
        ret $4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

        %struct.THotKey = type { i16, i8, i8, i8 }
define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
...
        %tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
        %tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
        %tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
        %tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

This GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34043
contains a testcase that compiles down to:

        %struct.XMM128 = type { <4 x float> }
..
        %src = alloca %struct.XMM128
..
        %tmp6263 = bitcast %struct.XMM128* %src to <2 x i64>*
        %tmp65 = getelementptr %struct.XMM128* %src, i32 0, i32 0
        store <2 x i64> %tmp5899, <2 x i64>* %tmp6263, align 16
        %tmp66 = load <4 x float>* %tmp65, align 16
        %tmp71 = add <4 x float> %tmp66, %tmp66

If the mid-level optimizer turned the bitcast of pointer + store of tmp5899
into a bitcast of the vector value and a store to the pointer, then the
store->load could be easily removed.

//===---------------------------------------------------------------------===//

Consider:

int test() {
   long long input[8] = {1,1,1,1,1,1,1,1};
   foo(input);
}

We currently compile this into a memcpy from a global array since the
initializer is fairly large and not memset'able.  This is good, but the memcpy
gets lowered to load/stores in the code generator.  This is also ok, except
that the codegen lowering for memcpy doesn't handle the case when the source
is a constant global.  This gives us atrocious code like this:

        call "L1$pb"
"L1$pb":
        popl %eax
        movl _C.0.1444-"L1$pb"+32(%eax), %ecx
        movl %ecx, 40(%esp)
        movl _C.0.1444-"L1$pb"+20(%eax), %ecx
        movl %ecx, 28(%esp)
        movl _C.0.1444-"L1$pb"+36(%eax), %ecx
        movl %ecx, 44(%esp)
        movl _C.0.1444-"L1$pb"+44(%eax), %ecx
        movl %ecx, 52(%esp)
        movl _C.0.1444-"L1$pb"+40(%eax), %ecx
        movl %ecx, 48(%esp)
        movl _C.0.1444-"L1$pb"+12(%eax), %ecx
        movl %ecx, 20(%esp)
        movl _C.0.1444-"L1$pb"+4(%eax), %ecx
...

instead of:
        movl $1, 16(%esp)
        movl $0, 20(%esp)
        movl $1, 24(%esp)
        movl $0, 28(%esp)
        movl $1, 32(%esp)
        movl $0, 36(%esp)
        ...

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef".  Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop.  One trivial example is:

#include <stdio.h>
int main() {
    int nRet = 17;
    int nLoop;
    for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
        if ( nLoop & 1 )
            nRet += 2;
        else
            nRet -= 1;
    }
    return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size.  The resultant code would then also be suitable for
exit value computation.

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc,
x86, etc.  On X86, we miss a bunch of 'rotate by variable' cases because the
rotate matching code in dag combine doesn't look through truncates
aggressively enough.  Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y){
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x){
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x){
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f2/f3/f4 right.  On x86-32, a few of these
generate truly horrible code, instead of using shld and friends.  On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness.  PPC64 misses f, f5 and f6.  CellSPU aborts in isel.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together.  For
example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy.  This can only
be done safely if "b" isn't modified between the strlen and memcpy of course.

//===---------------------------------------------------------------------===//

We should be able to evaluate this loop:

int test(int x_offs) {
  while (x_offs > 4)
    x_offs -= 4;
  return x_offs;
}

//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

//===---------------------------------------------------------------------===//

We generate a horrible libcall for llvm.powi.  For example, we compile:

#include <cmath>
double f(double a) { return std::pow(a, 4); }

into:

__Z1fd:
        subl $12, %esp
        movsd 16(%esp), %xmm0
        movsd %xmm0, (%esp)
        movl $4, 8(%esp)
        call L___powidf2$stub
        addl $12, %esp
        ret

GCC produces:

__Z1fd:
        subl $12, %esp
        movsd 16(%esp), %xmm0
        mulsd %xmm0, %xmm0
        mulsd %xmm0, %xmm0
        movsd %xmm0, (%esp)
        fldl (%esp)
        addl $12, %esp
        ret

//===---------------------------------------------------------------------===//

We compile this program: (from GCC PR11680)
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%    0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%    0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2).  */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mullo (cheaper).  Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

I think this basically amounts to a dag combine to simplify comparisons
against multiply hi's into a comparison against the mullo.

//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

Instcombine will merge comparisons like (x >= 10) && (x < 20) by producing
(x - 10) u< 10, but only when the comparisons have matching sign.

This could be converted with a similar technique (PR1941):

define i1 @test(i8 %x) {
  %A = icmp uge i8 %x, 5
  %B = icmp slt i8 %x, 20
  %C = and i1 %A, %B
  ret i1 %C
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6    ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0.  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
  if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
    f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 15241:
unsigned int
foo (unsigned int a, unsigned int b)
{
  if (a <= 7 && b <= 7)
    baz ();
}
Should combine to "(a|b) <= 7".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

935//===---------------------------------------------------------------------===//
936
937From GCC Bug 3756:
938int
939pn (int n)
940{
941 return (n >= 0 ? 1 : -1);
942}
943Should combine to (n >> 31) | 1. Currently not optimized with "clang
944-emit-llvm-bc | opt -std-compile-opts | llc".
945
946//===---------------------------------------------------------------------===//
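The branchless form relies on arithmetic right shift of a negative value,
which is implementation-defined in C but is what the IR-level fold produces;
this sketch assumes a 32-bit int with arithmetic shifts:

```c
#include <assert.h>

static int pn(int n) { return n >= 0 ? 1 : -1; }

/* (n >> 31) | 1: the shift smears the sign bit across all 32 bits,
   giving 0 or -1, and or'ing in 1 yields 1 or -1. */
static int pn_branchless(int n) { return (n >> 31) | 1; }
```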

From GCC Bug 28685:
int test(int a, int b)
{
  int lt = a < b;
  int eq = a == b;

  return (lt || eq);
}
Should combine to "a <= b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

void a(int variable)
{
 if (variable == 4 || variable == 6)
   bar();
}
This should optimize to "if ((variable | 2) == 6)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//
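The two values 4 and 6 differ only in bit 1, so or'ing that bit in reduces
both to 6 and the pair of equality tests becomes one (a sketch with
hypothetical names):

```c
#include <assert.h>

static int is_4_or_6(int v) { return v == 4 || v == 6; }

/* 4 and 6 differ only in bit 1; force that bit, then compare once. */
static int is_4_or_6_merged(int v) { return (v | 2) == 6; }
```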

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing.  Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
 return (abs(x)) >= 0;
}
This should optimize to x != INT_MIN.  (With -fwrapv.)  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
 a = (a << 10) | (a >> 22);
 if (a == 123)
   bar ();
}
void
minus_cst (unsigned int a)
{
 unsigned int tem;

 tem = 20 - a;
 if (tem == 5)
   bar ();
}
void
mask_gt (unsigned int a)
{
 /* This is equivalent to a > 15.  */
 if ((a & ~7) > 8)
   bar ();
}
void
rshift_gt (unsigned int a)
{
 /* This is equivalent to a > 23.  */
 if ((a >> 2) > 5)
   bar ();
}
All should simplify to a single comparison.  All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//
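For mask_gt and rshift_gt the single-comparison forms are the ones stated in
the comments; they can be checked at the boundaries (a sketch, hypothetical
names):

```c
#include <assert.h>

/* (a & ~7) > 8 clears the low three bits, so it is true exactly
   when a >= 16, i.e. a > 15. */
static int mask_gt(unsigned a)  { return (a & ~7u) > 8; }
static int mask_gt2(unsigned a) { return a > 15; }

/* (a >> 2) > 5 is true exactly when a >= 24, i.e. a > 23. */
static int rshift_gt(unsigned a)  { return (a >> 2) > 5; }
static int rshift_gt2(unsigned a) { return a > 23; }
```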

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned char* b) {return *b > 99;}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return ((a | 1) & 3) | (a & -4);}
Should combine to "a | 1".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)".  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a".  Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane.  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c).  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -1 : -9;}
Should combine to (x | -9) ^ 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -9 : -1;}
Should combine to x | -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//
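The two forms are exactly equal: 0x88888888 is 0x11111111 << 3, so bit 31 of
a*0x88888888 is bit 28 of a*0x11111111, and the final mask disappears (a
sketch, hypothetical names):

```c
#include <assert.h>

static unsigned nibble_bit(unsigned a)     { return a * 0x11111111u >> 28 & 1; }

/* 0x88888888 == 0x11111111 << 3, so shifting the product right by 31
   reads the same bit and saves the trailing 'and'. */
static unsigned nibble_bit_opt(unsigned a) { return a * 0x88888888u >> 31; }
```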

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//
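The combine is valid because the result is truncated to 32 bits: writing x as
2^32*H + L, the 40*2^31*H term vanishes mod 2^32, and 40*(L>>1) == 20*(L & -2).
A quick check (sketch):

```c
#include <assert.h>

static unsigned f(unsigned long long x)     { return 40 * (x >> 1); }

/* 40*(x>>1) mod 2^32 depends only on the low 32 bits of x, and
   40*(L>>1) == 20*(L & ~1) for unsigned L. */
static unsigned f_opt(unsigned long long x) { return 20 * (((unsigned)x) & ~1u); }
```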

We would like to do the following transform in the instcombiner:

  -X/C -> X/-C

However, this isn't valid if (-X) overflows.  We can implement this when we
have the concept of a "C signed subtraction" operator which is undefined on
overflow.

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

        %tmp = icmp eq i32 %decl_context, 4
        %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
        %tmp1 = icmp eq i32 %decl_context_addr.0, 1
        %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp || decl_context == 1)

This would allow recursive simplifications; tmp1 is used all over the place in
the function, e.g. by:

        %tmp23 = icmp eq i32 %decl_context_addr.1, 0            ; <i1> [#uses=1]
        %tmp24 = xor i1 %tmp1, true             ; <i1> [#uses=1]
        %or.cond8 = and i1 %tmp23, %tmp24               ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

Store sinking: This code:

void f (int n, int *cond, int *res) {
 int i;
 *res = 0;
 for (i = 0; i < n; i++)
   if (*cond)
     *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out.  This gives us this code:

bb:             ; preds = %bb2, %entry
        %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
        %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
        %1 = load i32* %cond, align 4
        %2 = icmp eq i32 %1, 0
        br i1 %2, label %bb2, label %bb1

bb1:            ; preds = %bb
        %3 = xor i32 %.rle, 234
        store i32 %3, i32* %res, align 4
        br label %bb2

bb2:            ; preds = %bb, %bb1
        %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
        %indvar.next = add i32 %i.05, 1
        %exitcond = icmp eq i32 %indvar.next, %n
        br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partially dead store case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.

//===---------------------------------------------------------------------===//

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store.  We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[PHI TRANSLATE GEPs]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic.  This could be handled by GVN with some crazy
symbolic phi translation.  The code we get looks like (g is on the stack):

bb2:            ; preds = %bb1
..
        %9 = getelementptr %struct.f* %g, i32 0, i32 0
        store i32 %8, i32* %9, align 4
        br label %bb3

bb3:            ; preds = %bb1, %bb2, %bb
        %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
        %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
        %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
        %11 = load i32* %10, align 4

%11 is fully redundant, and in BB2 it should have the value %8.

GCC PR33344 is a similar case.

//===---------------------------------------------------------------------===//

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite.  There are also many PRE testcases named ssa-pre-*.c.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite.  For example, predcom-1.c is:

 for (i = 2; i < 1000; i++)
    fib[i] = (fib[i-1] + fib[i - 2]) & 0xffff;

which compiles into:

bb1:            ; preds = %bb1, %bb1.thread
        %indvar = phi i32 [ 0, %bb1.thread ], [ %0, %bb1 ]
        %i.0.reg2mem.0 = add i32 %indvar, 2
        %0 = add i32 %indvar, 1         ; <i32> [#uses=3]
        %1 = getelementptr [1000 x i32]* @fib, i32 0, i32 %0
        %2 = load i32* %1, align 4              ; <i32> [#uses=1]
        %3 = getelementptr [1000 x i32]* @fib, i32 0, i32 %indvar
        %4 = load i32* %3, align 4              ; <i32> [#uses=1]
        %5 = add i32 %4, %2             ; <i32> [#uses=1]
        %6 = and i32 %5, 65535          ; <i32> [#uses=1]
        %7 = getelementptr [1000 x i32]* @fib, i32 0, i32 %i.0.reg2mem.0
        store i32 %6, i32* %7, align 4
        %exitcond = icmp eq i32 %0, 998         ; <i1> [#uses=1]
        br i1 %exitcond, label %return, label %bb1

This is basically:
  LOAD fib[i+1]
  LOAD fib[i]
  STORE fib[i+2]

instead of handling this as a loop or other xform, all we'd need to do is teach
load PRE to phi translate the %0 add (i+1) into the predecessor as (i'+1+1) =
(i'+2) (where i' is the previous iteration of i).  This would find the store
which feeds it.

predcom-2.c is apparently the same as predcom-1.c
predcom-3.c is very similar but needs loads feeding each other instead of
store->load.
predcom-4.c seems the same as the rest.


//===---------------------------------------------------------------------===//

Other simple load PRE cases:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35287 [LPRE crit edge splitting]

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34677 (licm does this, LPRE crit edge)
  llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as | opt -mem2reg -simplifycfg -gvn | llvm-dis

//===---------------------------------------------------------------------===//

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

//===---------------------------------------------------------------------===//

When GVN/PRE finds a store of float* to a must-aliased pointer when expecting
an int*, it should turn it into a bitcast.  This is a nice generalization of
the SROA hack that would apply to other cases, e.g.:

int foo(int C, int *P, float X) {
  if (C) {
    bar();
    *P = 42;
  } else
    *(float*)P = X;

   return *P;
}


One example (that requires crazy phi translation) is:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16799 [BITCAST PHI TRANS]

//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store.  This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
      opt -mem2reg -gvn -instcombine | llvm-dis
we miss it because we need 1) GEP PHI TRAN, 2) CRIT EDGE 3) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, loading from null is undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "") -> strlen(x)
strcspn("", x) -> 0
strspn("", x) -> 0
strspn(x, "") -> 0
strcspn(x, "a") -> strchr(x, 'a')-x, when 'a' is known to occur in x

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character.  See PR3253 for some notes on
codegen.

456.hmmer apparently uses strcspn and strspn a lot.  471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//
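The simple folds in the first group can be checked against the C library
directly (a sketch; the helper name is hypothetical):

```c
#include <assert.h>
#include <string.h>

/* strcspn(x, "a"), when 'a' is known to occur in x, is just the
   distance to the first 'a'. */
static size_t cspn_via_strchr(const char *x, char a) {
  return (size_t)(strchr(x, a) - x);
}
```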

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
  else if (strchr ("<>", *intel_parser.op_string))

Those should be turned into a switch.

//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

        %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)    ; uses = 1
        %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
        call void @llvm.memcpy.i32(i8* %endptr,
               i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
        %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple reasons.  First, in this:

        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)

The strlen could be replaced with: %strlen = sub %3073, %3072, if the strcpy
call were first converted to stpcpy, which returns a pointer to the end of the
string.  Based on that, the endptr GEP just becomes equal to 3073, which
eliminates a strlen call and GEP.

Second, the memcpy+strlen strlen can be replaced with:

        %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

Because the destination was just copied into the specified memory buffer.  This,
in turn, can be constant folded to "4".

In other code, it contains:

        %endptr6978 = bitcast i8* %endptr69 to i32*
        store i32 7107374, i32* %endptr6978, align 1
        %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded.  Whatever is producing this should probably
be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit
        %682 = getelementptr i8** %argv, i32 6          ; <i8**> [#uses=2]
        %683 = load i8** %682, align 4          ; <i8*> [#uses=4]
        %684 = load i8* %683, align 1           ; <i8> [#uses=1]
        %685 = icmp eq i8 %684, 0               ; <i1> [#uses=1]
        br i1 %685, label %bb10, label %bb9

bb9:            ; preds = %bb8
        %686 = call i32 @strlen(i8* %683) nounwind readonly
        %687 = icmp ugt i32 %686, 254           ; <i1> [#uses=1]
        br i1 %687, label %bb10, label %bb11

bb10:           ; preds = %bb9, %bb8
        %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

//===---------------------------------------------------------------------===//

I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
which looks like:
        %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0


bb62:           ; preds = %bb55, %bb53
        %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
        %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
        %172 = add i32 %171, -1         ; <i32> [#uses=1]
        %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172

...  no stores ...
       br i1 %or.cond, label %bb65, label %bb72

bb65:           ; preds = %bb62
        store i8 0, i8* %173, align 1
        br label %bb72

bb72:           ; preds = %bb65, %bb62
        %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
        %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1

Note that on the bb62->bb72 path, the %177 strlen call is partially redundant
with the %171 call.  At worst, we could shove the %177 strlen call up into the
bb65 block, moving it out of the bb62->bb72 path.  However, note that bb65
stores to the string, zeroing out the last byte.  This means that on that path
the value of %177 is actually just %171-1.  A sub is cheaper than a strlen!

This pattern repeats several times, basically doing:

  A = strlen(P);
  P[A-1] = 0;
  B = strlen(P);
  where it is "obvious" that B = A-1.

//===---------------------------------------------------------------------===//
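The claim that B = A-1 holds whenever the string is non-empty, since storing a
zero over the last character shortens it by exactly one byte (a sketch; the
helper is hypothetical):

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if chopping the last byte of s shrinks strlen by exactly 1. */
static int chop_shrinks_by_one(const char *s) {
  char buf[64];
  strcpy(buf, s);                 /* assumes strlen(s) < 64 and s non-empty */
  size_t a = strlen(buf);         /* A = strlen(P)  */
  buf[a - 1] = 0;                 /* P[A-1] = 0     */
  return strlen(buf) == a - 1;    /* B == A-1       */
}
```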

186.crafty contains this interesting pattern:

%77 = call i8* @strstr(i8* getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0),
                       i8* %30)
%phitmp648 = icmp eq i8* %77, getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0)
br i1 %phitmp648, label %bb70, label %bb76

bb70:           ; preds = %OptionMatch.exit91, %bb69
  %78 = call i32 @strlen(i8* %30) nounwind readonly align 1  ; <i32> [#uses=1]

This is basically:
  cststr = "abcdef";
  if (strstr(cststr, P) == cststr) {
     x = strlen(P);
     ...

The strstr call would be significantly cheaper written as:

cststr = "abcdef";
if (!memcmp(P, cststr, strlen(P)))
  x = strlen(P);

This is memcmp+strlen instead of strstr.  This also makes the strlen fully
redundant.

//===---------------------------------------------------------------------===//
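The underlying equivalence is that strstr(cststr, P) == cststr exactly when P
is a prefix of cststr, which a bounded memcmp can answer (a sketch; the memcmp
form also needs strlen(P) <= strlen(cststr) so it never reads past the
constant):

```c
#include <assert.h>
#include <string.h>

static int prefix_via_strstr(const char *cststr, const char *p) {
  return strstr(cststr, p) == cststr;
}

static int prefix_via_memcmp(const char *cststr, const char *p) {
  size_t n = strlen(p);
  /* Guard so memcmp never reads past the constant string. */
  return n <= strlen(cststr) && memcmp(cststr, p, n) == 0;
}
```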

186.crafty also contains this code:

%1906 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1907 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1906
%1908 = call i8* @strcpy(i8* %1907, i8* %1905) nounwind align 1
%1909 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1910 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1909

With the strcpy converted to stpcpy, the last strlen is computable as
1908-@pgn_event, which means 1910=1908.

//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
        i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
       i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call @printf(i8* ... @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

  memcpy(globalarray, "string");
  printf(...,  globalarray);

Anyway, by knowing that printf just reads the memory and forward substituting
the string directly into the printf, this eliminates reads from globalarray.
Since this pattern occurs frequently in crafty (due to the "DisplayTime" and
other similar functions) there are many stores to "out".  Once all the printfs
stop using "out", all that is left is the memcpy's into it.  This should allow
globalopt to remove the "stored only" global.

//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
  %tmp0 = load i8* %p
  %tmp1 = ashr i8 %tmp0, 5
  %tmp2 = sext i8 %tmp1 to i32
  ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load with a shift.
For example, on x86 this currently gets this:

	movb	(%eax), %al
	sarb	$5, %al
	movsbl	%al, %eax

while it could get this:

	movsbl	(%eax), %eax
	sarl	$5, %eax

//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; }     // --> return false
int test2(int x) { return 2-x == x; }    // --> return x == 1 ?

Always foldable for odd constants.  For an even constant C, C-x == x means
2*x == C (mod 2^32), which holds for x == C/2 and also for x == C/2 + 2^31,
so the fold needs to account for both solutions.

//===---------------------------------------------------------------------===//
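A sketch of the rule, done in unsigned arithmetic to mirror the wrapping
subtraction at the IR level: C - x == x means 2*x == C (mod 2^32), which is
unsatisfiable for odd C and has exactly two solutions, C/2 and C/2 + 2^31,
for even C.

```c
#include <assert.h>

/* 1 - x == x has no solution mod 2^32: 2*x is always even. */
static int test1(unsigned x) { return 1u - x == x; }

/* 2 - x == x holds for x == 1 and for x == 1 + 2^31. */
static int test2(unsigned x) { return 2u - x == x; }
```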

PR 3381: GEP to field of size 0 inside a struct could be turned into GEP
for next field in struct (which is at same address).

For example: store of float into { {{}}, float } could be turned into a store to
the float directly.

//===---------------------------------------------------------------------===//

#include <math.h>
double foo(double a) { return sin(a); }

This compiles into this on x86-64 Linux:
foo:
	subq	$8, %rsp
	call	sin
	addq	$8, %rsp
	ret
vs:

foo:
        jmp sin

//===---------------------------------------------------------------------===//

The arg promotion pass should make use of nocapture to make its alias analysis
stuff much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int  int_char(char m) {if(m>7) return 0; return m;}

//===---------------------------------------------------------------------===//

Instcombine should replace the load with a constant in:

  static const char x[4] = {'a', 'b', 'c', 'd'};

  unsigned int y(void) {
    return *(unsigned int *)x;
  }

It currently only does this transformation when the size of the constant
is the same as the size of the integer (so, try x[5]) and the last byte
is a null (making it a C string).  There's no need for these restrictions.

//===---------------------------------------------------------------------===//
1654
Chris Lattnerfd62ccc2009-05-11 17:36:33 +00001655InstCombine's "turn load from constant into constant" optimization should be
1656more aggressive in the presence of bitcasts. For example, because of unions,
1657this code:
1658
1659union vec2d {
1660 double e[2];
1661 double v __attribute__((vector_size(16)));
1662};
1663typedef union vec2d vec2d;
1664
1665static vec2d a={{1,2}}, b={{3,4}};
1666
1667vec2d foo () {
1668 return (vec2d){ .v = a.v + b.v * (vec2d){{5,5}}.v };
1669}
1670
1671Compiles into:
1672
1673@a = internal constant %0 { [2 x double]
1674 [double 1.000000e+00, double 2.000000e+00] }, align 16
1675@b = internal constant %0 { [2 x double]
1676 [double 3.000000e+00, double 4.000000e+00] }, align 16
1677...
1678define void @foo(%struct.vec2d* noalias nocapture sret %agg.result) nounwind {
1679entry:
1680 %0 = load <2 x double>* getelementptr (%struct.vec2d*
1681 bitcast (%0* @a to %struct.vec2d*), i32 0, i32 0), align 16
1682 %1 = load <2 x double>* getelementptr (%struct.vec2d*
1683 bitcast (%0* @b to %struct.vec2d*), i32 0, i32 0), align 16
1684
1685
1686Instcombine should be able to optimize away the loads (and thus the globals).
1687
1688
1689//===---------------------------------------------------------------------===//