Blame - lib/Target/README.txt - fp2-dev/platform/external/llvm

blob: 9dd2b365c03ee8fd18e07add7865ab4735c250c6 [file] [log] [blame]

Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	1	Target Independent Opportunities:
				2
				3	//===---------------------------------------------------------------------===//
				4
				5	With the recent changes to make the implicit def/use set explicit in
				6	machineinstrs, we should change the target descriptions for 'call' instructions
				7	so that the .td files don't list all the call-clobbered registers as implicit
				8	defs. Instead, these should be added by the code generator (e.g. on the dag).
				9
				10	This has a number of uses:
				11
				12	1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
				13	for their different impdef sets.
				14	2. Targets with multiple calling convs (e.g. x86) which have different clobber
				15	sets don't need copies of call instructions.
				16	3. 'Interprocedural register allocation' can be done to reduce the clobber sets
				17	of calls.
				18
				19	//===---------------------------------------------------------------------===//
				20
				21	Make the PPC branch selector target independent
				22
				23	//===---------------------------------------------------------------------===//
				24
				25	Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
Chris Lattner	5ae3f3d	2008-12-10 01:30:48 +0000	[diff] [blame]	26	precision don't matter (ffastmath). Misc/mandel will like this. :) This isn't
				27	safe in general, even on darwin. See the libm implementation of hypot for
				28	examples (which special case when x/y are exactly zero to get signed zeros etc
				29	right).
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	30
				31	//===---------------------------------------------------------------------===//
				32
				33	Solve this DAG isel folding deficiency:
				34
				35	int X, Y;
				36
				37	void fn1(void)
				38	{
				39	X = X | (Y << 3);
				40	}
				41
				42	compiles to
				43
				44	fn1:
				45	movl Y, %eax
				46	shll $3, %eax
				47	orl X, %eax
				48	movl %eax, X
				49	ret
				50
				51	The problem is the store's chain operand is not the load X but rather
				52	a TokenFactor of the load X and load Y, which prevents the folding.
				53
				54	There are two ways to fix this:
				55
				56	1. The dag combiner can start using alias analysis to realize that y/x
				57	don't alias, making the store to X not dependent on the load from Y.
				58	2. The generated isel could be made smarter in the case it can't
				59	disambiguate the pointers.
				60
				61	Number 1 is the preferred solution.
				62
				63	This has been "fixed" by a TableGen hack. But that is a short term workaround
				64	which will be removed once the proper fix is made.
				65
				66	//===---------------------------------------------------------------------===//
				67
				68	On targets with expensive 64-bit multiply, we could LSR this:
				69
				70	for (i = ...; ++i) {
				71	x = 1ULL << i;
				72
				73	into:
				74	long long tmp = 1;
				75	for (i = ...; ++i, tmp+=tmp)
				76	x = tmp;
				77
				78	This would be a win on ppc32, but not x86 or ppc64.
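
A minimal C sketch of the rewrite above, assuming the loop runs i from 0 to n-1 with n <= 64. The function names and the output array are illustrative and do not come from LLVM:

```c
#include <stdint.h>

/* Original form: one variable 64-bit shift per iteration. */
void fill_shift(uint64_t *out, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        out[i] = 1ULL << i;
}

/* Strength-reduced form: the shift becomes a running value that doubles.
   The final tmp += tmp wraps to 0, which is well defined for unsigned types. */
void fill_doubling(uint64_t *out, unsigned n) {
    uint64_t tmp = 1;
    for (unsigned i = 0; i < n; ++i, tmp += tmp)
        out[i] = tmp;
}
```

Both functions fill the array with the same values; only the second avoids the 64-bit shift inside the loop.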
				79
				80	//===---------------------------------------------------------------------===//
				81
				82	Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)
				83
				84	//===---------------------------------------------------------------------===//
				85
				86	Reassociate should turn: X*X*X*X -> t=(X*X) (t*t) to eliminate a multiply.
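
As a rough illustration in C (function names are made up for this sketch, and unsigned arithmetic is used so that overflow wraps instead of being undefined):

```c
#include <stdint.h>

/* Three multiplies, evaluated left to right. */
uint32_t pow4_chain(uint32_t x) {
    return x * x * x * x;
}

/* Two multiplies: square once, then square the square. */
uint32_t pow4_squared(uint32_t x) {
    uint32_t t = x * x;
    return t * t;
}
```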
				87
				88	//===---------------------------------------------------------------------===//
				89
				90	Interesting? testcase for add/shift/mul reassoc:
				91
				92	int bar(int x, int y) {
				93	return x*x*x+y+x*x*x*x*x*y*y*y*y;
				94	}
				95	int foo(int z, int n) {
				96	return bar(z, n) + bar(2*z, 2*n);
				97	}
				98
				99	Reassociate should handle the example in GCC PR16157.
				100
				101	//===---------------------------------------------------------------------===//
				102
				103	These two functions should generate the same code on big-endian systems:
				104
				105	int g(int *j,int *l) { return memcmp(j,l,4); }
				106	int h(int *j, int *l) { return *j - *l; }
				107
				108	this could be done in SelectionDAGISel.cpp, along with other special cases,
				109	for 1,2,4,8 bytes.
				110
				111	//===---------------------------------------------------------------------===//
				112
				113	It would be nice to revert this patch:
				114	http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html
				115
				116	And teach the dag combiner enough to simplify the code expanded before
				117	legalize. It seems plausible that this knowledge would let it simplify other
				118	stuff too.
				119
				120	//===---------------------------------------------------------------------===//
				121
				122	For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
				123	to the type size. It works but can be overly conservative as the alignment of
				124	specific vector types is target dependent.
				125
				126	//===---------------------------------------------------------------------===//
				127
Dan Gohman	71e4f77	2009-05-11 18:51:16 +0000	[diff] [blame]	128	We should produce an unaligned load from code like this:
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	129
				130	v4sf example(float *P) {
				131	return (v4sf){P[0], P[1], P[2], P[3] };
				132	}
				133
				134	//===---------------------------------------------------------------------===//
				135
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	136	Add support for conditional increments, and other related patterns. Instead
				137	of:
				138
				139	movl 136(%esp), %eax
				140	cmpl $0, %eax
				141	je LBB16_2 #cond_next
				142	LBB16_1: #cond_true
				143	incl _foo
				144	LBB16_2: #cond_next
				145
				146	emit:
				147	movl _foo, %eax
				148	cmpl $1, %edi
				149	sbbl $-1, %eax
				150	movl %eax, _foo
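
The same idea at the C level, as a sketch with hypothetical function names. The sbb sequence above computes foo + 1 - carry, where the carry from "cmpl $1, x" is set exactly when x is zero:

```c
/* Branchy form: what the compare-and-jump sequence computes. */
int cond_inc_branch(int foo, int cond) {
    if (cond != 0)
        foo++;
    return foo;
}

/* Branch-free form: add the 0/1 result of the comparison directly. */
int cond_inc_branchless(int foo, int cond) {
    return foo + (cond != 0);
}
```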
				151
				152	//===---------------------------------------------------------------------===//
				153
				154	Combine: a = sin(x), b = cos(x) into a,b = sincos(x).
				155
				156	Expand these to calls of sin/cos and stores:
				157	void sincos(double x, double *sin, double *cos);
				158	void sincosf(float x, float *sin, float *cos);
				159	void sincosl(long double x, long double *sin, long double *cos);
				160
				161	Doing so could allow SROA of the destination pointers. See also:
				162	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687
				163
Chris Lattner	5ae3f3d	2008-12-10 01:30:48 +0000	[diff] [blame]	164	This is now easily doable with MRVs. We could even make an intrinsic for this
				165	if anyone cared enough about sincos.
				166
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	167	//===---------------------------------------------------------------------===//
				168
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	169	Turn this into a single byte store with no load (the other 3 bytes are
				170	unmodified):
				171
Dan Gohman	95df962	2009-05-11 18:04:52 +0000	[diff] [blame]	172	define void @test(i32* %P) {
				173	%tmp = load i32* %P
				174	%tmp14 = or i32 %tmp, 3305111552
				175	%tmp15 = and i32 %tmp14, 3321888767
				176	store i32 %tmp15, i32* %P
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	177	ret void
				178	}
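
A C sketch of why one byte store is enough. The two constants are 0xC5000000 and 0xC5FFFFFF, so the net effect is "top byte = 0xC5". The function names are invented here, and the byte variant assumes a plain little- or big-endian layout, which it probes at run time:

```c
#include <stdint.h>
#include <string.h>

/* Word form, as in the IR above: load, or, and, store. */
void set_top_word(uint32_t *p) {
    uint32_t v = *p;
    v |= 0xC5000000u;   /* 3305111552 */
    v &= 0xC5FFFFFFu;   /* 3321888767 */
    *p = v;
}

/* Byte form: store 0xC5 into the most significant byte, no load needed. */
void set_top_byte(uint32_t *p) {
    uint32_t probe = 0x01000000u;
    unsigned char bytes[4];
    memcpy(bytes, &probe, sizeof probe);
    /* The most significant byte sits at offset 0 on big-endian, 3 on little-endian. */
    size_t top = (bytes[0] == 0x01) ? 0 : 3;
    ((unsigned char *)p)[top] = 0xC5;
}
```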
				179
				180	//===---------------------------------------------------------------------===//
				181
				182	dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.
				183
				184	Compile:
				185
				186	int bar(int x)
				187	{
				188	int t = __builtin_clz(x);
				189	return -(t>>5);
				190	}
				191
				192	to:
				193
				194	_bar: addic r3,r3,-1
				195	subfe r3,r3,r3
				196	blr
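
The fold can be checked with a small C sketch. Note that __builtin_clz(0) is undefined in GCC, so this sketch uses a hand-written clz32 (a hypothetical helper) that returns 32 for zero, which is the behavior the fold assumes:

```c
#include <stdint.h>

/* Count leading zeros of a 32-bit value; defined to return 32 for x == 0. */
int clz32(uint32_t x) {
    int n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* clz32 is in [0, 32], so the shift by 5 yields 1 only when x == 0. */
int bar_clz(uint32_t x) { return -(clz32(x) >> 5); }

/* The folded form: a plain comparison against zero. */
int bar_cmp(uint32_t x) { return -(int)(x == 0); }
```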
				197
				198	//===---------------------------------------------------------------------===//
				199
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	200	quantum_sigma_x in 462.libquantum contains the following loop:
				201
				202	for(i=0; i<reg->size; i++)
				203	{
				204	/* Flip the target bit of each basis state */
				205	reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
				206	}
				207
				208	Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be just
				209	so cool to turn it into something like:
				210
				211	long long Res = ((MAX_UNSIGNED) 1 << target);
				212	if (target < 32) {
				213	for(i=0; i<reg->size; i++)
				214	reg->node[i].state ^= Res & 0xFFFFFFFFULL;
				215	} else {
				216	for(i=0; i<reg->size; i++)
				217	reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
				218	}
				219
				220	... which would only do one 32-bit XOR per loop iteration instead of two.
				221
				222	It would also be nice to recognize the reg->size doesn't alias reg->node[i], but
				223	alas...
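
A sketch of the unswitched loop with the 64-bit state modeled as two explicit 32-bit halves, the way a 32-bit target would hold it. The struct and function names are hypothetical, and target is assumed to be below 64:

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in for a 64-bit state held as two 32-bit halves. */
struct state64 { uint32_t lo, hi; };

/* Straightforward form: XOR both halves on every iteration. */
void flip_both(struct state64 *node, size_t size, unsigned target) {
    uint64_t res = (uint64_t)1 << target;
    for (size_t i = 0; i < size; i++) {
        node[i].lo ^= (uint32_t)res;
        node[i].hi ^= (uint32_t)(res >> 32);
    }
}

/* Unswitched form: choose the half once, then do one XOR per iteration. */
void flip_one(struct state64 *node, size_t size, unsigned target) {
    if (target < 32) {
        uint32_t bit = (uint32_t)1 << target;
        for (size_t i = 0; i < size; i++)
            node[i].lo ^= bit;
    } else {
        uint32_t bit = (uint32_t)1 << (target - 32);
        for (size_t i = 0; i < size; i++)
            node[i].hi ^= bit;
    }
}
```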
				224
				225	//===---------------------------------------------------------------------===//
				226
Chris Lattner	909c72c	2008-10-05 02:16:12 +0000	[diff] [blame]	227	This isn't recognized as bswap by instcombine (yes, it really is bswap):
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	228
				229	unsigned long reverse(unsigned v) {
				230	unsigned t;
				231	t = v ^ ((v << 16) | (v >> 16));
				232	t &= ~0xff0000;
				233	v = (v << 24) | (v >> 8);
				234	return v ^ (t >> 8);
				235	}
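
To see that the idiom really is a byte swap, it can be compared against an explicit byte-by-byte version. This is only a checking sketch; both function names are invented, and a fixed-width type replaces "unsigned":

```c
#include <stdint.h>

/* The idiom from the text, written with a fixed-width type. */
uint32_t reverse_idiom(uint32_t v) {
    uint32_t t = v ^ ((v << 16) | (v >> 16));
    t &= ~(uint32_t)0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}

/* Reference byte swap built from explicit byte moves. */
uint32_t bswap_ref(uint32_t v) {
    return (v << 24) | ((v & 0xff00u) << 8) | ((v >> 8) & 0xff00u) | (v >> 24);
}
```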
				236
				237	//===---------------------------------------------------------------------===//
				238
Chris Lattner	200ca61	2008-10-15 16:02:15 +0000	[diff] [blame]	239	These idioms should be recognized as popcount (see PR1488):
				240
				241	unsigned countbits_slow(unsigned v) {
				242	unsigned c;
				243	for (c = 0; v; v >>= 1)
				244	c += v & 1;
				245	return c;
				246	}
				247	unsigned countbits_fast(unsigned v){
				248	unsigned c;
				249	for (c = 0; v; c++)
				250	v &= v - 1; // clear the least significant bit set
				251	return c;
				252	}
				253
				254	typedef unsigned long long BITBOARD;
				255	int PopCnt(register BITBOARD a) {
				256	register int c=0;
				257	while(a) {
				258	c++;
				259	a &= a - 1;
				260	}
				261	return c;
				262	}
				263	unsigned int popcount(unsigned int input) {
				264	unsigned int count = 0;
				265	for (unsigned int i = 0; i < 4 * 8; i++)
				266	count += (input >> i) & 1;
				267	return count;
				268	}
				269
				270	//===---------------------------------------------------------------------===//
				271
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	272	These should turn into single 16-bit (unaligned?) loads on little/big endian
				273	processors.
				274
				275	unsigned short read_16_le(const unsigned char *adr) {
				276	return adr[0] | (adr[1] << 8);
				277	}
				278	unsigned short read_16_be(const unsigned char *adr) {
				279	return (adr[0] << 8) | adr[1];
				280	}
				281
				282	//===---------------------------------------------------------------------===//
				283
				284	-instcombine should handle this transform:
				285	icmp pred (sdiv X / C1 ), C2
				286	when X, C1, and C2 are unsigned. Similarly for udiv and signed operands.
				287
				288	Currently InstCombine avoids this transform but will do it when the signs of
				289	the operands and the sign of the divide match. See the FIXME in
				290	InstructionCombining.cpp in the visitSetCondInst method after the switch case
				291	for Instruction::UDiv (around line 4447) for more details.
				292
				293	The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
				294	this construct.
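
One instance of this kind of fold, sketched in C with constants picked purely for illustration (C1 = 10, C2 = 3): comparing the quotient for equality is the same as a range check on X, which needs no division.

```c
/* Division form: compare the quotient against a constant. */
int quotient_is_3(int x) {
    return x / 10 == 3;
}

/* Range form: x / 10 == 3 exactly when 30 <= x <= 39, which becomes a
   single unsigned comparison after subtracting the lower bound. */
int in_range_30_39(int x) {
    return (unsigned)x - 30u < 10u;
}
```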
				295
				296	//===---------------------------------------------------------------------===//
				297
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	298	viterbi speeds up significantly if the various "history" related copy loops
				299	are turned into memcpy calls at the source level. We need a "loops to memcpy"
				300	pass.
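
A sketch of the kind of rewrite such a pass would perform. The function names are hypothetical, and the memcpy form is only valid when the two ranges do not overlap, which the pass would have to prove:

```c
#include <string.h>
#include <stddef.h>

/* Element-by-element copy loop of the kind described above. */
void copy_loop(int *dst, const int *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Equivalent library call for non-overlapping dst and src. */
void copy_memcpy(int *dst, const int *src, size_t n) {
    memcpy(dst, src, n * sizeof *src);
}
```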
				301
				302	//===---------------------------------------------------------------------===//
				303
				304	Consider:
				305
				306	typedef unsigned U32;
				307	typedef unsigned long long U64;
				308	int test (U32 *inst, U64 *regs) {
				309	U64 effective_addr2;
				310	U32 temp = *inst;
				311	int r1 = (temp >> 20) & 0xf;
				312	int b2 = (temp >> 16) & 0xf;
				313	effective_addr2 = temp & 0xfff;
				314	if (b2) effective_addr2 += regs[b2];
				315	b2 = (temp >> 12) & 0xf;
				316	if (b2) effective_addr2 += regs[b2];
				317	effective_addr2 &= regs[4];
				318	if ((effective_addr2 & 3) == 0)
				319	return 1;
				320	return 0;
				321	}
				322
				323	Note that only the low 2 bits of effective_addr2 are used. On 32-bit systems,
				324	we don't eliminate the computation of the top half of effective_addr2 because
				325	we don't have whole-function selection dags. On x86, this means we use one
				326	extra register for the function when effective_addr2 is declared as U64 than
				327	when it is declared U32.
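
The claim that the top half is dead can be checked with a sketch that runs the same computation at both widths. The function names are invented, and the instruction word is passed by value for brevity. Addition and AND never move information from high bits to low bits, so truncating every operand first gives the same low 2 bits:

```c
#include <stdint.h>

/* 64-bit form, following the shape of the example above. */
int aligned64(uint32_t temp, const uint64_t *regs) {
    uint64_t ea = temp & 0xfff;
    int b2 = (temp >> 16) & 0xf;
    if (b2) ea += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) ea += regs[b2];
    ea &= regs[4];
    return (ea & 3) == 0;
}

/* 32-bit form: every 64-bit operand is truncated before use. */
int aligned32(uint32_t temp, const uint64_t *regs) {
    uint32_t ea = temp & 0xfff;
    int b2 = (temp >> 16) & 0xf;
    if (b2) ea += (uint32_t)regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) ea += (uint32_t)regs[b2];
    ea &= (uint32_t)regs[4];
    return (ea & 3) == 0;
}
```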
				328
				329	//===---------------------------------------------------------------------===//
				330
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	331	LSR should know what GPR types a target has. This code:
				332
				333	volatile short X, Y; // globals
				334
				335	void foo(int N) {
				336	int i;
				337	for (i = 0; i < N; i++) { X = i; Y = i*4; }
				338	}
				339
				340	produces two identical IV's (after promotion) on PPC/ARM:
				341
				342	LBB1_1: @bb.preheader
				343	mov r3, #0
				344	mov r2, r3
				345	mov r1, r3
				346	LBB1_2: @bb
				347	ldr r12, LCPI1_0
				348	ldr r12, [r12]
				349	strh r2, [r12]
				350	ldr r12, LCPI1_1
				351	ldr r12, [r12]
				352	strh r3, [r12]
				353	add r1, r1, #1 <- [0,+,1]
				354	add r3, r3, #4
				355	add r2, r2, #1 <- [0,+,1]
				356	cmp r1, r0
				357	bne LBB1_2 @bb
				358
				359
				360	//===---------------------------------------------------------------------===//
				361
				362	Tail call elim should be more aggressive, checking to see if the call is
				363	followed by an uncond branch to an exit block.
				364
				365	; This testcase is due to tail-duplication not wanting to copy the return
				366	; instruction into the terminating blocks because there was other code
				367	; optimized out of the function after the taildup happened.
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	368	; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	369
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	370	define i32 @t4(i32 %a) {
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	371	entry:
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	372	%tmp.1 = and i32 %a, 1 ; <i32> [#uses=1]
				373	%tmp.2 = icmp ne i32 %tmp.1, 0 ; <i1> [#uses=1]
				374	br i1 %tmp.2, label %then.0, label %else.0
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	375
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	376	then.0: ; preds = %entry
				377	%tmp.5 = add i32 %a, -1 ; <i32> [#uses=1]
				378	%tmp.3 = call i32 @t4( i32 %tmp.5 ) ; <i32> [#uses=1]
				379	br label %return
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	380
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	381	else.0: ; preds = %entry
				382	%tmp.7 = icmp ne i32 %a, 0 ; <i1> [#uses=1]
				383	br i1 %tmp.7, label %then.1, label %return
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	384
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	385	then.1: ; preds = %else.0
				386	%tmp.11 = add i32 %a, -2 ; <i32> [#uses=1]
				387	%tmp.9 = call i32 @t4( i32 %tmp.11 ) ; <i32> [#uses=1]
				388	br label %return
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	389
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	390	return: ; preds = %then.1, %else.0, %then.0
				391	%result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	392	[ %tmp.9, %then.1 ]
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	393	ret i32 %result.0
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	394	}
				395
				396	//===---------------------------------------------------------------------===//
				397
Chris Lattner	00159fc	2007-10-03 06:10:59 +0000	[diff] [blame]	398	Tail recursion elimination is not transforming this function, because it is
				399	returning n, which fails the isDynamicConstant check in the accumulator
				400	recursion checks.
				401
				402	long long fib(const long long n) {
				403	switch(n) {
				404	case 0:
				405	case 1:
				406	return n;
				407	default:
				408	return fib(n-1) + fib(n-2);
				409	}
				410	}
				411
				412	//===---------------------------------------------------------------------===//
				413
Chris Lattner	7be8ac2	2008-08-10 00:47:21 +0000	[diff] [blame]	414	Tail recursion elimination should handle:
				415
				416	int pow2m1(int n) {
				417	if (n == 0)
				418	return 0;
				419	return 2 * pow2m1 (n - 1) + 1;
				420	}
				421
				422	Also, multiplies can be turned into SHL's, so they should be handled as if
				423	they were associative. "return foo() << 1" can be tail recursion eliminated.
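
One way to picture the transformed code is with two accumulators, one multiplicative and one additive. This is a hand-written sketch of the idea, not the output of the pass; the names are invented and n is assumed to be non-negative:

```c
/* Recursive form from the text: the call is wrapped in "2 * ... + 1". */
int pow2m1_rec(int n) {
    if (n == 0)
        return 0;
    return 2 * pow2m1_rec(n - 1) + 1;
}

/* Loop form. Invariant: the final answer equals mul * pow2m1(n) + add. */
int pow2m1_loop(int n) {
    int mul = 1, add = 0;
    while (n != 0) {
        add += mul;   /* the "+ 1", scaled by the pending multiplier */
        mul *= 2;     /* the "2 *" (could equally be a shift left by 1) */
        n--;
    }
    return mul * 0 + add;   /* the base case contributes 0 */
}
```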
				424
				425	//===---------------------------------------------------------------------===//
				426
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	427	Argument promotion should promote arguments for recursive functions, like
				428	this:
				429
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	430	; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	431
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	432	define internal i32 @foo(i32* %x) {
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	433	entry:
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	434	%tmp = load i32* %x ; <i32> [#uses=0]
				435	%tmp.foo = call i32 @foo( i32* %x ) ; <i32> [#uses=1]
				436	ret i32 %tmp.foo
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	437	}
				438
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	439	define i32 @bar(i32* %x) {
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	440	entry:
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	441	%tmp3 = call i32 @foo( i32* %x ) ; <i32> [#uses=1]
				442	ret i32 %tmp3
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	443	}
				444
Chris Lattner	421a733	2007-12-05 23:05:06 +0000	[diff] [blame]	445	//===---------------------------------------------------------------------===//
Chris Lattner	072ab75	2007-12-28 04:42:05 +0000	[diff] [blame]	446
				447	"basicaa" should know how to look through "or" instructions that act like add
				448	instructions. For example in this code, the x*4+1 is turned into x*4 | 1, and
				449	basicaa can't analyze the array subscript, leading to duplicated loads in the
				450	generated code:
				451
				452	void test(int X, int Y, int a[]) {
				453	int i;
				454	for (i=2; i<1000; i+=4) {
				455	a[i+0] = a[i-1+0]*a[i-2+0];
				456	a[i+1] = a[i-1+1]*a[i-2+1];
				457	a[i+2] = a[i-1+2]*a[i-2+2];
				458	a[i+3] = a[i-1+3]*a[i-2+3];
				459	}
				460	}
				461
Chris Lattner	5ae3f3d	2008-12-10 01:30:48 +0000	[diff] [blame]	462	BasicAA also doesn't do this for add. It needs to know that &A[i+1] != &A[i].
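
The reason the "or" acts like an add: x*4 always has its two low bits clear, so setting bit 0 can never carry. A small sketch with made-up function names:

```c
#include <stdint.h>

/* Index computed with an add. */
uint32_t index_add(uint32_t x) {
    return x * 4 + 1;
}

/* Same index computed with an or; valid because (x * 4) & 3 == 0. */
uint32_t index_or(uint32_t x) {
    return (x * 4) | 1;
}
```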
				463
Chris Lattner	fe7fe91	2007-12-28 22:30:05 +0000	[diff] [blame]	464	//===---------------------------------------------------------------------===//
Chris Lattner	072ab75	2007-12-28 04:42:05 +0000	[diff] [blame]	465
Chris Lattner	fe7fe91	2007-12-28 22:30:05 +0000	[diff] [blame]	466	We should investigate an instruction sinking pass. Consider this silly
				467	example in pic mode:
				468
				469	#include <assert.h>
				470	void foo(int x) {
				471	assert(x);
				472	//...
				473	}
				474
				475	we compile this to:
				476	_foo:
				477	subl $28, %esp
				478	call "L1$pb"
				479	"L1$pb":
				480	popl %eax
				481	cmpl $0, 32(%esp)
				482	je LBB1_2 # cond_true
				483	LBB1_1: # return
				484	# ...
				485	addl $28, %esp
				486	ret
				487	LBB1_2: # cond_true
				488	...
				489
				490	The PIC base computation (call+popl) is only used on one path through the
				491	code, but is currently always computed in the entry block. It would be
				492	better to sink the picbase computation down into the block for the
				493	assertion, as it is the only one that uses it. This happens for a lot of
				494	code with early outs.
				495
Chris Lattner	be9fe9d	2007-12-29 01:05:01 +0000	[diff] [blame]	496	Another example is loads of arguments, which are usually emitted into the
				497	entry block on targets like x86. If not used in all paths through a
				498	function, they should be sunk into the ones that do.
				499
Chris Lattner	fe7fe91	2007-12-28 22:30:05 +0000	[diff] [blame]	500	In this case, whole-function-isel would also handle this.
Chris Lattner	072ab75	2007-12-28 04:42:05 +0000	[diff] [blame]	501
				502	//===---------------------------------------------------------------------===//
Chris Lattner	8551f19	2008-01-07 21:38:14 +0000	[diff] [blame]	503
				504	Investigate lowering of sparse switch statements into perfect hash tables:
				505	http://burtleburtle.net/bob/hash/perfect.html
				506
				507	//===---------------------------------------------------------------------===//
Chris Lattner	dc089f0	2008-01-09 00:17:57 +0000	[diff] [blame]	508
				509	We should turn things like "load+fabs+store" and "load+fneg+store" into the
				510	corresponding integer operations. On a yonah, this loop:
				511
				512	double a[256];
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	513	void foo() {
				514	int i, b;
				515	for (b = 0; b < 10000000; b++)
				516	for (i = 0; i < 256; i++)
				517	a[i] = -a[i];
				518	}
Chris Lattner	dc089f0	2008-01-09 00:17:57 +0000	[diff] [blame]	519
				520	is twice as slow as this loop:
				521
				522	long long a[256];
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	523	void foo() {
				524	int i, b;
				525	for (b = 0; b < 10000000; b++)
				526	for (i = 0; i < 256; i++)
				527	a[i] ^= (1ULL << 63);
				528	}
Chris Lattner	dc089f0	2008-01-09 00:17:57 +0000	[diff] [blame]	529
				530	and I suspect other processors are similar. On X86 in particular this is a
				531	big win because doing this with integers allows the use of read/modify/write
				532	instructions.
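
A sketch of the fneg case at the C level, assuming IEEE-754 binary64 doubles. The function names are hypothetical; memcpy is used for the bit reinterpretation to stay within defined behavior:

```c
#include <stdint.h>
#include <string.h>

/* Floating-point negate. */
double neg_fp(double x) {
    return -x;
}

/* Integer negate: flip bit 63 of the representation. */
double neg_int(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits ^= (uint64_t)1 << 63;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

fabs would be the same shape with the sign bit cleared instead of flipped.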
				533
				534	//===---------------------------------------------------------------------===//
Chris Lattner	a909d31	2008-01-10 18:25:41 +0000	[diff] [blame]	535
				536	DAG Combiner should try to combine small loads into larger loads when
				537	profitable. For example, we compile this C++ example:
				538
				539	struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
				540	extern THotKey m_HotKey;
				541	THotKey GetHotKey () { return m_HotKey; }
				542
				543	into (-O3 -fno-exceptions -static -fomit-frame-pointer):
				544
				545	__Z9GetHotKeyv:
				546	pushl %esi
				547	movl 8(%esp), %eax
				548	movb _m_HotKey+3, %cl
				549	movb _m_HotKey+4, %dl
				550	movb _m_HotKey+2, %ch
				551	movw _m_HotKey, %si
				552	movw %si, (%eax)
				553	movb %ch, 2(%eax)
				554	movb %cl, 3(%eax)
				555	movb %dl, 4(%eax)
				556	popl %esi
				557	ret $4
				558
				559	GCC produces:
				560
				561	__Z9GetHotKeyv:
				562	movl _m_HotKey, %edx
				563	movl 4(%esp), %eax
				564	movl %edx, (%eax)
				565	movzwl _m_HotKey+4, %edx
				566	movw %dx, 4(%eax)
				567	ret $4
				568
				569	The LLVM IR contains the needed alignment info, so we should be able to
				570	merge the loads and stores into 4-byte loads:
				571
				572	%struct.THotKey = type { i16, i8, i8, i8 }
				573	define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
				574	...
				575	%tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
				576	%tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
				577	%tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
				578	%tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2
				579
				580	Alternatively, we should use a small amount of base-offset alias analysis
				581	to make it so the scheduler doesn't need to hold all the loads in regs at
				582	once.
				583
				584	//===---------------------------------------------------------------------===//
Chris Lattner	71538b5	2008-01-11 06:17:47 +0000	[diff] [blame]	585
Nate Begeman	a3c7ba3	2008-02-18 18:39:23 +0000	[diff] [blame]	586	We should add an FRINT node to the DAG to model targets that have legal
				587	implementations of ceil/floor/rint.
Chris Lattner	a672d3d	2008-02-28 05:34:27 +0000	[diff] [blame]	588
				589	//===---------------------------------------------------------------------===//
				590
Chris Lattner	61075f3	2008-02-28 17:21:27 +0000	[diff] [blame]	591	This GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34043
				592	contains a testcase that compiles down to:
				593
				594	%struct.XMM128 = type { <4 x float> }
				595	..
				596	%src = alloca %struct.XMM128
				597	..
				598	%tmp6263 = bitcast %struct.XMM128* %src to <2 x i64>*
				599	%tmp65 = getelementptr %struct.XMM128* %src, i32 0, i32 0
				600	store <2 x i64> %tmp5899, <2 x i64>* %tmp6263, align 16
				601	%tmp66 = load <4 x float>* %tmp65, align 16
				602	%tmp71 = add <4 x float> %tmp66, %tmp66
				603
				604	If the mid-level optimizer turned the bitcast of pointer + store of tmp5899
				605	into a bitcast of the vector value and a store to the pointer, then the
				606	store->load could be easily removed.
				607
				608	//===---------------------------------------------------------------------===//
				609
Chris Lattner	a672d3d	2008-02-28 05:34:27 +0000	[diff] [blame]	610	Consider:
				611
				612	int test() {
				613	long long input[8] = {1,1,1,1,1,1,1,1};
				614	foo(input);
				615	}
				616
				617	We currently compile this into a memcpy from a global array since the
				618	initializer is fairly large and not memset'able. This is good, but the memcpy
				619	gets lowered to load/stores in the code generator. This is also ok, except
				620	that the codegen lowering for memcpy doesn't handle the case when the source
				621	is a constant global. This gives us atrocious code like this:
				622
				623	call "L1$pb"
				624	"L1$pb":
				625	popl %eax
				626	movl _C.0.1444-"L1$pb"+32(%eax), %ecx
				627	movl %ecx, 40(%esp)
				628	movl _C.0.1444-"L1$pb"+20(%eax), %ecx
				629	movl %ecx, 28(%esp)
				630	movl _C.0.1444-"L1$pb"+36(%eax), %ecx
				631	movl %ecx, 44(%esp)
				632	movl _C.0.1444-"L1$pb"+44(%eax), %ecx
				633	movl %ecx, 52(%esp)
				634	movl _C.0.1444-"L1$pb"+40(%eax), %ecx
				635	movl %ecx, 48(%esp)
				636	movl _C.0.1444-"L1$pb"+12(%eax), %ecx
				637	movl %ecx, 20(%esp)
				638	movl _C.0.1444-"L1$pb"+4(%eax), %ecx
				639	...
				640
				641	instead of:
				642	movl $1, 16(%esp)
				643	movl $0, 20(%esp)
				644	movl $1, 24(%esp)
				645	movl $0, 28(%esp)
				646	movl $1, 32(%esp)
				647	movl $0, 36(%esp)
				648	...
				649
				650	//===---------------------------------------------------------------------===//
Chris Lattner	a709d33	2008-03-02 02:51:40 +0000	[diff] [blame]	651
				652	http://llvm.org/PR717:
				653
				654	The following code should compile into "ret int undef". Instead, LLVM
				655	produces "ret int 0":
				656
				657	int f() {
				658	int x = 4;
				659	int y;
				660	if (x == 3) y = 0;
				661	return y;
				662	}
				663
				664	//===---------------------------------------------------------------------===//
Chris Lattner	e9c6844	2008-03-02 19:29:42 +0000	[diff] [blame]	665
				666	The loop unroller should partially unroll loops (instead of peeling them)
				667	when code growth isn't too bad and when an unroll count allows simplification
				668	of some code within the loop. One trivial example is:
				669
				670	#include <stdio.h>
				671	int main() {
				672	int nRet = 17;
				673	int nLoop;
				674	for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
				675	if ( nLoop & 1 )
				676	nRet += 2;
				677	else
				678	nRet -= 1;
				679	}
				680	return nRet;
				681	}
				682
				683	Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
				684	reduction in code size. The resultant code would then also be suitable for
				685	exit value computation.
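
Written out by hand, the unrolled loop would look roughly like the second function below (names invented, main replaced by plain functions). Each pair of iterations adds a net +1, so the exit value is 17 + 500:

```c
/* Original loop from the text. */
int loop_original(void) {
    int nRet = 17;
    for (int nLoop = 0; nLoop < 1000; nLoop++) {
        if (nLoop & 1)
            nRet += 2;
        else
            nRet -= 1;
    }
    return nRet;
}

/* Unrolled by 2: nLoop is even in the first copy and odd in the second,
   so both '&1' tests fold away. */
int loop_unrolled(void) {
    int nRet = 17;
    for (int nLoop = 0; nLoop < 1000; nLoop += 2) {
        nRet -= 1;   /* even iteration */
        nRet += 2;   /* odd iteration */
    }
    return nRet;
}
```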
				686
				687	//===---------------------------------------------------------------------===//
Chris Lattner	962a7d5	2008-03-17 01:47:51 +0000	[diff] [blame]	688
				689	We miss a bunch of rotate opportunities on various targets, including ppc, x86,
				690	etc. On X86, we miss a bunch of 'rotate by variable' cases because the rotate
				691	matching code in dag combine doesn't look through truncates aggressively
				692	enough. Here are some testcases reduced from GCC PR17886:
				693
				694	unsigned long long f(unsigned long long x, int y) {
				695	return (x << y) | (x >> 64-y);
				696	}
				697	unsigned f2(unsigned x, int y){
				698	return (x << y) | (x >> 32-y);
				699	}
				700	unsigned long long f3(unsigned long long x){
				701	int y = 9;
				702	return (x << y) | (x >> 64-y);
				703	}
				704	unsigned f4(unsigned x){
				705	int y = 10;
				706	return (x << y) | (x >> 32-y);
				707	}
				708	unsigned long long f5(unsigned long long x, unsigned long long y) {
				709	return (x << 8) | ((y >> 48) & 0xffull);
				710	}
				711	unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
				712	switch(z) {
				713	case 1:
				714	return (x << 8) | ((y >> 48) & 0xffull);
				715	case 2:
				716	return (x << 16) | ((y >> 40) & 0xffffull);
				717	case 3:
				718	return (x << 24) | ((y >> 32) & 0xffffffull);
				719	case 4:
				720	return (x << 32) | ((y >> 24) & 0xffffffffull);
				721	default:
				722	return (x << 40) | ((y >> 16) & 0xffffffffffull);
				723	}
				724	}
				725
Dan Gohman	2896a4e	2008-10-17 21:39:27 +0000	[diff] [blame]	726	On X86-64, we only handle f2/f3/f4 right. On x86-32, a few of these
Chris Lattner	962a7d5	2008-03-17 01:47:51 +0000	[diff] [blame]	727	generate truly horrible code, instead of using shld and friends. On
				728	ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
				729	badness. PPC64 misses f, f5 and f6. CellSPU aborts in isel.
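
For reference, the testcases f and f2 shift by the full bit width when y is 0, which is undefined in C. A commonly used variant masks both counts so every shift amount stays in range. The sketch below is that masked form, not one of the PR17886 testcases, and the names are invented:

```c
#include <stdint.h>

/* Rotate left by y, with both shift counts masked to 0..31. */
uint32_t rotl32(uint32_t x, unsigned y) {
    return (x << (y & 31)) | (x >> (-y & 31));
}

/* 64-bit version of the same idiom, counts masked to 0..63. */
uint64_t rotl64(uint64_t x, unsigned y) {
    return (x << (y & 63)) | (x >> (-y & 63));
}
```

Whether a given backend matches this form to a single rotate instruction is a separate question from the cases listed above.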
				730
				731	//===---------------------------------------------------------------------===//
Chris Lattner	38c4a15	2008-03-20 04:46:13 +0000	[diff] [blame]	732
				733	We do a number of simplifications in simplify libcalls to strength reduce
				734	standard library functions, but we don't currently merge them together. For
				735	example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy. This can only
				736	be done safely if "b" isn't modified between the strlen and memcpy of course.
				737
				738	//===---------------------------------------------------------------------===//
				739
Chris Lattner	40a6864	2008-07-14 00:19:59 +0000	[diff] [blame]	740	Reassociate should turn things like:
				741
				742	int factorial(int X) {
				743	return X*X*X*X*X*X*X*X;
				744	}
				745
				746	into llvm.powi calls, allowing the code generator to produce balanced
				747	multiplication trees.
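
What a balanced tree buys can be sketched with integer exponentiation by squaring. This is only an illustration of the multiplication count, with invented names and unsigned arithmetic so that overflow wraps; it is not how llvm.powi is lowered:

```c
#include <stdint.h>

/* Linear chain: n multiplies, each depending on the previous one. */
uint32_t pow_chain(uint32_t x, unsigned n) {
    uint32_t r = 1;
    for (unsigned i = 0; i < n; i++)
        r *= x;
    return r;
}

/* Square-and-multiply: one squaring per bit of n, plus one multiply
   for every set bit. */
uint32_t pow_squaring(uint32_t x, unsigned n) {
    uint32_t r = 1;
    while (n) {
        if (n & 1)
            r *= x;
        x *= x;
        n >>= 1;
    }
    return r;
}
```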
				748
				749	//===---------------------------------------------------------------------===//
				750
Chris Lattner	16e192f	2008-08-10 01:14:08 +0000	[diff] [blame]	751	We generate a horrible libcall for llvm.powi. For example, we compile:
				752
				753	#include <cmath>
				754	double f(double a) { return std::pow(a, 4); }
				755
				756	into:
				757
				758	__Z1fd:
				759	subl $12, %esp
				760	movsd 16(%esp), %xmm0
				761	movsd %xmm0, (%esp)
				762	movl $4, 8(%esp)
				763	call L___powidf2$stub
				764	addl $12, %esp
				765	ret
				766
				767	GCC produces:
				768
				769	__Z1fd:
				770	subl $12, %esp
				771	movsd 16(%esp), %xmm0
				772	mulsd %xmm0, %xmm0
				773	mulsd %xmm0, %xmm0
				774	movsd %xmm0, (%esp)
				775	fldl (%esp)
				776	addl $12, %esp
				777	ret
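
GCC's output above is just two squarings. As a rough sketch (powi_sketch is a
hypothetical name, and this is not claimed to be what LLVM or libgcc does),
square-and-multiply handles any integer exponent in O(log n) multiplies, which
is the shape an inline expansion for a constant exponent would unroll:

```c
/* Hypothetical helper: computes a^n by repeated squaring. */
double powi_sketch(double a, int n) {
    /* Negate in unsigned arithmetic so INT_MIN does not overflow. */
    unsigned e = n < 0 ? 0u - (unsigned)n : (unsigned)n;
    double result = 1.0;
    while (e != 0) {
        if (e & 1u)
            result *= a;   /* this bit of the exponent is set */
        a *= a;            /* a, a^2, a^4, ... */
        e >>= 1;
    }
    return n < 0 ? 1.0 / result : result;
}
```

With n fixed at 4, unrolling the loop and folding the trivial operations should
leave only the two squarings.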
				778
				779	//===---------------------------------------------------------------------===//
				780
				781	We compile this program: (from GCC PR11680)
				782	http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487
				783
				784	Into code that runs the same speed in fast/slow modes, but both modes run 2x
				785	slower than when compiled with GCC (either 4.0 or 4.2):
				786
				787	$ llvm-g++ perf.cpp -O3 -fno-exceptions
				788	$ time ./a.out fast
				789	1.821u 0.003s 0:01.82 100.0% 0+0k 0+0io 0pf+0w
				790
				791	$ g++ perf.cpp -O3 -fno-exceptions
				792	$ time ./a.out fast
				793	0.821u 0.001s 0:00.82 100.0% 0+0k 0+0io 0pf+0w
				794
				795	It looks like we are making the same inlining decisions, so this may be raw
				796	codegen badness or something else (haven't investigated).
				797
				798	//===---------------------------------------------------------------------===//
				799
				800	We miss some instcombines for stuff like this:
				801	void bar (void);
				802	void foo (unsigned int a) {
				803	/* This one is equivalent to a >= (3 << 2). */
				804	if ((a >> 2) >= 3)
				805	bar ();
				806	}
				807
				808	A few other related ones are in GCC PR14753.
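
The fold is valid because, for unsigned a, (a >> 2) >= 3 means floor(a/4) >= 3,
which holds exactly when a >= 12. A small sketch of the two forms (function
names are illustrative only):

```c
/* Form as written in the testcase: shift, then compare. */
int cmp_shifted(unsigned a) { return (a >> 2) >= 3; }

/* Folded form: a single compare against the shifted constant. */
int cmp_folded(unsigned a) { return a >= (3u << 2); }
```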
				809
				810	//===---------------------------------------------------------------------===//
				811
				812	Divisibility by constant can be simplified (according to GCC PR12849) from
				813	being a mulhi to being a mul lo (cheaper). Testcase:
				814
				815	void bar(unsigned n) {
				816	if (n % 3 == 0)
				817	true();
				818	}
				819
				820	I think this basically amounts to a dag combine to simplify comparisons against
				821	multiply hi's into a comparison against the mullo.
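
One way to see the mullo form for this testcase, as a sketch (function names are
hypothetical; the constants are specific to 32-bit unsigned values and the
divisor 3). 0xAAAAAAAB is the inverse of 3 modulo 2^32, so multiplying by it
maps the multiples of 3 onto 0..0x55555555 and every other value above that
range:

```c
#include <stdint.h>

/* Reference form: remainder compare. */
int divisible_by_3_rem(uint32_t n) { return n % 3u == 0; }

/* Low-half multiply plus one unsigned compare. */
int divisible_by_3_mullo(uint32_t n) {
    uint32_t q = (uint32_t)(n * UINT32_C(0xAAAAAAAB)); /* n * inverse(3) mod 2^32 */
    return q <= UINT32_C(0x55555555);                  /* floor((2^32 - 1) / 3) */
}
```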
				822
				823	//===---------------------------------------------------------------------===//
Chris Lattner	c10e07a	2008-08-19 06:22:16 +0000	[diff] [blame]	824
Chris Lattner	92f9783	2008-10-15 16:06:03 +0000	[diff] [blame]	825	Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
				826	bunch of other stuff from this example (see PR1604):
				827
				828	#include <cstdio>
				829	struct test {
				830	int val;
				831	virtual ~test() {}
				832	};
				833
				834	int main() {
				835	test t;
				836	std::scanf("%d", &t.val);
				837	std::printf("%d\n", t.val);
				838	}
				839
				840	//===---------------------------------------------------------------------===//
				841
Chris Lattner	74da911	2008-10-15 16:33:52 +0000	[diff] [blame]	842	Instcombine will merge comparisons like (x >= 10) && (x < 20) by producing (x -
				843	10) u< 10, but only when the comparisons have matching sign.
				844
				845	This could be converted with a similar technique. (PR1941)
				846
				847	define i1 @test(i8 %x) {
				848	%A = icmp uge i8 %x, 5
				849	%B = icmp slt i8 %x, 20
				850	%C = and i1 %A, %B
				851	ret i1 %C
				852	}
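
For reference, the matching-sign merge that instcombine already does, sketched
in C (function names are illustrative only). Subtracting the lower bound in
unsigned arithmetic wraps every value below 10 around to a huge number, so one
unsigned compare covers both bounds. The mixed-sign IR above is not covered by
this sketch:

```c
/* Two signed compares. */
int in_range_two_cmps(int x) { return x >= 10 && x < 20; }

/* One unsigned compare: (x - 10) u< 10. */
int in_range_one_cmp(int x) { return (unsigned)x - 10u < 10u; }
```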
				853
				854	//===---------------------------------------------------------------------===//
Nick Lewycky	8ef52e2	2008-11-27 22:12:22 +0000	[diff] [blame]	855
Nick Lewycky	728f874	2008-11-27 22:41:45 +0000	[diff] [blame]	856	These functions perform the same computation, but produce different assembly.
Nick Lewycky	8ef52e2	2008-11-27 22:12:22 +0000	[diff] [blame]	857
				858	define i8 @select(i8 %x) readnone nounwind {
				859	%A = icmp ult i8 %x, 250
				860	%B = select i1 %A, i8 0, i8 1
				861	ret i8 %B
				862	}
				863
				864	define i8 @addshr(i8 %x) readnone nounwind {
				865	%A = zext i8 %x to i9
				866	%B = add i9 %A, 6 ;; 256 - 250 == 6
				867	%C = lshr i9 %B, 8
				868	%D = trunc i9 %C to i8
				869	ret i8 %D
				870	}
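
A C rendering of the two functions, as a sketch with made-up names, makes the
equivalence easy to check exhaustively. Adding 6 carries into bit 8 exactly
when x >= 250:

```c
/* Mirrors @select: 0 below 250, 1 otherwise. */
unsigned char select_form(unsigned char x) { return x < 250 ? 0 : 1; }

/* Mirrors @addshr: widen, add 256 - 250, keep the carry bit. */
unsigned char addshr_form(unsigned char x) {
    unsigned wide = (unsigned)x + 6u;   /* plays the role of the i9 add */
    return (unsigned char)(wide >> 8);  /* bit 8 is the carry out of i8 */
}
```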
				871
				872	//===---------------------------------------------------------------------===//
Eli Friedman	8e59f06	2008-11-30 07:36:04 +0000	[diff] [blame]	873
				874	From gcc bug 24696:
				875	int
				876	f (unsigned long a, unsigned long b, unsigned long c)
				877	{
				878	return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
				879	}
				880	int
				881	f (unsigned long a, unsigned long b, unsigned long c)
				882	{
				883	return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
				884	}
				885	Both should combine to ((a|b) & (c-1)) != 0. Currently not optimized with
				886	"clang -emit-llvm-bc | opt -std-compile-opts".
				887
				888	//===---------------------------------------------------------------------===//
				889
				890	From GCC Bug 20192:
				891	#define PMD_MASK (~((1UL << 23) - 1))
				892	void clear_pmd_range(unsigned long start, unsigned long end)
				893	{
				894	if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
				895	f();
				896	}
				897	The expression should optimize to something like
				898	"!((start|end)&~PMD_MASK)". Currently not optimized with "clang
				899	-emit-llvm-bc | opt -std-compile-opts".
				900
				901	//===---------------------------------------------------------------------===//
				902
				903	From GCC Bug 15241:
				904	unsigned int
				905	foo (unsigned int a, unsigned int b)
				906	{
				907	if (a <= 7 && b <= 7)
				908	baz ();
				909	}
				910	Should combine to "(a|b) <= 7". Currently not optimized with "clang
				911	-emit-llvm-bc | opt -std-compile-opts".
				912
				913	//===---------------------------------------------------------------------===//
				914
				915	From GCC Bug 3756:
				916	int
				917	pn (int n)
				918	{
				919	return (n >= 0 ? 1 : -1);
				920	}
				921	Should combine to (n >> 31) | 1. Currently not optimized with "clang
				922	-emit-llvm-bc | opt -std-compile-opts | llc".
				923
				924	//===---------------------------------------------------------------------===//
				925
				926	From GCC Bug 28685:
				927	int test(int a, int b)
				928	{
				929	int lt = a < b;
				930	int eq = a == b;
				931
				932	return (lt || eq);
				933	}
				934	Should combine to "a <= b". Currently not optimized with "clang
				935	-emit-llvm-bc | opt -std-compile-opts | llc".
				936
				937	//===---------------------------------------------------------------------===//
				938
				939	void a(int variable)
				940	{
				941	if (variable == 4 || variable == 6)
				942	bar();
				943	}
				944	This should optimize to "if ((variable | 2) == 6)". Currently not
				945	optimized with "clang -emit-llvm-bc | opt -std-compile-opts | llc".
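
The trick works because 4 (binary 100) and 6 (binary 110) differ only in bit 1,
so forcing that bit on maps both to 6 and nothing else to 6. A sketch with
illustrative names:

```c
/* Two equality compares. */
int two_cmps(int v) { return v == 4 || v == 6; }

/* One OR and one compare. */
int or_cmp(int v) { return (v | 2) == 6; }
```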
				946
				947	//===---------------------------------------------------------------------===//
				948
				949	unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return
				950	i;}
				951	unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
				952	These should combine to the same thing. Currently, the first function
				953	produces better code on X86.
				954
				955	//===---------------------------------------------------------------------===//
				956
Eli Friedman	8e59f06	2008-11-30 07:36:04 +0000	[diff] [blame]	957	From GCC Bug 15784:
				958	#define abs(x) x>0?x:-x
				959	int f(int x, int y)
				960	{
				961	return (abs(x)) >= 0;
				962	}
				963	This should optimize to x != INT_MIN. (With -fwrapv.) Currently not
				964	optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
				965
				966	//===---------------------------------------------------------------------===//
				967
				968	From GCC Bug 14753:
				969	void
				970	rotate_cst (unsigned int a)
				971	{
				972	a = (a << 10) | (a >> 22);
				973	if (a == 123)
				974	bar ();
				975	}
				976	void
				977	minus_cst (unsigned int a)
				978	{
				979	unsigned int tem;
				980
				981	tem = 20 - a;
				982	if (tem == 5)
				983	bar ();
				984	}
				985	void
				986	mask_gt (unsigned int a)
				987	{
				988	/* This is equivalent to a > 15. */
				989	if ((a & ~7) > 8)
				990	bar ();
				991	}
				992	void
				993	rshift_gt (unsigned int a)
				994	{
				995	/* This is equivalent to a > 23. */
				996	if ((a >> 2) > 5)
				997	bar ();
				998	}
				999	All should simplify to a single comparison. All of these are
				1000	currently not optimized with "clang -emit-llvm-bc | opt
				1001	-std-compile-opts".
				1002
				1003	//===---------------------------------------------------------------------===//
				1004
				1005	From GCC Bug 32605:
				1006	int c(int* x) {return (char)x+2 == (char)x;}
				1007	Should combine to 0. Currently not optimized with "clang
				1008	-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).
				1009
				1010	//===---------------------------------------------------------------------===//
				1011
				1012	int a(unsigned char* b) {return *b > 99;}
				1013	There's an unnecessary zext in the generated code with "clang
				1014	-emit-llvm-bc | opt -std-compile-opts".
				1015
				1016	//===---------------------------------------------------------------------===//
				1017
Eli Friedman	8e59f06	2008-11-30 07:36:04 +0000	[diff] [blame]	1018	int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
				1019	Should be combined to "((b >> 1) | b) & 1". Currently not optimized
				1020	with "clang -emit-llvm-bc | opt -std-compile-opts".
				1021
				1022	//===---------------------------------------------------------------------===//
				1023
				1024	unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
				1025	Should combine to "x | (y & 3)". Currently not optimized with "clang
				1026	-emit-llvm-bc | opt -std-compile-opts".
				1027
				1028	//===---------------------------------------------------------------------===//
				1029
				1030	unsigned a(unsigned a) {return ((a | 1) & 3) | (a & -4);}
				1031	Should combine to "a | 1". Currently not optimized with "clang
				1032	-emit-llvm-bc | opt -std-compile-opts".
				1033
				1034	//===---------------------------------------------------------------------===//
				1035
Eli Friedman	8e59f06	2008-11-30 07:36:04 +0000	[diff] [blame]	1036	int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
				1037	Should fold to "(~a & c) | (a & b)". Currently not optimized with
				1038	"clang -emit-llvm-bc | opt -std-compile-opts".
				1039
				1040	//===---------------------------------------------------------------------===//
				1041
				1042	int a(int a,int b) {return (~(a|b))|a;}
				1043	Should fold to "a|~b". Currently not optimized with "clang
				1044	-emit-llvm-bc | opt -std-compile-opts".
				1045
				1046	//===---------------------------------------------------------------------===//
				1047
				1048	int a(int a, int b) {return (a&&b) || (a&&!b);}
				1049	Should fold to "a". Currently not optimized with "clang -emit-llvm-bc
				1050	| opt -std-compile-opts".
				1051
				1052	//===---------------------------------------------------------------------===//
				1053
				1054	int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
				1055	Should fold to "a ? b : c", or at least something sane. Currently not
				1056	optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
				1057
				1058	//===---------------------------------------------------------------------===//
				1059
				1060	int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
				1061	Should fold to a && (b || c). Currently not optimized with "clang
				1062	-emit-llvm-bc | opt -std-compile-opts".
				1063
				1064	//===---------------------------------------------------------------------===//
				1065
				1066	int a(int x) {return x | ((x & 8) ^ 8);}
				1067	Should combine to x | 8. Currently not optimized with "clang
				1068	-emit-llvm-bc | opt -std-compile-opts".
				1069
				1070	//===---------------------------------------------------------------------===//
				1071
				1072	int a(int x) {return x ^ ((x & 8) ^ 8);}
				1073	Should also combine to x | 8. Currently not optimized with "clang
				1074	-emit-llvm-bc | opt -std-compile-opts".
				1075
				1076	//===---------------------------------------------------------------------===//
				1077
				1078	int a(int x) {return (x & 8) == 0 ? -1 : -9;}
				1079	Should combine to (x | -9) ^ 8. Currently not optimized with "clang
				1080	-emit-llvm-bc | opt -std-compile-opts".
				1081
				1082	//===---------------------------------------------------------------------===//
				1083
				1084	int a(int x) {return (x & 8) == 0 ? -9 : -1;}
				1085	Should combine to x | -9. Currently not optimized with "clang
				1086	-emit-llvm-bc | opt -std-compile-opts".
				1087
				1088	//===---------------------------------------------------------------------===//
				1089
				1090	int a(int x) {return ((x | -9) ^ 8) & x;}
				1091	Should combine to x & -9. Currently not optimized with "clang
				1092	-emit-llvm-bc | opt -std-compile-opts".
				1093
				1094	//===---------------------------------------------------------------------===//
				1095
				1096	unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
				1097	Should combine to "a * 0x88888888 >> 31". Currently not optimized
				1098	with "clang -emit-llvm-bc | opt -std-compile-opts".
				1099
				1100	//===---------------------------------------------------------------------===//
				1101
				1102	unsigned a(char* x) {if ((*x & 32) == 0) return b();}
				1103	There's an unnecessary zext in the generated code with "clang
				1104	-emit-llvm-bc | opt -std-compile-opts".
				1105
				1106	//===---------------------------------------------------------------------===//
				1107
				1108	unsigned a(unsigned long long x) {return 40 * (x >> 1);}
				1109	Should combine to "20 * (((unsigned)x) & -2)". Currently not
				1110	optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
				1111
				1112	//===---------------------------------------------------------------------===//
Bill Wendling	d0a76ee	2008-12-02 05:12:47 +0000	[diff] [blame]	1113
Chris Lattner	8cbcc3a	2008-12-02 06:32:34 +0000	[diff] [blame]	1114	This was noticed in the entryblock for grokdeclarator in 403.gcc:
				1115
				1116	%tmp = icmp eq i32 %decl_context, 4
				1117	%decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
				1118	%tmp1 = icmp eq i32 %decl_context_addr.0, 1
				1119	%decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0
				1120
				1121	tmp1 should be simplified to something like:
				1122	(!tmp && decl_context == 1)
				1123
				1124	This allows recursive simplifications, tmp1 is used all over the place in
				1125	the function, e.g. by:
				1126
				1127	%tmp23 = icmp eq i32 %decl_context_addr.1, 0 ; <i1> [#uses=1]
				1128	%tmp24 = xor i1 %tmp1, true ; <i1> [#uses=1]
				1129	%or.cond8 = and i1 %tmp23, %tmp24 ; <i1> [#uses=1]
				1130
				1131	later.
				1132
Chris Lattner	b37c3a4	2008-12-06 19:28:22 +0000	[diff] [blame]	1133	//===---------------------------------------------------------------------===//
				1134
				1135	Store sinking: This code:
				1136
				1137	void f (int n, int *cond, int *res) {
				1138	int i;
				1139	*res = 0;
				1140	for (i = 0; i < n; i++)
				1141	if (*cond)
				1142	*res ^= 234; /* (*) */
				1143	}
				1144
				1145	On this function GVN hoists the fully redundant value of *res, but nothing
				1146	moves the store out. This gives us this code:
				1147
				1148	bb: ; preds = %bb2, %entry
				1149	%.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
				1150	%i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
				1151	%1 = load i32* %cond, align 4
				1152	%2 = icmp eq i32 %1, 0
				1153	br i1 %2, label %bb2, label %bb1
				1154
				1155	bb1: ; preds = %bb
				1156	%3 = xor i32 %.rle, 234
				1157	store i32 %3, i32* %res, align 4
				1158	br label %bb2
				1159
				1160	bb2: ; preds = %bb, %bb1
				1161	%.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
				1162	%indvar.next = add i32 %i.05, 1
				1163	%exitcond = icmp eq i32 %indvar.next, %n
				1164	br i1 %exitcond, label %return, label %bb
				1165
				1166	DSE should sink partially dead stores to get the store out of the loop.
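
A hand-written sketch of the result that store sinking would aim for (function
names are made up, and this is not DSE output). The running value stays in a
local and is stored once after the loop. This is only safe if nothing else
observes *res while the loop runs:

```c
/* As written: a store inside the loop on every taken branch. */
void f_store_in_loop(int n, int *cond, int *res) {
    int i;
    *res = 0;
    for (i = 0; i < n; i++)
        if (*cond)
            *res ^= 234;
}

/* Store sunk by hand: one store after the loop. */
void f_store_sunk(int n, int *cond, int *res) {
    int i, tmp = 0;
    *res = 0;              /* kept so that cond == res behaves as before */
    for (i = 0; i < n; i++)
        if (*cond)
            tmp ^= 234;
    *res = tmp;
}
```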
				1167
Chris Lattner	7d675ed	2008-12-06 22:52:12 +0000	[diff] [blame]	1168	Here's another partial dead case:
				1169	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395
				1170
Chris Lattner	b37c3a4	2008-12-06 19:28:22 +0000	[diff] [blame]	1171	//===---------------------------------------------------------------------===//
				1172
				1173	Scalar PRE hoists the mul in the common block up to the else:
				1174
				1175	int test (int a, int b, int c, int g) {
				1176	int d, e;
				1177	if (a)
				1178	d = b * c;
				1179	else
				1180	d = b - c;
				1181	e = b * c + g;
				1182	return d + e;
				1183	}
				1184
				1185	It would be better to do the mul once to reduce codesize above the if.
				1186	This is GCC PR38204.
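
A sketch of the preferred shape, written by hand (test_one_mul is a hypothetical
name): the multiply is computed once above the if and reused on both paths.

```c
/* Original testcase: b * c appears in the then-branch and in the common block. */
int test_orig(int a, int b, int c, int g) {
    int d, e;
    if (a)
        d = b * c;
    else
        d = b - c;
    e = b * c + g;
    return d + e;
}

/* One multiply, hoisted above the if. */
int test_one_mul(int a, int b, int c, int g) {
    int m = b * c;
    int d = a ? m : b - c;
    int e = m + g;
    return d + e;
}
```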
				1187
				1188	//===---------------------------------------------------------------------===//
				1189
				1190	GCC PR37810 is an interesting case where we should sink load/store reload
				1191	into the if block and outside the loop, so we don't reload/store it on the
				1192	non-call path.
				1193
				1194	for () {
				1195	*P += 1;
				1196	if ()
				1197	call();
				1198	else
				1199	...
				1200	->
				1201	tmp = *P
				1202	for () {
				1203	tmp += 1;
				1204	if () {
				1205	*P = tmp;
				1206	call();
				1207	tmp = *P;
				1208	} else ...
				1209	}
				1210	*P = tmp;
				1211
Chris Lattner	a8ff424	2008-12-15 07:49:24 +0000	[diff] [blame]	1212	We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
				1213	we don't sink the store. We need partially dead store sinking.
				1214
Chris Lattner	b37c3a4	2008-12-06 19:28:22 +0000	[diff] [blame]	1215	//===---------------------------------------------------------------------===//
				1216
Chris Lattner	a8ff424	2008-12-15 07:49:24 +0000	[diff] [blame]	1217	[PHI TRANSLATE GEPs]
				1218
Chris Lattner	b37c3a4	2008-12-06 19:28:22 +0000	[diff] [blame]	1219	GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
				1220	leading to excess stack traffic. This could be handled by GVN with some crazy
				1221	symbolic phi translation. The code we get looks like (g is on the stack):
				1222
				1223	bb2: ; preds = %bb1
				1224	..
				1225	%9 = getelementptr %struct.f* %g, i32 0, i32 0
				1226	store i32 %8, i32* %9, align 4
					br label %bb3
				1227
				1228	bb3: ; preds = %bb1, %bb2, %bb
				1229	%c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
				1230	%b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
				1231	%10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
				1232	%11 = load i32* %10, align 4
				1233
				1234	%11 is fully redundant, and in BB2 it should have the value %8.
				1235
Chris Lattner	7d675ed	2008-12-06 22:52:12 +0000	[diff] [blame]	1236	GCC PR33344 is a similar case.
				1237
Chris Lattner	b37c3a4	2008-12-06 19:28:22 +0000	[diff] [blame]	1238	//===---------------------------------------------------------------------===//
				1239
Chris Lattner	7d675ed	2008-12-06 22:52:12 +0000	[diff] [blame]	1240	There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
				1241	GCC testsuite. There are many pre testcases as ssa-pre-*.c
				1242
Chris Lattner	faf145c	2008-12-15 08:32:28 +0000	[diff] [blame]	1243	//===---------------------------------------------------------------------===//
				1244
				1245	There are some interesting cases in testsuite/gcc.dg/tree-ssa/predcom* in the
				1246	GCC testsuite. For example, predcom-1.c is:
				1247
				1248	for (i = 2; i < 1000; i++)
				1249	fib[i] = (fib[i-1] + fib[i - 2]) & 0xffff;
				1250
				1251	which compiles into:
				1252
				1253	bb1: ; preds = %bb1, %bb1.thread
				1254	%indvar = phi i32 [ 0, %bb1.thread ], [ %0, %bb1 ]
				1255	%i.0.reg2mem.0 = add i32 %indvar, 2
				1256	%0 = add i32 %indvar, 1 ; <i32> [#uses=3]
				1257	%1 = getelementptr [1000 x i32]* @fib, i32 0, i32 %0
				1258	%2 = load i32* %1, align 4 ; <i32> [#uses=1]
				1259	%3 = getelementptr [1000 x i32]* @fib, i32 0, i32 %indvar
				1260	%4 = load i32* %3, align 4 ; <i32> [#uses=1]
				1261	%5 = add i32 %4, %2 ; <i32> [#uses=1]
				1262	%6 = and i32 %5, 65535 ; <i32> [#uses=1]
				1263	%7 = getelementptr [1000 x i32]* @fib, i32 0, i32 %i.0.reg2mem.0
				1264	store i32 %6, i32* %7, align 4
				1265	%exitcond = icmp eq i32 %0, 998 ; <i1> [#uses=1]
				1266	br i1 %exitcond, label %return, label %bb1
				1267
				1268	This is basically:
				1269	LOAD fib[i+1]
				1270	LOAD fib[i]
				1271	STORE fib[i+2]
				1272
				1273	Instead of handling this as a loop or other xform, all we'd need to do is teach
				1274	load PRE to phi translate the %0 add (i+1) into the predecessor as (i'+1+1) =
				1275	(i'+2) (where i' is the previous iteration of i). This would find the store
				1276	which feeds it.
				1277
				1278	predcom-2.c is apparently the same as predcom-1.c
				1279	predcom-3.c is very similar but needs loads feeding each other instead of
				1280	store->load.
				1281	predcom-4.c seems the same as the rest.
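
What the forwarded loop for predcom-1.c would look like if written by hand, as a
sketch (FIB_N and the function names are assumptions for illustration). Each
iteration's two loads are replaced by the values stored in the previous two
iterations:

```c
#define FIB_N 1000

/* Source form: two loads and one store per iteration. */
void fib_loads(int *fib) {
    int i;
    for (i = 2; i < FIB_N; i++)
        fib[i] = (fib[i - 1] + fib[i - 2]) & 0xffff;
}

/* Store-to-load forwarded form: loads only before the loop. */
void fib_forwarded(int *fib) {
    int prev2 = fib[0], prev1 = fib[1];
    int i;
    for (i = 2; i < FIB_N; i++) {
        int cur = (prev1 + prev2) & 0xffff;
        fib[i] = cur;
        prev2 = prev1;   /* fib[i-1] becomes fib[i-2] for the next iteration */
        prev1 = cur;     /* the value just stored becomes fib[i-1] */
    }
}
```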
				1282
				1283
				1284	//===---------------------------------------------------------------------===//
				1285
Chris Lattner	7d675ed	2008-12-06 22:52:12 +0000	[diff] [blame]	1286	Other simple load PRE cases:
Chris Lattner	a8ff424	2008-12-15 07:49:24 +0000	[diff] [blame]	1287	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35287 [LPRE crit edge splitting]
				1288
				1289	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34677 (licm does this, LPRE crit edge)
				1290	llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as | opt -mem2reg -simplifycfg -gvn | llvm-dis
				1291
Chris Lattner	faf145c	2008-12-15 08:32:28 +0000	[diff] [blame]	1292	//===---------------------------------------------------------------------===//
				1293
				1294	Type based alias analysis:
Chris Lattner	7d675ed	2008-12-06 22:52:12 +0000	[diff] [blame]	1295	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705
				1296
				1297	//===---------------------------------------------------------------------===//
				1298
				1299	When GVN/PRE finds a store of float* to a must-alias pointer when expecting
				1300	an int*, it should turn it into a bitcast. This is a nice generalization of
Chris Lattner	e3a6497	2008-12-07 00:15:10 +0000	[diff] [blame]	1301	the SROA hack that would apply to other cases, e.g.:
				1302
				1303	int foo(int C, int *P, float X) {
				1304	if (C) {
				1305	bar();
				1306	*P = 42;
				1307	} else
				1308	*(float*)P = X;
				1309
				1310	return *P;
				1311	}
				1312
Chris Lattner	7d675ed	2008-12-06 22:52:12 +0000	[diff] [blame]	1313
				1314	One example (that requires crazy phi translation) is:
Chris Lattner	faf145c	2008-12-15 08:32:28 +0000	[diff] [blame]	1315	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16799 [BITCAST PHI TRANS]
Chris Lattner	7d675ed	2008-12-06 22:52:12 +0000	[diff] [blame]	1316
				1317	//===---------------------------------------------------------------------===//
				1318
				1319	A/B get pinned to the stack because we turn an if/then into a select instead
				1320	of PRE'ing the load/store. This may be fixable in instcombine:
				1321	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892
				1322
Chris Lattner	faf145c	2008-12-15 08:32:28 +0000	[diff] [blame]	1323
				1324
Chris Lattner	7d675ed	2008-12-06 22:52:12 +0000	[diff] [blame]	1325	Interesting missed case because of control flow flattening (should be 2 loads):
				1326	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
Chris Lattner	faf145c	2008-12-15 08:32:28 +0000	[diff] [blame]	1327	With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
				1328	opt -mem2reg -gvn -instcombine | llvm-dis
				1329	we miss it because we need 1) GEP PHI TRAN, 2) CRIT EDGE 3) MULTIPLE DIFFERENT
				1330	VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS
Chris Lattner	7d675ed	2008-12-06 22:52:12 +0000	[diff] [blame]	1331
				1332	//===---------------------------------------------------------------------===//
				1333
				1334	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
				1335	We could eliminate the branch condition here, loading from null is undefined:
				1336
				1337	struct S { int w, x, y, z; };
				1338	struct T { int r; struct S s; };
				1339	void bar (struct S, int);
				1340	void foo (int a, struct T b)
				1341	{
				1342	struct S *c = 0;
				1343	if (a)
				1344	c = &b.s;
				1345	bar (*c, a);
				1346	}
				1347
				1348	//===---------------------------------------------------------------------===//
Chris Lattner	8cbcc3a	2008-12-02 06:32:34 +0000	[diff] [blame]	1349
Chris Lattner	b36ace9	2008-12-23 20:52:52 +0000	[diff] [blame]	1350	simplifylibcalls should do several optimizations for strspn/strcspn:
				1351
				1352	strcspn(x, "") -> strlen(x)
				1353	strcspn("", x) -> 0
				1354	strspn("", x) -> 0
				1355	strspn(x, "") -> 0
				1356	strcspn(x, "a") -> strchr(x, 'a')-x (when 'a' occurs in x)
				1357
				1358	strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):
				1359
				1360	size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
				1361	int __reject3) {
				1362	register size_t __result = 0;
				1363	while (__s[__result] != '\0' && __s[__result] != __reject1 &&
				1364	__s[__result] != __reject2 && __s[__result] != __reject3)
				1365	++__result;
				1366	return __result;
				1367	}
				1368
				1369	This should turn into a switch on the character. See PR3253 for some notes on
				1370	codegen.
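
A sketch of the switch form for a constant reject set. The function names and
the set "abc" are chosen here purely for illustration; C case labels must be
constants, so this only applies when the second argument is a known string:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical specialization of strcspn(s, "abc"). */
size_t strcspn_abc(const char *s) {
    size_t n = 0;
    for (;;) {
        switch (s[n]) {
        case '\0':
        case 'a':
        case 'b':
        case 'c':
            return n;      /* terminator or a rejected character */
        default:
            ++n;           /* keep scanning */
        }
    }
}

/* Cross-check against the library routine. */
int strcspn_abc_matches_libc(const char *s) {
    return strcspn_abc(s) == strcspn(s, "abc");
}
```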
				1371
				1372	456.hmmer apparently uses strcspn and strspn a lot. 471.omnetpp uses strspn.
				1373
				1374	//===---------------------------------------------------------------------===//
Chris Lattner	107d30a	2008-12-31 00:54:13 +0000	[diff] [blame]	1375
				1376	"gas" uses this idiom:
				1377	else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
				1378	..
				1379	else if (strchr ("<>", *intel_parser.op_string))
				1380
				1381	Those should be turned into a switch.
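
A sketch of the transformation for the second test (function names are made up).
Note that strchr also matches the terminating NUL of the set, so a faithful
switch needs a case for '\0' as well:

```c
#include <string.h>

/* Idiom as used in gas: membership test via strchr on a constant string. */
int is_angle_strchr(char c) { return strchr("<>", c) != NULL; }

/* Equivalent switch on the character. */
int is_angle_switch(char c) {
    switch (c) {
    case '<':
    case '>':
    case '\0':   /* strchr(s, 0) returns a pointer to the terminator */
        return 1;
    default:
        return 0;
    }
}
```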
				1382
				1383	//===---------------------------------------------------------------------===//
Chris Lattner	a146bc5	2009-01-08 06:52:57 +0000	[diff] [blame]	1384
				1385	252.eon contains this interesting code:
				1386
				1387	%3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
				1388	%3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
				1389	%strlen = call i32 @strlen(i8* %3072) ; uses = 1
				1390	%endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
				1391	call void @llvm.memcpy.i32(i8* %endptr,
				1392	i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
				1393	%3074 = call i32 @strlen(i8* %endptr) nounwind readonly
				1394
				1395	This is interesting for a couple reasons. First, in this:
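				1396	The strlen could be replaced with: %strlen = sub %3073, %3072, if the strcpy
				1397	were a stpcpy, which (unlike strcpy) returns a pointer to the end of the string.
				1398	The endptr GEP then just becomes equal to 3073, eliminating a strlen call and GEP.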
				1399
				1400	Second, the strlen that follows the memcpy can be replaced with:
				1401	strcpy call returns a pointer to the end of the string. Based on that, the
				1402	endptr GEP just becomes equal to 3073, which eliminates a strlen call and GEP.
				1403
				1404	Second, the memcpy+strlen strlen can be replaced with:
				1405
				1406	%3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly
				1407
				1408	Because the string was just copied into the destination buffer. This,
				1409	in turn, can be constant folded to "4".
				1410
				1411	In other code, it contains:
				1412
				1413	%endptr6978 = bitcast i8* %endptr69 to i32*
				1414	store i32 7107374, i32* %endptr6978, align 1
				1415	%3167 = call i32 @strlen(i8* %endptr69) nounwind readonly
				1416
				1417	Which could also be constant folded. Whatever is producing this should probably
				1418	be fixed to leave this as a memcpy from a string.
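
To see what the fold would produce, decode the stored constant by hand. The
sketch below assumes a little-endian target (the note does not say which target
produced this IR) and uses a made-up helper name; the fifth byte is a sentinel
added only to keep the sketch safe. Under that assumption 7107374 is 0x006C732E,
i.e. the bytes '.', 's', 'l', NUL:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes the 4 bytes of value in little-endian order, then a sentinel NUL,
   and returns the strlen a subsequent call would see. */
size_t strlen_after_i32_store(uint32_t value, char out[5]) {
    out[0] = (char)(value & 0xff);
    out[1] = (char)((value >> 8) & 0xff);
    out[2] = (char)((value >> 16) & 0xff);
    out[3] = (char)((value >> 24) & 0xff);
    out[4] = '\0';
    return strlen(out);
}
```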
				1419	%682 = getelementptr i8** %argv, i32 6 ; <i8**> [#uses=2]
				1420	Further, eon also has an interesting partially redundant strlen call:
				1421
				1422	bb8: ; preds = %_ZN18eonImageCalculatorC1Ev.exit
				1423	%682 = getelementptr i8 %argv, i32 6 ; <i8> [#uses=2]
				1424	%683 = load i8** %682, align 4 ; <i8*> [#uses=4]
				1425	%684 = load i8* %683, align 1 ; <i8> [#uses=1]
				1426	%685 = icmp eq i8 %684, 0 ; <i1> [#uses=1]
				1427	br i1 %685, label %bb10, label %bb9
				1428
				1429	bb9: ; preds = %bb8
				1430	%686 = call i32 @strlen(i8* %683) nounwind readonly
				1431	%687 = icmp ugt i32 %686, 254 ; <i1> [#uses=1]
				1432	br i1 %687, label %bb10, label %bb11
				1433
				1434	bb10: ; preds = %bb9, %bb8
				1435	%688 = call i32 @strlen(i8* %683) nounwind readonly
				1436
				1437	This could be eliminated by doing the strlen once in bb8, saving code size and
				1438	improving perf on the bb8->9->10 path.
				1439
				1440	//===---------------------------------------------------------------------===//
Chris Lattner	9c04463	2009-01-08 07:34:55 +0000	[diff] [blame]	1441
				1442	I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
				1443	which looks like:
				1444	%movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0
				1445
				1446
				1447	bb62: ; preds = %bb55, %bb53
				1448	%promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
				1449	%171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
				1450	%172 = add i32 %171, -1 ; <i32> [#uses=1]
				1451	%173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172
				1452
				1453	... no stores ...
				1454	br i1 %or.cond, label %bb65, label %bb72
				1455
				1456	bb65: ; preds = %bb62
				1457	store i8 0, i8* %173, align 1
				1458	br label %bb72
				1459
				1460	bb72: ; preds = %bb65, %bb62
				1461	%trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
				1462	%177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
				1463
				1464	Note that on the bb62->bb72 path, the %177 strlen call is partially
				1465	redundant with the %171 call. At worst, we could shove the %177 strlen call
				1466	up into the bb65 block moving it out of the bb62->bb72 path. However, note
				1467	that bb65 stores to the string, zeroing out the last byte. This means that on
				1468	that path the value of %177 is actually just %171-1. A sub is cheaper than a
				1469	strlen!
				1470
				1471	This pattern repeats several times, basically doing:
				1472
				1473	A = strlen(P);
				1474	P[A-1] = 0;
				1475	B = strlen(P);
				1476	where it is "obvious" that B = A-1.
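
The same pattern as a sketch in C (helper names are illustrative). Both versions
assume P is non-empty, since P[A-1] is written:

```c
#include <stddef.h>
#include <string.h>

/* As the code does today: a second strlen after chopping the last byte. */
size_t chop_two_strlens(char *p) {
    size_t a = strlen(p);
    p[a - 1] = 0;
    return strlen(p);
}

/* With the redundancy removed: the new length is just a - 1. */
size_t chop_one_strlen(char *p) {
    size_t a = strlen(p);
    p[a - 1] = 0;
    return a - 1;
}
```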
				1477
				1478	//===---------------------------------------------------------------------===//
				1479
				1480	186.crafty contains this interesting pattern:
				1481
				1482	%77 = call i8* @strstr(i8* getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0),
				1483	i8* %30)
				1484	%phitmp648 = icmp eq i8* %77, getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0)
				1485	br i1 %phitmp648, label %bb70, label %bb76
				1486
				1487	bb70: ; preds = %OptionMatch.exit91, %bb69
				1488	%78 = call i32 @strlen(i8* %30) nounwind readonly align 1 ; <i32> [#uses=1]
				1489
				1490	This is basically:
				1491	cststr = "abcdef";
				1492	if (strstr(cststr, P) == cststr) {
				1493	x = strlen(P);
				1494	...
				1495
				1496	The strstr call would be significantly cheaper written as:
				1497
				1498	cststr = "abcdef";
				1499	if (memcmp(P, cststr, strlen(P)) == 0)
				1500	x = strlen(P);
				1501
				1502	This is memcmp+strlen instead of strstr. This also makes the strlen fully
				1503	redundant.
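
A sketch of the equivalence (function names are made up). It uses strncmp rather
than the memcmp suggested above, so that it never reads past the end of the
constant when P is longer than it; that substitution is a choice made for this
sketch, not something the note prescribes:

```c
#include <string.h>

static const char cststr[] = "abcdef";

/* Original idiom: P is a prefix of cststr iff strstr finds it at offset 0. */
int prefix_via_strstr(const char *p) {
    return strstr(cststr, p) == cststr;
}

/* Cheaper form: compare only the first strlen(P) characters. */
int prefix_via_strncmp(const char *p) {
    return strncmp(cststr, p, strlen(p)) == 0;
}
```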
				1504
				1505	//===---------------------------------------------------------------------===//
				1506
				1507	186.crafty also contains this code:
				1508
				1509	%1906 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
				1510	%1907 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1906
				1511	%1908 = call i8* @strcpy(i8* %1907, i8* %1905) nounwind align 1
				1512	%1909 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
				1513	%1910 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1909
				1514
				1515	If the strcpy were a stpcpy (which returns the end of the copied string), the
					last strlen would be computable as 1908-@pgn_event, which means 1910=1908.
				1516
				1517	//===---------------------------------------------------------------------===//
				1518
				1519	186.crafty has this interesting pattern with the "out.4543" variable:
				1520
				1521	call void @llvm.memcpy.i32(
				1522	i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
				1523	i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
				1524	%101 = call @printf(i8* ... @out.4543, i32 0, i32 0)) nounwind
				1525
				1526	It is basically doing:
				1527
				1528	memcpy(globalarray, "string", 7);
				1529	printf(..., globalarray);
				1530
				1531	Because printf only reads the memory, the string can be forward substituted
				1532	directly into the printf, which eliminates the reads from globalarray.
				1533	Since this pattern occurs frequently in crafty (due to the "DisplayTime" and
				1534	other similar functions) there are many stores to "out". Once all the printfs
				1535	stop using "out", all that is left is the memcpys into it. This should allow
				1536	globalopt to remove the "stored only" global.
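
A before/after sketch in C, under stated assumptions: the names are
hypothetical, the format string is invented (the original one is elided), and
snprintf is used in place of printf so that the two results can be compared:

```c
#include <stdio.h>
#include <string.h>

static char out[10]; /* hypothetical stand-in for @out.4543 */

/* Before: copy the literal into the global, then format from the global. */
static int format_via_global(char *buf, size_t size) {
  memcpy(out, "string", 7);
  return snprintf(buf, size, "%s", out);
}

/* After forward substitution: the formatting call no longer reads "out". */
static int format_direct(char *buf, size_t size) {
  return snprintf(buf, size, "%s", "string");
}

/* Returns 1 if both forms produce the same text and length. */
static int outputs_match(void) {
  char a[16], b[16];
  int na = format_via_global(a, sizeof a);
  int nb = format_direct(b, sizeof b);
  return na == nb && strcmp(a, b) == 0;
}
```

Once no call reads "out", the memcpy in the first function is the only
remaining use of the global.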
				1537
				1538	//===---------------------------------------------------------------------===//
				1539
Dan Gohman	a340102	2009-01-20 01:07:33 +0000	[diff] [blame]	1540	This code:
				1541
				1542	define inreg i32 @foo(i8* inreg %p) nounwind {
				1543	%tmp0 = load i8* %p
				1544	%tmp1 = ashr i8 %tmp0, 5
				1545	%tmp2 = sext i8 %tmp1 to i32
				1546	ret i32 %tmp2
				1547	}
				1548
				1549	could be dagcombine'd into a sign-extending load followed by a shift.
				1550	For example, x86 currently produces this:
				1551
				1552	movb (%eax), %al
				1553	sarb $5, %al
				1554	movsbl %al, %eax
				1555
				1556	while it could get this:
				1557
				1558	movsbl (%eax), %eax
				1559	sarl $5, %eax
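
The equivalence of the two shapes can be checked exhaustively in C. This is a
sketch with invented function names, and it assumes that >> on a negative
signed value is an arithmetic shift, which is implementation-defined in C but
matches ashr/sar:

```c
#include <stdint.h>

/* Shape of the IR above: shift the 8-bit value, then sign-extend to 32. */
static int32_t shift_then_extend(const int8_t *p) {
  int8_t narrow = (int8_t)(*p >> 5);
  return (int32_t)narrow;
}

/* Desired shape: sign-extending load, then a 32-bit arithmetic shift. */
static int32_t extend_then_shift(const int8_t *p) {
  int32_t wide = *p;
  return wide >> 5;
}

/* Returns 1 if the two shapes agree for every possible byte value. */
static int forms_agree(void) {
  for (int v = -128; v <= 127; ++v) {
    int8_t byte = (int8_t)v;
    if (shift_then_extend(&byte) != extend_then_shift(&byte))
      return 0;
  }
  return 1;
}
```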
				1560
				1561	//===---------------------------------------------------------------------===//
Chris Lattner	ed047e9	2009-01-22 07:16:03 +0000	[diff] [blame]	1562
				1563	GCC PR31029:
				1564
				1565	int test(int x) { return 1-x == x; } // --> return false
				1566	int test2(int x) { return 2-x == x; } // --> return x == 1 ?
				1567
				1568	This is always foldable to false for odd constants; what is the rule for even constants?
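
One way to explore the question is to brute-force a narrow type. The sketch
below (the function name is made up) counts the 8-bit values x that satisfy
C - x == x in wrapping arithmetic. There the equation is 2*x == C (mod 256),
which has no solution for odd C and two solutions, C/2 and C/2 + 128, for even
C. The second solution only exists because of wraparound, which suggests that
x == C/2 may be the fold when the subtraction is known not to overflow:

```c
#include <stdint.h>

/* Counts the 8-bit values x for which (c - x) wraps around to x. */
static int count_solutions(uint8_t c) {
  int count = 0;
  for (unsigned x = 0; x < 256; ++x)
    if ((uint8_t)(c - x) == (uint8_t)x)
      ++count;
  return count;
}
```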
				1569
				1570	//===---------------------------------------------------------------------===//
				1571
edwin	331eef2	2009-01-24 19:30:25 +0000	[diff] [blame]	1572	PR 3381: A GEP to a field of size 0 inside a struct could be turned into a GEP
				1573	for the next field in the struct (which is at the same address).
				1574
				1575	For example, a store of a float into { {{}}, float } could be turned into a
				1576	store to the float directly.
				1577
Edwin Török	13ede9c	2009-02-20 18:42:06 +0000	[diff] [blame]	1578	//===---------------------------------------------------------------------===//
Nick Lewycky	7ff6241	2009-02-25 06:52:48 +0000	[diff] [blame]	1579
Edwin Török	13ede9c	2009-02-20 18:42:06 +0000	[diff] [blame]	1580	#include <math.h>
				1581	double foo(double a) { return sin(a); }
				1582
				1583	This compiles into the following on x86-64 Linux:
				1584	foo:
				1585	subq $8, %rsp
				1586	call sin
				1587	addq $8, %rsp
				1588	ret
				1589	when it could be a tail call:
				1590
				1591	foo:
				1592	jmp sin
				1593
Nick Lewycky	7ff6241	2009-02-25 06:52:48 +0000	[diff] [blame]	1594	//===---------------------------------------------------------------------===//
				1595
Chris Lattner	b1f817f	2009-05-11 17:41:40 +0000	[diff] [blame]	1596	The argument promotion pass should use nocapture to make its alias analysis
				1597	much more precise.
				1598
				1599	//===---------------------------------------------------------------------===//
				1600
				1601	The following functions should be optimized to use a select instead of a
				1602	branch (from gcc PR40072):
				1603
				1604	char char_int(int m) {if(m>7) return 0; return m;}
				1605	int int_char(char m) {if(m>7) return 0; return m;}
				1606
				1607	//===---------------------------------------------------------------------===//
				1608
Nick Lewycky	7ff6241	2009-02-25 06:52:48 +0000	[diff] [blame]	1609	Instcombine should replace the load with a constant in:
				1610
				1611	static const char x[4] = {'a', 'b', 'c', 'd'};
				1612
				1613	unsigned int y(void) {
				1614	return *(unsigned int *)x;
				1615	}
				1616
				1617	It currently only does this transformation when the size of the constant
				1618	is the same as the size of the integer (so, try x[5]) and the last byte
				1619	is a null (making it a C string). There's no need for these restrictions.
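
To make the expected result concrete, here is a C sketch of the constant the
load could fold to. It is illustrative only: the helper names are invented, it
assumes ASCII and a 32-bit unsigned int, and it reads the bytes with memcpy
instead of the pointer cast so that the example stays within the aliasing
rules:

```c
#include <stdint.h>
#include <string.h>

static const char x[4] = {'a', 'b', 'c', 'd'};

/* The load itself: reads the four bytes of x as one 32-bit integer. */
static uint32_t load_x(void) {
  uint32_t v;
  memcpy(&v, x, sizeof v);
  return v;
}

/* The constant the load could be folded to, for a given byte order. */
static uint32_t folded_x(int little_endian) {
  return little_endian ? 0x64636261u : 0x61626364u;
}

/* Detects the byte order of the machine running the example. */
static int host_is_little_endian(void) {
  const uint16_t probe = 1;
  unsigned char first;
  memcpy(&first, &probe, 1);
  return first == 1;
}
```

Neither the array size nor a trailing null plays any part in computing the
folded value.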
				1620
				1621	//===---------------------------------------------------------------------===//
				1622
Chris Lattner	fd62ccc	2009-05-11 17:36:33 +0000	[diff] [blame]	1623	InstCombine's "turn load from constant into constant" optimization should be
				1624	more aggressive in the presence of bitcasts. For example, because of unions,
				1625	this code:
				1626
				1627	union vec2d {
				1628	double e[2];
				1629	double v __attribute__((vector_size(16)));
				1630	};
				1631	typedef union vec2d vec2d;
				1632
				1633	static vec2d a={{1,2}}, b={{3,4}};
				1634
				1635	vec2d foo () {
				1636	return (vec2d){ .v = a.v + b.v * (vec2d){{5,5}}.v };
				1637	}
				1638
				1639	Compiles into:
				1640
				1641	@a = internal constant %0 { [2 x double]
				1642	[double 1.000000e+00, double 2.000000e+00] }, align 16
				1643	@b = internal constant %0 { [2 x double]
				1644	[double 3.000000e+00, double 4.000000e+00] }, align 16
				1645	...
				1646	define void @foo(%struct.vec2d* noalias nocapture sret %agg.result) nounwind {
				1647	entry:
				1648	%0 = load <2 x double>* getelementptr (%struct.vec2d*
				1649	bitcast (%0* @a to %struct.vec2d*), i32 0, i32 0), align 16
				1650	%1 = load <2 x double>* getelementptr (%struct.vec2d*
				1651	bitcast (%0* @b to %struct.vec2d*), i32 0, i32 0), align 16
				1652
				1653
				1654	Instcombine should be able to optimize away the loads (and thus the globals).
				1655
Chris Lattner	9105aa9	2009-09-14 16:49:26 +0000	[diff] [blame^]	1656	See also PR4973
Chris Lattner	fd62ccc	2009-05-11 17:36:33 +0000	[diff] [blame]	1657
				1658	//===---------------------------------------------------------------------===//
Nick Lewycky	e07a535	2009-07-09 04:03:30 +0000	[diff] [blame]	1659
				1660	I saw this constant expression in real code after llvm-g++ -O2:
				1661
				1662	declare extern_weak i32 @0(i64)
				1663
				1664	define void @foo() {
				1665	br i1 icmp eq (i32 zext (i1 icmp ne (i32 (i64)* @0, i32 (i64)* null) to i32),
				1666	i32 0), label %cond_true, label %cond_false
				1667	cond_true:
				1668	ret void
				1669	cond_false:
				1670	ret void
				1671	}
				1672
				1673	That branch expression should be reduced to:
				1674
				1675	i1 icmp eq (i32 (i64)* @0, i32 (i64)* null)
				1676
				1677	It's probably not a perf issue; I just happened to see it while examining
				1678	something else and didn't want to forget about it.
				1679
				1680	//===---------------------------------------------------------------------===//