Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)  This isn't
safe in general, even on darwin.  See the libm implementation of hypot for
examples (which special case when x/y are exactly zero to get signed zeros etc
right).

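As a sketch in C (made-up function names; only valid when errno and
precision don't matter):

#include <math.h>

double mag(double x, double y) { return hypot(x, y); }

/* could become, under -ffast-math: */
double mag_fast(double x, double y) { return sqrt(x*x + y*y); }
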
//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack.  But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
   x = 1ULL << i;

into:
   long long tmp = 1;
   for (i = ...; ++i, tmp+=tmp)
     x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn: X*X*X*X -> t=(X*X) (t*t) to eliminate a multiply.

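In C terms, a sketch of that rewrite (made-up function names):

int pow4(int X) {
  return X*X*X*X;      /* three multiplies */
}
int pow4_reassoc(int X) {
  int t = X*X;         /* two multiplies total */
  return t*t;
}
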
//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

Reassociate should handle the example in GCC PR16157.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) { return memcmp(j, l, 4); }
int h(int *j, int *l) { return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works but can be overly conservative as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should add 'unaligned load/store' nodes, and produce them from code like
this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2      #cond_next
LBB16_1:        #cond_true
        incl _foo
LBB16_2:        #cond_next

emit:
        movl    _foo, %eax
        cmpl    $1, %edi
        sbbl    $-1, %eax
        movl    %eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs.  We could even make an intrinsic for this
if anyone cared enough about sincos.

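A sketch of the combine in C (sincos is a GNU extension, so availability
varies by platform; the names here are made up):

#define _GNU_SOURCE
#include <math.h>

double s, c;
void before(double x) { s = sin(x); c = cos(x); }
void after(double x)  { sincos(x, &s, &c); }
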
//===---------------------------------------------------------------------===//

Scalar Repl cannot currently promote this testcase to 'ret long cst':

        %struct.X = type { i32, i32 }
        %struct.Y = type { %struct.X }

define i64 @bar() {
        %retval = alloca %struct.Y, align 8
        %tmp12 = getelementptr %struct.Y* %retval, i32 0, i32 0, i32 0
        store i32 0, i32* %tmp12
        %tmp15 = getelementptr %struct.Y* %retval, i32 0, i32 0, i32 1
        store i32 1, i32* %tmp15
        %retval.upgrd.1 = bitcast %struct.Y* %retval to i64*
        %retval.upgrd.2 = load i64* %retval.upgrd.1
        ret i64 %retval.upgrd.2
}

it should be extended to do so.

//===---------------------------------------------------------------------===//

-scalarrepl should promote this to be a vector scalar.

        %struct..0anon = type { <4 x float> }

define void @test1(<4 x float> %V, float* %P) {
        %u = alloca %struct..0anon, align 16
        %tmp = getelementptr %struct..0anon* %u, i32 0, i32 0
        store <4 x float> %V, <4 x float>* %tmp
        %tmp1 = bitcast %struct..0anon* %u to [4 x float]*
        %tmp.upgrd.1 = getelementptr [4 x float]* %tmp1, i32 0, i32 1
        %tmp.upgrd.2 = load float* %tmp.upgrd.1
        %tmp3 = mul float %tmp.upgrd.2, 2.000000e+00
        store float %tmp3, float* %P
        ret void
}

//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

void %test(uint* %P) {
        %tmp = load uint* %P
        %tmp14 = or uint %tmp, 3305111552
        %tmp15 = and uint %tmp14, 3321888767
        store uint %tmp15, uint* %P
        ret void
}

//===---------------------------------------------------------------------===//

dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.

Compile:

int bar(int x)
{
  int t = __builtin_clz(x);
  return -(t>>5);
}

to:

_bar:   addic r3,r3,-1
        subfe r3,r3,r3
        blr

//===---------------------------------------------------------------------===//

Legalize should lower cttz like this:
  cttz(x) = popcnt((x-1) & ~x)

on targets that have popcnt but not cttz.  Itanium, what else?

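A sketch of that lowering in C, using the GCC/clang popcount builtin
(the formula counts the bits below the lowest set bit, so cttz(0) comes
out as 32 here):

unsigned cttz32(unsigned x) {
  return __builtin_popcount((x - 1) & ~x);
}
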
//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
        {
          /* Flip the target bit of each basis state */
          reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }

Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be just
so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias
reg->node[i], but alas...

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

BITBOARD = unsigned long long
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

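One fold from this general family, sketched in C (made-up function names;
signed division truncates toward zero, so x/10 == 3 exactly for x in
[30,39]):

int f(int x) { return x / 10 == 3; }

/* can become a single range check: */
int g(int x) { return (unsigned)x - 30u <= 9u; }
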
//===---------------------------------------------------------------------===//

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level.  We need a "loops to memcpy"
pass.

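A sketch of the shape of loop such a pass would rewrite (illustrative;
the memcpy form assumes the two ranges don't overlap):

#include <string.h>

void copy_loop(int *dst, const int *src, unsigned n) {
  unsigned i;
  for (i = 0; i < n; i++)
    dst[i] = src[i];
}

/* becomes: */
void copy_call(int *dst, const int *src, unsigned n) {
  memcpy(dst, src, n * sizeof *dst);
}
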
//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

//===---------------------------------------------------------------------===//

Promote for i32 bswap can use i64 bswap + shr.  Useful on targets with 64-bit
regs and bswap, like itanium.

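A sketch of the promoted form in C (using the GCC/clang bswap builtin
purely for illustration):

#include <stdint.h>

uint32_t bswap32_via_64(uint32_t x) {
  return (uint32_t)(__builtin_bswap64(x) >> 32);
}
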
//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two identical IV's (after promotion) on PPC/ARM:

LBB1_1: @bb.preheader
        mov r3, #0
        mov r2, r3
        mov r1, r3
LBB1_2: @bb
        ldr r12, LCPI1_0
        ldr r12, [r12]
        strh r2, [r12]
        ldr r12, LCPI1_1
        ldr r12, [r12]
        strh r3, [r12]
        add r1, r1, #1  <- [0,+,1]
        add r3, r3, #4
        add r2, r2, #1  <- [0,+,1]
        cmp r1, r0
        bne LBB1_2  @bb


//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
        %tmp.1 = and i32 %a, 1          ; <i32> [#uses=1]
        %tmp.2 = icmp ne i32 %tmp.1, 0          ; <i1> [#uses=1]
        br i1 %tmp.2, label %then.0, label %else.0

then.0:         ; preds = %entry
        %tmp.5 = add i32 %a, -1         ; <i32> [#uses=1]
        %tmp.3 = call i32 @t4( i32 %tmp.5 )             ; <i32> [#uses=1]
        br label %return

else.0:         ; preds = %entry
        %tmp.7 = icmp ne i32 %a, 0              ; <i1> [#uses=1]
        br i1 %tmp.7, label %then.1, label %return

then.1:         ; preds = %else.0
        %tmp.11 = add i32 %a, -2                ; <i32> [#uses=1]
        %tmp.9 = call i32 @t4( i32 %tmp.11 )            ; <i32> [#uses=1]
        br label %return

return:         ; preds = %then.1, %else.0, %then.0
        %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                            [ %tmp.9, %then.1 ]
        ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination is not transforming this function, because it is
returning n, which fails the isDynamicConstant check in the accumulator
recursion checks.

long long fib(const long long n) {
  switch(n) {
  case 0:
  case 1:
    return n;
  default:
    return fib(n-1) + fib(n-2);
  }
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative.  "return foo() << 1" can be tail recursion eliminated.

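For reference, a hand-written sketch of the accumulator form a smarter
TRE could produce (valid for n >= 0):

int pow2m1_iter(int n) {
  int acc = 0;
  for (; n != 0; --n)
    acc = 2 * acc + 1;    /* builds 2^n - 1 iteratively */
  return acc;
}
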
//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
        %tmp = load i32* %x             ; <i32> [#uses=0]
        %tmp.foo = call i32 @foo( i32* %x )             ; <i32> [#uses=1]
        ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
        %tmp3 = call i32 @foo( i32* %x )                ; <i32> [#uses=1]
        ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

"basicaa" should know how to look through "or" instructions that act like add
instructions.  For example in this code, the x*4+1 is turned into x*4 | 1, and
basicaa can't analyze the array subscript, leading to duplicated loads in the
generated code:

void test(int X, int Y, int a[]) {
  int i;
  for (i=2; i<1000; i+=4) {
    a[i+0] = a[i-1+0]*a[i-2+0];
    a[i+1] = a[i-1+1]*a[i-2+1];
    a[i+2] = a[i-1+2]*a[i-2+2];
    a[i+3] = a[i-1+3]*a[i-2+3];
  }
}

BasicAA also doesn't do this for add.  It needs to know that &A[i+1] != &A[i].

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
        subl    $28, %esp
        call    "L1$pb"
"L1$pb":
        popl    %eax
        cmpl    $0, 32(%esp)
        je      LBB1_2  # cond_true
LBB1_1: # return
        # ...
        addl    $28, %esp
        ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

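A hand-written sketch of the idea (keys and hash chosen for illustration:
3, 57, 1024 and 40000 happen to be distinct mod 5, so k % 5 is a perfect
hash for them; a real lowering would search for a suitable hash function):

static const unsigned keys[5] = { 40000, 0, 57, 3, 1024 };
static const int      vals[5] = {     3, -1,  1, 0,    2 };

/* equivalent to: switch (k) { case 3: return 0; case 57: return 1;
   case 1024: return 2; case 40000: return 3; default: return -1; } */
int classify(unsigned k) {
  unsigned h = k % 5;                  /* perfect hash: dense index 0..4 */
  return keys[h] == k ? vals[h] : -1;  /* one compare verifies the key  */
}
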
//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations.  On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar.  On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable.  For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:
        pushl   %esi
        movl    8(%esp), %eax
        movb    _m_HotKey+3, %cl
        movb    _m_HotKey+4, %dl
        movb    _m_HotKey+2, %ch
        movw    _m_HotKey, %si
        movw    %si, (%eax)
        movb    %ch, 2(%eax)
        movb    %cl, 3(%eax)
        movb    %dl, 4(%eax)
        popl    %esi
        ret     $4

GCC produces:

__Z9GetHotKeyv:
        movl    _m_HotKey, %edx
        movl    4(%esp), %eax
        movl    %edx, (%eax)
        movzwl  _m_HotKey+4, %edx
        movw    %dx, 4(%eax)
        ret     $4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

        %struct.THotKey = type { i16, i8, i8, i8 }
define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
...
        %tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
        %tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
        %tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
        %tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should extend parameter attributes to capture more information about
pointer parameters for alias analysis.  Some ideas:

1. Add a "nocapture" attribute, which indicates that the callee does not store
   the address of the parameter into a global or any other memory location
   visible to the callee.  This can be used to make basicaa and other analyses
   more powerful.  It is true for things like memcpy, strcat, and many other
   things, including structs passed by value, most C++ references, etc.
2. Generalize readonly to be set on parameters.  This is important mod/ref
   info for the function, which is important for basicaa and others.  It can
   also be used by the inliner to avoid inserting a memcpy for byval
   arguments when the function is inlined.

These attributes can be inferred by various analysis passes such as the
globalsmodrefaa pass.  Note that getting #2 right is actually really tricky.
Consider this code:

struct S;  S G;
void caller(S byvalarg) { G.field = 1; ... }
void callee() { caller(G); }

The fact that the caller does not modify byval arg is not enough, we need
to know that it doesn't modify G either.  This is very tricky.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

This GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34043
contains a testcase that compiles down to:

        %struct.XMM128 = type { <4 x float> }
..
        %src = alloca %struct.XMM128
..
        %tmp6263 = bitcast %struct.XMM128* %src to <2 x i64>*
        %tmp65 = getelementptr %struct.XMM128* %src, i32 0, i32 0
        store <2 x i64> %tmp5899, <2 x i64>* %tmp6263, align 16
        %tmp66 = load <4 x float>* %tmp65, align 16
        %tmp71 = add <4 x float> %tmp66, %tmp66

If the mid-level optimizer turned the bitcast of pointer + store of tmp5899
into a bitcast of the vector value and a store to the pointer, then the
store->load could be easily removed.

//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,1,1,1,1,1,1,1};
  foo(input);
}

We currently compile this into a memcpy from a global array since the
initializer is fairly large and not memset'able.  This is good, but the memcpy
gets lowered to load/stores in the code generator.  This is also ok, except
that the codegen lowering for memcpy doesn't handle the case when the source
is a constant global.  This gives us atrocious code like this:

        call    "L1$pb"
"L1$pb":
        popl    %eax
        movl    _C.0.1444-"L1$pb"+32(%eax), %ecx
        movl    %ecx, 40(%esp)
        movl    _C.0.1444-"L1$pb"+20(%eax), %ecx
        movl    %ecx, 28(%esp)
        movl    _C.0.1444-"L1$pb"+36(%eax), %ecx
        movl    %ecx, 44(%esp)
        movl    _C.0.1444-"L1$pb"+44(%eax), %ecx
        movl    %ecx, 52(%esp)
        movl    _C.0.1444-"L1$pb"+40(%eax), %ecx
        movl    %ecx, 48(%esp)
        movl    _C.0.1444-"L1$pb"+12(%eax), %ecx
        movl    %ecx, 20(%esp)
        movl    _C.0.1444-"L1$pb"+4(%eax), %ecx
...

instead of:
        movl    $1, 16(%esp)
        movl    $0, 20(%esp)
        movl    $1, 24(%esp)
        movl    $0, 28(%esp)
        movl    $1, 32(%esp)
        movl    $0, 36(%esp)
        ...

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef".  Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop.  One trivial example is:

#include <stdio.h>
int main() {
    int nRet = 17;
    int nLoop;
    for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
        if ( nLoop & 1 )
            nRet += 2;
        else
            nRet -= 1;
    }
    return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size.  The resultant code would then also be suitable for
exit value computation.

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc, x86,
etc.  On X86, we miss a bunch of 'rotate by variable' cases because the rotate
matching code in dag combine doesn't look through truncates aggressively
enough.  Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y){
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x){
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x){
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f2/f3/f4 right.  On x86-32, a few of these
generate truly horrible code, instead of using shld and friends.  On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness.  PPC64 misses f, f5 and f6.  CellSPU aborts in isel.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together.  For
example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy.  This can only
be done safely if "b" isn't modified between the strlen and memcpy of course.

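A sketch of the equivalence (hypothetical functions; note that strcpy also
copies the terminating NUL, so the precise byte count is strlen(b)+1):

#include <string.h>

char *f(char *a, const char *b) {
  memcpy(a, b, strlen(b) + 1);      /* +1 copies the NUL too */
  return a;
}

/* when b is not modified in between, this is just: */
char *g(char *a, const char *b) { return strcpy(a, b); }
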
//===---------------------------------------------------------------------===//

We should be able to evaluate this loop:

int test(int x_offs) {
  while (x_offs > 4)
     x_offs -= 4;
  return x_offs;
}

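The exit value has a closed form; a hand-derived sketch of what an exit
value computation could produce (for x_offs > 4 the loop leaves a value
in [1,4]):

int test_closed(int x_offs) {
  return x_offs > 4 ? ((x_offs - 1) & 3) + 1 : x_offs;
}
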
//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

//===---------------------------------------------------------------------===//

We generate a horrible libcall for llvm.powi.  For example, we compile:

#include <cmath>
double f(double a) { return std::pow(a, 4); }

into:

__Z1fd:
        subl    $12, %esp
        movsd   16(%esp), %xmm0
        movsd   %xmm0, (%esp)
        movl    $4, 8(%esp)
        call    L___powidf2$stub
        addl    $12, %esp
        ret

GCC produces:

__Z1fd:
        subl    $12, %esp
        movsd   16(%esp), %xmm0
        mulsd   %xmm0, %xmm0
        mulsd   %xmm0, %xmm0
        movsd   %xmm0, (%esp)
        fldl    (%esp)
        addl    $12, %esp
        ret

//===---------------------------------------------------------------------===//

We compile this program: (from GCC PR11680)
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

Into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%    0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%    0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2).  */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper).  Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

I think this basically amounts to a dag combine to simplify comparisons against
multiply hi's into a comparison against the mullo.

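A sketch of the mul-lo form for n % 3 == 0 (0xAAAAAAAB is the
multiplicative inverse of 3 mod 2^32, and 0x55555555 = 0xFFFFFFFF/3 is
the largest quotient that fits):

int divisible_by_3(unsigned n) {
  return n * 0xAAAAAAABu <= 0x55555555u;
}
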
//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

Instcombine will merge comparisons like (x >= 10) && (x < 20) by producing (x -
10) u< 10, but only when the comparisons have matching sign.

This could be converted with a similar technique. (PR1941)

define i1 @test(i8 %x) {
  %A = icmp uge i8 %x, 5
  %B = icmp slt i8 %x, 20
  %C = and i1 %A, %B
  ret i1 %C
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6       ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0.  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK    (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
   if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
       f();
}
The expression should optimize to something like
985"!((start|end)&~PMD_MASK). Currently not optimized with "clang
986-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 15241:
unsigned int
foo (unsigned int a, unsigned int b)
{
  if (a <= 7 && b <= 7)
    baz ();
}
Should combine to "(a|b) <= 7".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 3756:
int
pn (int n)
{
    return (n >= 0 ? 1 : -1);
}
Should combine to (n >> 31) | 1.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

From GCC Bug 28685:
int test(int a, int b)
{
  int lt = a < b;
  int eq = a == b;

  return (lt || eq);
}
Should combine to "a <= b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

void a(int variable)
{
    if (variable == 4 || variable == 6)
        bar();
}
This should optimize to "if ((variable | 2) == 6)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return
i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing.  Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
 return (abs(x)) >= 0;
}
This should optimize to x != INT_MIN. (With -fwrapv.)  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
 a = (a << 10) | (a >> 22);
 if (a == 123)
   bar ();
}
void
minus_cst (unsigned int a)
{
 unsigned int tem;

 tem = 20 - a;
 if (tem == 5)
   bar ();
}
void
mask_gt (unsigned int a)
{
 /* This is equivalent to a > 15.  */
 if ((a & ~7) > 8)
   bar ();
}
void
rshift_gt (unsigned int a)
{
 /* This is equivalent to a > 23.  */
 if ((a >> 2) > 5)
   bar ();
}
All should simplify to a single comparison.  All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned char* b) {return *b > 99;}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return ((a | 1) & 3) | (a & -4);}
Should combine to "a | 1".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)".  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a".  Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane.  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c).  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -1 : -9;}
Should combine to (x | -9) ^ 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -9 : -1;}
Should combine to x | -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

We would like to do the following transform in the instcombiner:

  -X/C -> X/-C

However, this isn't valid if (-X) overflows.  We can implement this when we
have the concept of a "C signed subtraction" operator that is undefined on
overflow.

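Concretely (illustrative functions): the two sides differ exactly when X
is INT_MIN, where -X overflows:

int f(int X) { return -X / 8; }   /* -X is undefined for X == INT_MIN */
int g(int X) { return X / -8; }   /* well defined for X == INT_MIN */
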
//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

        %tmp = icmp eq i32 %decl_context, 4
        %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
        %tmp1 = icmp eq i32 %decl_context_addr.0, 1
        %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp || decl_context == 1)

This allows recursive simplifications, tmp1 is used all over the place in
the function, e.g. by:

        %tmp23 = icmp eq i32 %decl_context_addr.1, 0            ; <i1> [#uses=1]
        %tmp24 = xor i1 %tmp1, true             ; <i1> [#uses=1]
        %or.cond8 = and i1 %tmp23, %tmp24               ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

Store sinking: This code:

void f (int n, int *cond, int *res) {
   int i;
   *res = 0;
   for (i = 0; i < n; i++)
      if (*cond)
         *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out.  This gives us this code:

bb:             ; preds = %bb2, %entry
        %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
        %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
        %1 = load i32* %cond, align 4
        %2 = icmp eq i32 %1, 0
        br i1 %2, label %bb2, label %bb1

bb1:            ; preds = %bb
        %3 = xor i32 %.rle, 234
        store i32 %3, i32* %res, align 4
        br label %bb2

bb2:            ; preds = %bb, %bb1
        %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
        %indvar.next = add i32 %i.05, 1
        %exitcond = icmp eq i32 %indvar.next, %n
        br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partial dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.

//===---------------------------------------------------------------------===//

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store.  We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[PHI TRANSLATE GEPs]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic.  This could be handled by GVN with some crazy
symbolic phi translation.  The code we get looks like (g is on the stack):

bb2:            ; preds = %bb1
..
        %9 = getelementptr %struct.f* %g, i32 0, i32 0
        store i32 %8, i32* %9, align 4
        br label %bb3

bb3:            ; preds = %bb1, %bb2, %bb
        %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
        %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
        %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
        %11 = load i32* %10, align 4

%11 is fully redundant, and in BB2 it should have the value %8.

GCC PR33344 is a similar case.

//===---------------------------------------------------------------------===//

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite.  There are many pre testcases as ssa-pre-*.c

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite.  For example, predcom-1.c is:

 for (i = 2; i < 1000; i++)
    fib[i] = (fib[i-1] + fib[i - 2]) & 0xffff;

which compiles into:

bb1:            ; preds = %bb1, %bb1.thread
        %indvar = phi i32 [ 0, %bb1.thread ], [ %0, %bb1 ]
        %i.0.reg2mem.0 = add i32 %indvar, 2
        %0 = add i32 %indvar, 1         ; <i32> [#uses=3]
        %1 = getelementptr [1000 x i32]* @fib, i32 0, i32 %0
        %2 = load i32* %1, align 4              ; <i32> [#uses=1]
        %3 = getelementptr [1000 x i32]* @fib, i32 0, i32 %indvar
        %4 = load i32* %3, align 4              ; <i32> [#uses=1]
        %5 = add i32 %4, %2             ; <i32> [#uses=1]
        %6 = and i32 %5, 65535          ; <i32> [#uses=1]
        %7 = getelementptr [1000 x i32]* @fib, i32 0, i32 %i.0.reg2mem.0
        store i32 %6, i32* %7, align 4
        %exitcond = icmp eq i32 %0, 998         ; <i1> [#uses=1]
        br i1 %exitcond, label %return, label %bb1

This is basically:
  LOAD fib[i+1]
  LOAD fib[i]
  STORE fib[i+2]

instead of handling this as a loop or other xform, all we'd need to do is teach
load PRE to phi translate the %0 add (i+1) into the predecessor as (i'+1+1) =
(i'+2) (where i' is the previous iteration of i).  This would find the store
which feeds it.

predcom-2.c is apparently the same as predcom-1.c
predcom-3.c is very similar but needs loads feeding each other instead of
store->load.
predcom-4.c seems the same as the rest.


//===---------------------------------------------------------------------===//

Other simple load PRE cases:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35287 [LPRE crit edge splitting]

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34677 (licm does this, LPRE crit edge)
  llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as | opt -mem2reg -simplifycfg -gvn | llvm-dis

//===---------------------------------------------------------------------===//

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

//===---------------------------------------------------------------------===//

When GVN/PRE finds a store of float* to a must-aliased pointer when expecting
an int*, it should turn it into a bitcast.  This is a nice generalization of
the SROA hack that would apply to other cases, e.g.:

int foo(int C, int *P, float X) {
  if (C) {
    bar();
    *P = 42;
  } else
    *(float*)P = X;

  return *P;
}

One example (that requires crazy phi translation) is:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16799 [BITCAST PHI TRANS]

//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store.  This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

Chris Lattner6a09a742008-12-06 22:52:12 +00001422Interesting missed case because of control flow flattening (should be 2 loads):
1423http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
Chris Lattner582048d2008-12-15 08:32:28 +00001424With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
1425 opt -mem2reg -gvn -instcombine | llvm-dis
1426we miss it because we need 1) GEP PHI TRAN, 2) CRIT EDGE 3) MULTIPLE DIFFERENT
1427VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS
Chris Lattner6a09a742008-12-06 22:52:12 +00001428
1429//===---------------------------------------------------------------------===//
1430
1431http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
1432We could eliminate the branch condition here, loading from null is undefined:
1433
1434struct S { int w, x, y, z; };
1435struct T { int r; struct S s; };
1436void bar (struct S, int);
1437void foo (int a, struct T b)
1438{
1439 struct S *c = 0;
1440 if (a)
1441 c = &b.s;
1442 bar (*c, a);
1443}
1444
1445//===---------------------------------------------------------------------===//
Chris Lattner88d84b22008-12-02 06:32:34 +00001446
Chris Lattner9cf8ef62008-12-23 20:52:52 +00001447simplifylibcalls should do several optimizations for strspn/strcspn:
1448
1449strcspn(x, "") -> strlen(x)
1450strcspn("", x) -> 0
1451strspn("", x) -> 0
1452strspn(x, "") -> strlen(x)
1453strspn(x, "a") -> strchr(x, 'a')-x
1454
1455strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):
1456
1457size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
1458 int __reject3) {
1459 register size_t __result = 0;
1460 while (__s[__result] != '\0' && __s[__result] != __reject1 &&
1461 __s[__result] != __reject2 && __s[__result] != __reject3)
1462 ++__result;
1463 return __result;
1464}
1465
1466This should turn into a switch on the character. See PR3253 for some notes on
1467codegen.
1468
1469456.hmmer apparently uses strcspn and strspn a lot. 471.omnetpp uses strspn.
1470
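A sketch of the switch form when the reject set is a compile-time constant
(hypothetical function; this is strcspn(s, "ab")):

#include <stddef.h>

size_t scan_ab(const char *s) {
  size_t i = 0;
  for (;; ++i) {
    switch (s[i]) {
    case '\0': case 'a': case 'b':   /* stop at NUL or any reject char */
      return i;
    }
  }
}
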
//===---------------------------------------------------------------------===//

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
  else if (strchr ("<>", *intel_parser.op_string)

Those should be turned into a switch.

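A sketch of the switch form (note that strchr also matches the terminating
NUL, so a faithful rewrite has to treat '\0' as not-found):

/* strchr("<>", c) != NULL, for c != '\0' */
int in_set(char c) {
  switch (c) {
  case '<': case '>': return 1;
  default:            return 0;
  }
}
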
1480//===---------------------------------------------------------------------===//
Chris Lattnerffb08f52009-01-08 06:52:57 +00001481
1482252.eon contains this interesting code:
1483
1484 %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
1485 %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
1486 %strlen = call i32 @strlen(i8* %3072) ; uses = 1
1487 %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
1488 call void @llvm.memcpy.i32(i8* %endptr,
1489 i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
1490 %3074 = call i32 @strlen(i8* %endptr) nounwind readonly
1491
1492This is interesting for a couple reasons. First, in this:
1493
1494 %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
1495 %strlen = call i32 @strlen(i8* %3072)
1496
1497The strlen could be replaced with: %strlen = sub %3072, %3073, because the
1498strcpy call returns a pointer to the end of the string. Based on that, the
1499endptr GEP just becomes equal to 3073, which eliminates a strlen call and GEP.

Second, the strlen call following the memcpy can be replaced with:

  %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

because that constant string was just copied into %endptr.  This, in turn, can
be constant folded to "4".

In other code, it contains:

  %endptr6978 = bitcast i8* %endptr69 to i32*
  store i32 7107374, i32* %endptr6978, align 1
  %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

This could also be constant folded.  Whatever is producing this should
probably be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit
  %682 = getelementptr i8** %argv, i32 6        ; <i8**> [#uses=2]
  %683 = load i8** %682, align 4                ; <i8*> [#uses=4]
  %684 = load i8* %683, align 1                 ; <i8> [#uses=1]
  %685 = icmp eq i8 %684, 0                     ; <i1> [#uses=1]
  br i1 %685, label %bb10, label %bb9

bb9:            ; preds = %bb8
  %686 = call i32 @strlen(i8* %683) nounwind readonly
  %687 = icmp ugt i32 %686, 254                 ; <i1> [#uses=1]
  br i1 %687, label %bb10, label %bb11

bb10:           ; preds = %bb9, %bb8
  %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.
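
At the source level the rewrite would look roughly like this (hypothetical
names; the single strlen in bb8 feeds both later uses):

#include <string.h>

int classify(const char *arg) {
  size_t len = strlen(arg);     /* one strlen, hoisted into bb8 */
  if (*arg == 0 || len > 254)
    return (int)len;            /* bb10: reuses len */
  return 0;                     /* bb11 */
}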

//===---------------------------------------------------------------------===//

I see an interesting partially redundant call to strlen left in
186.crafty:InputMove, which looks like:

  %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0

bb62:           ; preds = %bb55, %bb53
  %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
  %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
  %172 = add i32 %171, -1                       ; <i32> [#uses=1]
  %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172

  ... no stores ...
  br i1 %or.cond, label %bb65, label %bb72

bb65:           ; preds = %bb62
  store i8 0, i8* %173, align 1
  br label %bb72

bb72:           ; preds = %bb65, %bb62
  %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
  %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1

Note that on the bb62->bb72 path, the %177 strlen call is fully redundant with
the %171 call.  At worst, we could shove the %177 strlen call up into the bb65
block, moving it out of the bb62->bb72 path.  However, note that bb65 stores
to the string, zeroing out the last byte.  This means that on that path the
value of %177 is actually just %171-1.  A sub is cheaper than a strlen!

This pattern repeats several times, basically doing:

  A = strlen(P);
  P[A-1] = 0;
  B = strlen(P);

where it is "obvious" that B = A-1.
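
The desired rewrite, sketched (assumes P is non-empty, as the original pattern
already does):

#include <string.h>

void chop(char *P) {
  size_t A = strlen(P);
  P[A-1] = 0;
  size_t B = A - 1;             /* instead of B = strlen(P) */
  (void)B;
}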

//===---------------------------------------------------------------------===//

186.crafty contains this interesting pattern:

%77 = call i8* @strstr(i8* getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0),
                       i8* %30)
%phitmp648 = icmp eq i8* %77, getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0)
br i1 %phitmp648, label %bb70, label %bb76

bb70:           ; preds = %OptionMatch.exit91, %bb69
  %78 = call i32 @strlen(i8* %30) nounwind readonly align 1 ; <i32> [#uses=1]

This is basically:
  cststr = "abcdef";
  if (strstr(cststr, P) == cststr) {
    x = strlen(P);
    ...

The strstr call only tests whether P is a prefix of cststr, so it would be
significantly cheaper written as:

cststr = "abcdef";
if (memcmp(cststr, P, strlen(P)) == 0)
  x = strlen(P);

(If P can be longer than cststr, strncmp should be used instead of memcmp to
avoid reading past the end of the constant.)  This is memcmp+strlen instead of
strstr.  This also makes the strlen fully redundant.

//===---------------------------------------------------------------------===//

186.crafty also contains this code:

%1906 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1907 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1906
%1908 = call i8* @strcpy(i8* %1907, i8* %1905) nounwind align 1
%1909 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1910 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1909

If the strcpy is first strength-reduced to stpcpy (strcpy returns its
destination %1907, while stpcpy returns a pointer to the terminating nul), the
last strlen is computable as %1908-@pgn_event, which means %1910=%1908.
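
At the C level the whole sequence reduces to something like this (sketch;
stpcpy is the POSIX variant of strcpy that returns the end pointer):

#include <string.h>

char pgn_event[32];

size_t append(const char *src) {
  char *end = stpcpy(pgn_event + strlen(pgn_event), src);
  return (size_t)(end - pgn_event);   /* new length, no second strlen */
}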

//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
        i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
        i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call @printf(i8* ...  @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

  memcpy(globalarray, "string");
  printf(..., globalarray);

By recognizing that printf just reads the memory and forward substituting the
string directly into the printf call, we can eliminate the reads from
globalarray.  Since this pattern occurs frequently in crafty (due to the
"DisplayTime" and other similar functions), there are many stores to "out".
Once all the printfs stop using "out", all that is left is the memcpy's into
it.  This should allow globalopt to remove the "stored only" global.
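
A reduced illustration of the before and after (hypothetical format string):

#include <stdio.h>
#include <string.h>

static char out[10];

void display(void) {
  memcpy(out, "string", 7);
  printf("%s\n", out);              /* before: reads the global */
  /* after forward substitution:  printf("%s\n", "string");     */
}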

//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
  %tmp0 = load i8* %p
  %tmp1 = ashr i8 %tmp0, 5
  %tmp2 = sext i8 %tmp1 to i32
  ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load with a shift.  For example, on
x86 this currently codegens to:

        movb    (%eax), %al
        sarb    $5, %al
        movsbl  %al, %eax

while it could be:

        movsbl  (%eax), %eax
        sarl    $5, %eax

//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; }     // --> return false
int test2(int x) { return 2-x == x; }    // --> return x == 1 ?

Always foldable for odd constants: C-x == x requires C == 2*x, and 2*x is
always even, so the compare is false.  For even C the candidate fold is
x == C/2, but in plain wraparound arithmetic x == C/2 + 0x80000000 also
satisfies 2*x == C, so the fold to a single equality is only valid when signed
overflow is treated as undefined.

//===---------------------------------------------------------------------===//

PR 3381: a GEP to a field of size 0 inside a struct could be turned into a GEP
for the next field in the struct (which is at the same address).

For example, a store of float into { {{}}, float } could be turned into a
store to the float field directly.

//===---------------------------------------------------------------------===//

#include <math.h>
double foo(double a) { return sin(a); }

On x86-64 Linux, this compiles into:

foo:
        subq    $8, %rsp
        call    sin
        addq    $8, %rsp
        ret

vs the tail call:

foo:
        jmp     sin

//===---------------------------------------------------------------------===//

Instcombine should replace the load with a constant in:

  static const char x[4] = {'a', 'b', 'c', 'd'};

  unsigned int y(void) {
    return *(unsigned int *)x;
  }

It currently only does this transformation when the size of the constant is
the same as the size of the integer (so, try x[5] to see it fail) and the last
byte is a null (making it a C string).  There's no need for these
restrictions.
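
For reference, on a little-endian target the load folds to ('a' = 0x61 lands
in the low byte):

unsigned int y(void) {
  return 0x64636261;              /* 'd','c','b','a' packed little-endian */
}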

//===---------------------------------------------------------------------===//

The arg promotion pass should make use of nocapture to make its alias analysis
much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int int_char(char m) {if(m>7) return 0; return m;}

//===---------------------------------------------------------------------===//

InstCombine's "turn load from constant into constant" optimization should be
more aggressive in the presence of bitcasts.  For example, because of unions,
this code:

union vec2d {
    double e[2];
    double v __attribute__((vector_size(16)));
};
typedef union vec2d vec2d;

static vec2d a={{1,2}}, b={{3,4}};

vec2d foo () {
    return (vec2d){ .v = a.v + b.v * (vec2d){{5,5}}.v };
}

compiles into:

@a = internal constant %0 { [2 x double]
    [double 1.000000e+00, double 2.000000e+00] }, align 16
@b = internal constant %0 { [2 x double]
    [double 3.000000e+00, double 4.000000e+00] }, align 16
...
define void @foo(%struct.vec2d* noalias nocapture sret %agg.result) nounwind {
entry:
  %0 = load <2 x double>* getelementptr (%struct.vec2d*
       bitcast (%0* @a to %struct.vec2d*), i32 0, i32 0), align 16
  %1 = load <2 x double>* getelementptr (%struct.vec2d*
       bitcast (%0* @b to %struct.vec2d*), i32 0, i32 0), align 16

Instcombine should be able to optimize away the loads (and thus the globals);
the whole function would then constant fold to returning the vector {16, 22}.

//===---------------------------------------------------------------------===//