lib/Target/README.txt (fp2-dev/platform/external/llvm)

Target Independent Opportunities:

//===---------------------------------------------------------------------===//

Dead argument elimination should be enhanced to handle cases when an argument is
dead to an externally visible function. Though the argument can't be removed
from the externally visible function, the caller doesn't need to pass it in.
For example, in this testcase:

  void foo(int X) __attribute__((noinline));
  void foo(int X) { sideeffect(); }
  void bar(int A) { foo(A+1); }

We compile bar to:

  define void @bar(i32 %A) nounwind ssp {
    %0 = add nsw i32 %A, 1                          ; <i32> [#uses=1]
    tail call void @foo(i32 %0) nounwind noinline ssp
    ret void
  }

The add is dead; we could pass in 'i32 undef' instead. This occurs for C++
templates etc., which usually have linkonce_odr/weak_odr linkage, not internal
linkage.

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
MachineInstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs. Instead, these should be added by the code generator (e.g. on the DAG).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling conventions (e.g. x86) which have different
   clobber sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent.

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math). Misc/mandel will like this. :) This isn't
safe in general, even on darwin. See the libm implementation of hypot for
examples (which special-case x or y being exactly zero to get signed zeros etc.
right).

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

  int X, Y;

  void fn1(void)
  {
    X = X | (Y << 3);
  }

compiles to

  fn1:
          movl    Y, %eax
          shll    $3, %eax
          orl     X, %eax
          movl    %eax, X
          ret

The problem is that the store's chain operand is not the load of X but rather
a TokenFactor of the load of X and the load of Y, which prevents the folding.

There are two ways to fix this:

1. The DAG combiner can start using alias analysis to realize that X and Y
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack. But that is a short-term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

  for (i = ...; ++i) {
    x = 1ULL << i;
  }

into:

  long long tmp = 1;
  for (i = ...; ++i, tmp+=tmp)
    x = tmp;

This would be a win on ppc32, but not x86 or ppc64.
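A minimal C sketch of the rewrite, assuming the loop's only job is to record each x into a hypothetical output array `out` and that the trip count `n` is at most 64 (so `1ULL << i` stays defined). The function names are illustrative, not LLVM code.

```c
#include <assert.h>
#include <stdint.h>

/* Original form: a 64-bit shift by a variable amount on every iteration,
   which a 32-bit target expands into a multi-instruction sequence. */
void pow2_shift(uint64_t *out, unsigned n) {
  for (unsigned i = 0; i < n; ++i)
    out[i] = 1ULL << i;
}

/* Strength-reduced form: keep the current power of two in tmp and
   double it with an add (tmp += tmp), as in the note above. */
void pow2_doubling(uint64_t *out, unsigned n) {
  uint64_t tmp = 1;
  for (unsigned i = 0; i < n; ++i, tmp += tmp)
    out[i] = tmp;
}
```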

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn things like:

  int factorial(int X) {
    return X*X*X*X*X*X*X*X;
  }

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

First, the intrinsic needs to be extended to support integers, and second the
code generator needs to be enhanced to lower these to multiplication trees.
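To make the payoff concrete, here is a sketch of the two shapes for X^8 in C, plus a generic binary-exponentiation loop of the kind an integer powi lowering could use. It uses unsigned arithmetic so wraparound is well defined; `pow8_chain`, `pow8_tree` and `powi_u` are hypothetical helpers, not the llvm.powi implementation.

```c
#include <assert.h>

/* What the source writes: a linear chain of 7 dependent multiplies. */
unsigned pow8_chain(unsigned x) {
  return x * x * x * x * x * x * x * x;
}

/* A balanced tree for the same product: 3 multiplies, depth 3. */
unsigned pow8_tree(unsigned x) {
  unsigned x2 = x * x;
  unsigned x4 = x2 * x2;
  return x4 * x4;
}

/* Generic powi(x, n) by binary exponentiation (square-and-multiply). */
unsigned powi_u(unsigned x, unsigned n) {
  unsigned result = 1;
  while (n) {
    if (n & 1)
      result *= x;
    x *= x;
    n >>= 1;
  }
  return result;
}
```

Because unsigned multiplication wraps modulo 2^32 and is associative, all three agree even when the product overflows.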

//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

  int bar(int x, int y) {
    return x*x*x+y+x*x*x*x*x*y*y*y*y;
  }
  int foo(int z, int n) {
    return bar(z, n) + bar(2*z, 2*n);
  }

This is blocked on not handling X*X*X -> powi(X, 3) (see note above). The issue
is that we end up getting t = 2*X, s = t*t and don't turn this into 4*X*X,
which is the same number of multiplies and is canonical, because the 2*X has
multiple uses. Here's a simple example:

  define i32 @test15(i32 %X1) {
    %B = mul i32 %X1, 47   ; X1*47
    %C = mul i32 %B, %B
    ret i32 %C
  }

//===---------------------------------------------------------------------===//

Reassociate should handle the example in GCC PR16157:

  extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;
  void f () {  /* this can be optimized to four additions... */
    b4 = a4 + a3 + a2 + a1 + a0;
    b3 = a3 + a2 + a1 + a0;
    b2 = a2 + a1 + a0;
    b1 = a1 + a0;
  }

This requires reassociating to forms of expressions that are already available,
something that reassoc doesn't think about yet.
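A sketch of the target form for the PR16157 example: each sum reuses the previous one, so the four statements cost four additions instead of ten. The globals are defined here (rather than extern) only so the snippet links on its own, and `f_reassociated` is a hypothetical name.

```c
#include <assert.h>

/* Defined here so the sketch is self-contained; the PR declares them extern. */
int a0, a1, a2, a3, a4;
int b1, b2, b3, b4;

/* Reassociated form: each line reuses the previous sum, so the four
   assignments need 4 additions instead of the 10 written in the source. */
void f_reassociated(void) {
  b1 = a1 + a0;
  b2 = a2 + b1;
  b3 = a3 + b2;
  b4 = a4 + b3;
}
```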

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

  int g(int *j, int *l) { return memcmp(j, l, 4); }
  int h(int *j, int *l) { return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1, 2, 4, and 8 bytes.
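One caveat when doing this: memcmp compares the bytes as unsigned char and only the sign of its result is specified, so the exact equivalent is an unsigned three-way comparison of the two words read in big-endian order, not a signed subtraction. A portable C sketch, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read 4 bytes as a big-endian unsigned value, whatever the host order. */
uint32_t load_be32(const unsigned char *p) {
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Same sign as memcmp(p, q, 4). On a big-endian target each load_be32 is a
   single 32-bit load, leaving one unsigned compare. Returns -1, 0 or 1. */
int cmp4(const void *p, const void *q) {
  uint32_t a = load_be32(p), b = load_be32(q);
  return (a > b) - (a < b);
}

/* Normalize any int to -1, 0 or 1 so it can be compared with cmp4. */
int sign_of(int v) { return (v > 0) - (v < 0); }
```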

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the DAG combiner enough to simplify the code expanded before
legalize. It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns an alignment equal to
the type size. This works but can be overly conservative, as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

  v4sf example(float *P) {
    return (v4sf){P[0], P[1], P[2], P[3] };
  }

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns. Instead
of:

          movl    136(%esp), %eax
          cmpl    $0, %eax
          je      LBB16_2 #cond_next
  LBB16_1:        #cond_true
          incl    _foo
  LBB16_2:        #cond_next

emit:
          movl    _foo, %eax
          cmpl    $1, %edi
          sbbl    $-1, %eax
          movl    %eax, _foo
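The emitted sequence works because `cmpl $1, %edi` sets the carry flag exactly when %edi is 0 (unsigned below 1), and `sbbl $-1, %eax` computes %eax + 1 - CF. At the C level, the branch-free form simply adds the comparison result. A sketch, assuming the condition is a hypothetical int argument x (in the original it is loaded from the stack):

```c
#include <assert.h>

int foo;

/* Branchy form: compare, conditional jump, increment. */
void inc_if_nonzero_branchy(int x) {
  if (x != 0)
    foo++;
}

/* Branch-free form: (x != 0) is 0 or 1 and is added unconditionally,
   which is the shape that maps onto cmp + sbb on x86. */
void inc_if_nonzero_branchless(int x) {
  foo += (x != 0);
}
```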

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      void sincos(double x, double *sin, double *cos);
      void sincosf(float x, float *sin, float *cos);
      void sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers. See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with multiple return values (MRVs). We could even make
an intrinsic for this if anyone cared enough about sincos.

//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

  define void @test(i32* %P) {
    %tmp = load i32* %P
    %tmp14 = or i32 %tmp, 3305111552
    %tmp15 = and i32 %tmp14, 3321888767
    store i32 %tmp15, i32* %P
    ret void
  }
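The constants explain why this is possible: 3305111552 is 0xC5000000 and 3321888767 is 0xC5FFFFFF, so the or/and pair forces the most significant byte to 0xC5 and leaves the low three bytes unchanged. The C sketch below checks that equivalence at the value level; the function names are hypothetical. A backend would emit the second form as one 8-bit store to the MSB's address (offset 3 on little-endian, offset 0 on big-endian).

```c
#include <assert.h>
#include <stdint.h>

/* The IR's sequence: load, or, and, store. */
void via_load_or_and(uint32_t *p) {
  uint32_t t = *p;
  t |= 3305111552u;   /* 0xC5000000 */
  t &= 3321888767u;   /* 0xC5FFFFFF */
  *p = t;
}

/* Equivalent value: only the top byte changes, and it becomes 0xC5. */
uint32_t set_top_byte(uint32_t v) {
  return (v & 0x00FFFFFFu) | 0xC5000000u;
}
```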

//===---------------------------------------------------------------------===//

dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.

Compile:

  int bar(int x)
  {
    int t = __builtin_clz(x);
    return -(t>>5);
  }

to:

  _bar:   addic r3,r3,-1
          subfe r3,r3,r3
          blr
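The transform holds because, for a 32-bit clz that returns 32 for zero, clz(x) >> 5 is 1 exactly when x == 0 and 0 otherwise. (__builtin_clz(0) itself is undefined in C, so the sketch uses a hypothetical portable `clz32`.) In the PPC code, addic r3,r3,-1 sets the carry exactly when x != 0, and subfe r3,r3,r3 then yields CA - 1, i.e. 0 or -1.

```c
#include <assert.h>
#include <stdint.h>

/* Portable count-leading-zeros that defines clz32(0) == 32. */
unsigned clz32(uint32_t x) {
  unsigned n = 0;
  if (x == 0)
    return 32;
  while (!(x & 0x80000000u)) {
    x <<= 1;
    n++;
  }
  return n;
}

/* The source idiom: -(clz(x) >> 5). */
int bar_clz(uint32_t x) { return -(int)(clz32(x) >> 5); }

/* The simplified form: clz32(x) >> 5 is 1 exactly when x == 0. */
int bar_cmp(uint32_t x) { return -(x == 0); }
```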

//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
        {
          /* Flip the target bit of each basis state */
          reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }

Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be just
so cool to turn it into something like:

    long long Res = ((MAX_UNSIGNED) 1 << target);
    if (target < 32) {
      for(i=0; i<reg->size; i++)
        reg->node[i].state ^= Res & 0xFFFFFFFFULL;
    } else {
      for(i=0; i<reg->size; i++)
        reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
    }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias reg->node[i],
but this requires TBAA.
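A sketch of the unswitched loop at the level a 32-bit target sees it, modelling each 64-bit state as two 32-bit halves. The struct `u64_pair` and both function names are hypothetical, and `target` is assumed to be below 64:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit state held in two 32-bit registers. */
typedef struct { uint32_t lo, hi; } u64_pair;

/* Original: both halves are XORed on every iteration. */
void flip_naive(u64_pair *s, size_t n, unsigned target) {
  uint64_t mask = (uint64_t)1 << target;
  for (size_t i = 0; i < n; i++) {
    s[i].lo ^= (uint32_t)mask;
    s[i].hi ^= (uint32_t)(mask >> 32);
  }
}

/* Unswitched: decide once which half holds the bit, then do a single
   32-bit XOR per iteration. */
void flip_unswitched(u64_pair *s, size_t n, unsigned target) {
  if (target < 32) {
    uint32_t m = (uint32_t)1 << target;
    for (size_t i = 0; i < n; i++)
      s[i].lo ^= m;
  } else {
    uint32_t m = (uint32_t)1 << (target - 32);
    for (size_t i = 0; i < n; i++)
      s[i].hi ^= m;
  }
}
```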

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

  unsigned long reverse(unsigned v) {
      unsigned t;
      t = v ^ ((v << 16) | (v >> 16));
      t &= ~0xff0000;
      v = (v << 24) | (v >> 8);
      return v ^ (t >> 8);
  }
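Tracing bytes shows why: writing v as bytes A B C D (most significant first), the function produces D C B A. A sketch with the trace in comments, using uint32_t since the idiom assumes a 32-bit unsigned; `byte_swap32` is a hypothetical reference implementation:

```c
#include <assert.h>
#include <stdint.h>

/* The idiom above; byte traces list the most significant byte first. */
uint32_t reverse(uint32_t v) {
  uint32_t t;
  t = v ^ ((v << 16) | (v >> 16)); /* A^C, B^D, C^A, D^B */
  t &= ~0xff0000u;                 /* A^C, 0,   C^A, D^B */
  v = (v << 24) | (v >> 8);        /* rotate right by 8: D, A, B, C */
  return v ^ (t >> 8);             /* D, A^(A^C), B, C^(C^A) = D, C, B, A */
}

/* Straightforward byte swap for comparison. */
uint32_t byte_swap32(uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0xff00u) | ((v << 8) & 0xff0000u) | (v << 24);
}
```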

//===---------------------------------------------------------------------===//

These idioms should be recognized as popcount (see PR1488):

  unsigned countbits_slow(unsigned v) {
    unsigned c;
    for (c = 0; v; v >>= 1)
      c += v & 1;
    return c;
  }
  unsigned countbits_fast(unsigned v) {
    unsigned c;
    for (c = 0; v; c++)
      v &= v - 1; // clear the least significant bit set
    return c;
  }

  typedef unsigned long long BITBOARD;
  int PopCnt(register BITBOARD a) {
    register int c = 0;
    while (a) {
      c++;
      a &= a - 1;
    }
    return c;
  }
  unsigned int popcount(unsigned int input) {
    unsigned int count = 0;
    for (unsigned int i = 0; i < 4 * 8; i++)
      count += (input >> i) & 1;
    return count;
  }

This is a form of idiom recognition for loops, the same thing that could be
useful for recognizing memset/memcpy.
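What the recognized form could look like, as a sketch: Kernighan's loop from above next to a call to `__builtin_popcount`, a GCC/Clang extension that Clang lowers to the llvm.ctpop intrinsic. The name `countbits_recognized` is illustrative:

```c
#include <assert.h>

/* Kernighan's loop: one iteration per set bit. */
unsigned countbits_fast(unsigned v) {
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; /* clear the least significant set bit */
  return c;
}

/* What loop-idiom recognition would ideally reduce the loop to. */
unsigned countbits_recognized(unsigned v) {
  return (unsigned)__builtin_popcount(v);
}
```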

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big-endian
processors.

  unsigned short read_16_le(const unsigned char *adr) {
    return adr[0] | (adr[1] << 8);
  }
  unsigned short read_16_be(const unsigned char *adr) {
    return (adr[0] << 8) | adr[1];
  }
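A portable sketch of the single-load form, assuming the host's byte order is detected at run time: memcpy expresses a possibly unaligned 16-bit load, which GCC and Clang typically lower to one load instruction, and a byte swap is applied only when the requested order differs from the host's. The helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 1 if the lowest-addressed byte of a uint16_t holds its low 8 bits. */
int host_is_little_endian(void) {
  uint16_t one = 1;
  unsigned char b;
  memcpy(&b, &one, 1);
  return b == 1;
}

uint16_t swap16(uint16_t v) { return (uint16_t)((v << 8) | (v >> 8)); }

/* One (possibly unaligned) 16-bit load, then a swap if needed. */
uint16_t load_16_le(const unsigned char *adr) {
  uint16_t v;
  memcpy(&v, adr, sizeof v);
  return host_is_little_endian() ? v : swap16(v);
}

uint16_t load_16_be(const unsigned char *adr) {
  uint16_t v;
  memcpy(&v, adr, sizeof v);
  return host_is_little_endian() ? swap16(v) : v;
}
```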

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned. Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match. See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.
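For the case where the signs already match (all unsigned), the fold rests on a simple identity: for c1 > 0 and c1*c2 not overflowing, x / c1 < c2 holds exactly when x < c1*c2, because c2 is an integer and the floor of x/c1 cannot cross it. A C sketch of that identity with hypothetical names; the mixed-sign cases this note asks for need extra care with rounding toward zero and are not covered here:

```c
#include <assert.h>

/* Unsigned compare of a quotient against a constant. */
int lt_via_div(unsigned x, unsigned c1, unsigned c2) { return x / c1 < c2; }

/* Folded form: valid for c1 > 0 when c1 * c2 does not overflow. */
int lt_folded(unsigned x, unsigned c1, unsigned c2) { return x < c1 * c2; }
```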

//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

viterbi speeds up significantly if the various "history" related copy loops
are turned into memcpy calls at the source level. We need a "loops to memcpy"
pass.
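A sketch of the rewrite such a pass would perform, using a hypothetical copy loop (the actual viterbi loops are not reproduced here). The rewrite is only valid when the source and destination ranges do not overlap:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* An element-by-element copy loop of the kind the pass would look for. */
void copy_history_loop(int *dst, const int *src, size_t n) {
  for (size_t i = 0; i < n; ++i)
    dst[i] = src[i];
}

/* The recognized form: one memcpy of n elements (non-overlapping only). */
void copy_history_memcpy(int *dst, const int *src, size_t n) {
  memcpy(dst, src, n * sizeof *dst);
}
```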

//===---------------------------------------------------------------------===//

[LOOP OPTIMIZATION]

SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization
opportunities in its double_array_divs_variable function: it needs loop
interchange, memory promotion (which LICM already does), vectorization and
variable trip count loop unrolling (since it has a constant trip count). ICC
apparently produces this very nice code with -ffast-math:

  ..B1.70:                # Preds ..B1.70 ..B1.69
          mulpd     %xmm0, %xmm1                          #108.2
          mulpd     %xmm0, %xmm1                          #108.2
          mulpd     %xmm0, %xmm1                          #108.2
          mulpd     %xmm0, %xmm1                          #108.2
          addl      $8, %edx                              #
          cmpl      $131072, %edx                         #108.2
          jb        ..B1.70       # Prob 99%              #108.2

It would be better to count down to zero, but this is a lot better than what we
do.

//===---------------------------------------------------------------------===//

Consider:

  typedef unsigned U32;
  typedef unsigned long long U64;
  int test (U32 *inst, U64 *regs) {
      U64 effective_addr2;
      U32 temp = *inst;
      int r1 = (temp >> 20) & 0xf;
      int b2 = (temp >> 16) & 0xf;
      effective_addr2 = temp & 0xfff;
      if (b2) effective_addr2 += regs[b2];
      b2 = (temp >> 12) & 0xf;
      if (b2) effective_addr2 += regs[b2];
      effective_addr2 &= regs[4];
      if ((effective_addr2 & 3) == 0)
          return 1;
      return 0;
  }

Note that only the low 2 bits of effective_addr2 are used. On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection DAGs. On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

PHI Slicing could be extended to do this.
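A C sketch of why truncation is safe: bit k of a sum or of an AND depends only on bits 0..k of the operands, so computing everything in 32 bits gives the same low 2 bits. The unused r1 is dropped, and the names test_u64/test_u32 are hypothetical:

```c
#include <assert.h>

typedef unsigned U32;
typedef unsigned long long U64;

/* The testcase above, with effective_addr2 declared as U64. */
int test_u64(U32 *inst, U64 *regs) {
  U64 effective_addr2;
  U32 temp = *inst;
  int b2 = (temp >> 16) & 0xf;
  effective_addr2 = temp & 0xfff;
  if (b2) effective_addr2 += regs[b2];
  b2 = (temp >> 12) & 0xf;
  if (b2) effective_addr2 += regs[b2];
  effective_addr2 &= regs[4];
  return (effective_addr2 & 3) == 0;
}

/* Same computation in 32 bits: every operand is truncated, which cannot
   change the low 2 bits that decide the result. */
int test_u32(U32 *inst, U64 *regs) {
  U32 effective_addr2;
  U32 temp = *inst;
  int b2 = (temp >> 16) & 0xf;
  effective_addr2 = temp & 0xfff;
  if (b2) effective_addr2 += (U32)regs[b2];
  b2 = (temp >> 12) & 0xf;
  if (b2) effective_addr2 += (U32)regs[b2];
  effective_addr2 &= (U32)regs[4];
  return (effective_addr2 & 3) == 0;
}
```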
				415
Chris Lattner	03a6d96	2007-01-16 06:39:48 +0000	[diff] [blame]	416	//===---------------------------------------------------------------------===//
				417
Chris Lattner	9c6a0dc	2009-11-26 01:51:18 +0000	[diff] [blame]	418	LSR should know what GPR types a target has from TargetData. This code:
Chris Lattner	1a77a55	2007-03-24 06:01:32 +0000	[diff] [blame]	419
				420	volatile short X, Y; // globals
				421
				422	void foo(int N) {
				423	int i;
				424	for (i = 0; i < N; i++) { X = i; Y = i*4; }
				425	}
				426
Chris Lattner	c1491f3	2009-09-20 17:37:38 +0000	[diff] [blame]	427	produces two near identical IV's (after promotion) on PPC/ARM:
Chris Lattner	1a77a55	2007-03-24 06:01:32 +0000	[diff] [blame]	428
Chris Lattner	c1491f3	2009-09-20 17:37:38 +0000	[diff] [blame]	429	LBB1_2:
				430	ldr r3, LCPI1_0
				431	ldr r3, [r3]
				432	strh r2, [r3]
				433	ldr r3, LCPI1_1
				434	ldr r3, [r3]
				435	strh r1, [r3]
				436	add r1, r1, #4
				437	add r2, r2, #1 <- [0,+,1]
				438	sub r0, r0, #1 <- [0,-,1]
				439	cmp r0, #0
				440	bne LBB1_2
				441
				442	LSR should reuse the "+" IV for the exit test.
Chris Lattner	1a77a55	2007-03-24 06:01:32 +0000	[diff] [blame]	443
Chris Lattner	1a77a55	2007-03-24 06:01:32 +0000	[diff] [blame]	444	//===---------------------------------------------------------------------===//
				445
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	446	Tail call elim should be more aggressive, checking to see if the call is
				447	followed by an uncond branch to an exit block.
				448
				449	; This testcase is due to tail-duplication not wanting to copy the return
				450	; instruction into the terminating blocks because there was other code
				451	; optimized out of the function after the taildup happened.
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	452	; RUN: llvm-as < %s \| opt -tailcallelim \| llvm-dis \| not grep call
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	453
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	454	define i32 @t4(i32 %a) {
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	455	entry:
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	456	%tmp.1 = and i32 %a, 1 ; <i32> [#uses=1]
				457	%tmp.2 = icmp ne i32 %tmp.1, 0 ; <i1> [#uses=1]
				458	br i1 %tmp.2, label %then.0, label %else.0
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	459
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	460	then.0: ; preds = %entry
				461	%tmp.5 = add i32 %a, -1 ; <i32> [#uses=1]
				462	%tmp.3 = call i32 @t4( i32 %tmp.5 ) ; <i32> [#uses=1]
				463	br label %return
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	464
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	465	else.0: ; preds = %entry
				466	%tmp.7 = icmp ne i32 %a, 0 ; <i1> [#uses=1]
				467	br i1 %tmp.7, label %then.1, label %return
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	468
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	469	then.1: ; preds = %else.0
				470	%tmp.11 = add i32 %a, -2 ; <i32> [#uses=1]
				471	%tmp.9 = call i32 @t4( i32 %tmp.11 ) ; <i32> [#uses=1]
				472	br label %return
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	473
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	474	return: ; preds = %then.1, %else.0, %then.0
				475	%result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	476	[ %tmp.9, %then.1 ]
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	477	ret i32 %result.0
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	478	}
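In C terms, both calls in @t4 are tail calls, since their results flow straight into the return phi. A sketch of the function and of the loop that tail call elimination should turn it into (names illustrative; like the IR, the recursion only terminates for non-negative a):

```c
#include <assert.h>

/* C equivalent of @t4: both recursive calls are in tail position. */
int t4(int a) {
  if (a & 1)
    return t4(a - 1);
  if (a != 0)
    return t4(a - 2);
  return 0;
}

/* After tail call elimination: the argument is updated in place and
   control jumps back to the top instead of recursing. */
int t4_loop(int a) {
  for (;;) {
    if (a & 1)
      a = a - 1;
    else if (a != 0)
      a = a - 2;
    else
      return 0;
  }
}
```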

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

  int pow2m1(int n) {
    if (n == 0)
      return 0;
    return 2 * pow2m1 (n - 1) + 1;
  }

Also, multiplies can be turned into SHLs, so they should be handled as if
they were associative. "return foo() << 1" can be tail recursion eliminated.
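One way to handle the `2 * f(n - 1) + 1` shape is to carry the pending work as an affine map r -> mul*r + add: composing affine maps yields another affine map, so a single (mul, add) pair can replace the call stack. A sketch under that assumption, using unsigned so wraparound is defined; `pow2m1_loop` is a hypothetical name:

```c
#include <assert.h>

/* The recursive source, using unsigned so overflow wraps. */
unsigned pow2m1(unsigned n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1(n - 1) + 1;
}

/* Loop form. Invariant: answer == mul * pow2m1(n) + add. */
unsigned pow2m1_loop(unsigned n) {
  unsigned mul = 1, add = 0;
  while (n != 0) {
    add += mul; /* mul*(2r + 1) + add == (2*mul)*r + (mul + add) */
    mul *= 2;
    n--;
  }
  return add; /* pow2m1(0) == 0, so the answer is just add */
}
```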

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
  %tmp = load i32* %x                   ; <i32> [#uses=0]
  %tmp.foo = call i32 @foo( i32* %x )   ; <i32> [#uses=1]
  ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
  %tmp3 = call i32 @foo( i32* %x )      ; <i32> [#uses=1]
  ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass. Consider this silly
example in PIC mode:

  #include <assert.h>
  void foo(int x) {
    assert(x);
    //...
  }

we compile this to:
  _foo:
          subl    $28, %esp
          call    "L1$pb"
  "L1$pb":
          popl    %eax
          cmpl    $0, 32(%esp)
          je      LBB1_2  # cond_true
  LBB1_1: # return
          # ...
          addl    $28, %esp
          ret
  LBB1_2: # cond_true
  ...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block. It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it. This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86. If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations. On a Yonah, this loop:

  double a[256];
  void foo() {
    int i, b;
    for (b = 0; b < 10000000; b++)
      for (i = 0; i < 256; i++)
        a[i] = -a[i];
  }

is twice as slow as this loop:

  long long a[256];
  void foo() {
    int i, b;
    for (b = 0; b < 10000000; b++)
      for (i = 0; i < 256; i++)
        a[i] ^= (1ULL << 63);
  }

and I suspect other processors are similar. On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.
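A sketch of the integer forms, assuming double is IEEE-754 binary64: fneg flips the sign bit with XOR and fabs clears it with AND. memcpy is the portable way to reinterpret the bits in C, and compilers typically turn it into a plain move; the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* fneg as an integer op: flip bit 63, like a[i] ^= (1ULL << 63) above. */
double neg_via_int(double x) {
  uint64_t bits;
  memcpy(&bits, &x, sizeof bits);
  bits ^= UINT64_C(1) << 63;
  memcpy(&x, &bits, sizeof x);
  return x;
}

/* fabs as an integer op: clear bit 63. */
double fabs_via_int(double x) {
  uint64_t bits;
  memcpy(&bits, &x, sizeof bits);
  bits &= ~(UINT64_C(1) << 63);
  memcpy(&x, &bits, sizeof x);
  return x;
}
```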

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable. For example, we compile this C++ example:

  struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
  extern THotKey m_HotKey;
  THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

  __Z9GetHotKeyv:
          pushl   %esi
          movl    8(%esp), %eax
          movb    _m_HotKey+3, %cl
          movb    _m_HotKey+4, %dl
          movb    _m_HotKey+2, %ch
          movw    _m_HotKey, %si
          movw    %si, (%eax)
          movb    %ch, 2(%eax)
          movb    %cl, 3(%eax)
          movb    %dl, 4(%eax)
          popl    %esi
          ret     $4

GCC produces:

  __Z9GetHotKeyv:
          movl    _m_HotKey, %edx
          movl    4(%esp), %eax
          movl    %edx, (%eax)
          movzwl  _m_HotKey+4, %edx
          movw    %dx, 4(%eax)
          ret     $4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

  %struct.THotKey = type { i16, i8, i8, i8 }
  define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
  ...
    %tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
    %tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
    %tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
    %tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

				638	//===---------------------------------------------------------------------===//
				639
				640	Consider:
				641
				642	int test() {
				643	long long input[8] = {1,1,1,1,1,1,1,1};
				644	foo(input);
				645	}
				646
				647	We currently compile this into a memcpy from a global array since the
				648	initializer is fairly large and not memset'able. This is good, but the memcpy
				649	gets lowered to load/stores in the code generator. This is also ok, except
				650	that the codegen lowering for memcpy doesn't handle the case when the source
				651	is a constant global. This gives us atrocious code like this:
				652
				653	call "L1$pb"
				654	"L1$pb":
				655	popl %eax
				656	movl _C.0.1444-"L1$pb"+32(%eax), %ecx
				657	movl %ecx, 40(%esp)
				658	movl _C.0.1444-"L1$pb"+20(%eax), %ecx
				659	movl %ecx, 28(%esp)
				660	movl _C.0.1444-"L1$pb"+36(%eax), %ecx
				661	movl %ecx, 44(%esp)
				662	movl _C.0.1444-"L1$pb"+44(%eax), %ecx
				663	movl %ecx, 52(%esp)
				664	movl _C.0.1444-"L1$pb"+40(%eax), %ecx
				665	movl %ecx, 48(%esp)
				666	movl _C.0.1444-"L1$pb"+12(%eax), %ecx
				667	movl %ecx, 20(%esp)
				668	movl _C.0.1444-"L1$pb"+4(%eax), %ecx
				669	...
				670
				671	instead of:
				672	movl $1, 16(%esp)
				673	movl $0, 20(%esp)
				674	movl $1, 24(%esp)
				675	movl $0, 28(%esp)
				676	movl $1, 32(%esp)
				677	movl $0, 36(%esp)
				678	...

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef". Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop. One trivial example is:

#include <stdio.h>
int main() {
  int nRet = 17;
  int nLoop;
  for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
    if ( nLoop & 1 )
      nRet += 2;
    else
      nRet -= 1;
  }
  return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size. The resultant code would then also be suitable for
exit value computation.
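A hand-written C sketch of what unrolling by 2 and then computing the exit
value would produce for this loop. It assumes the trip count of 1000 is known
and even, so no remainder iteration is needed. The function names are
illustrative and are not anything in LLVM:

```c
/* The loop as written: the '& 1' test runs on every iteration. */
int rolled(void) {
  int nRet = 17;
  for (int nLoop = 0; nLoop < 1000; nLoop++) {
    if (nLoop & 1)
      nRet += 2;
    else
      nRet -= 1;
  }
  return nRet;
}

/* Unrolled by 2: the first copy always sees an even nLoop and the second an
   odd one, so both '& 1' tests fold to constants and disappear. */
int unrolled_by_2(void) {
  int nRet = 17;
  for (int nLoop = 0; nLoop < 1000; nLoop += 2) {
    nRet -= 1; /* even iteration */
    nRet += 2; /* odd iteration */
  }
  return nRet;
}

/* Each unrolled trip adds a net +1 and there are 500 trips, so exit value
   computation can replace the whole loop with a constant. */
int exit_value(void) {
  return 17 + 1000 / 2;
}
```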

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc, x86,
etc. On X86, we miss a bunch of 'rotate by variable' cases because the rotate
matching code in dag combine doesn't look through truncates aggressively
enough. Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y){
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x){
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x){
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f2/f3/f4 right. On x86-32, a few of these
generate truly horrible code, instead of using shld and friends. On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness. PPC64 misses f, f5 and f6. CellSPU aborts in isel.
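For reference, here is a C sketch of a common portable way to write a variable
rotate so that it is well defined for every shift amount. The testcases above
shift by 64 (or 32) when y is 0, which is undefined in C. The masked form is an
illustration of the idiom, not a description of what the DAG combiner
currently matches:

```c
#include <stdint.h>

/* Rotate left by y. Masking both shift amounts keeps each shift below the
   bit width, so y == 0 (and y >= width) is well defined. */
uint64_t rotl64(uint64_t x, unsigned y) {
  return (x << (y & 63)) | (x >> (-y & 63));
}

uint32_t rotl32(uint32_t x, unsigned y) {
  return (x << (y & 31)) | (x >> (-y & 31));
}
```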

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together. For
example, it is useful to merge memcpy(a,b,strlen(b)+1) -> strcpy(a,b) (the +1
copies the terminating nul, which strcpy also writes). This can only be done
safely if "b" isn't modified between the strlen and memcpy, of course.

//===---------------------------------------------------------------------===//

We compile this program: (from GCC PR11680)
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

Into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0% 0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0% 0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2). */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper). Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

This is equivalent to the following, where 2863311531 is the multiplicative
inverse of 3 modulo 2^32, and 1431655766 is ((2^32)-1)/3+1:
void bar(unsigned n) {
  if (n * 2863311531U < 1431655766U)
    true();
}

The same transformation can work with an even modulo with the addition of a
rotate: rotate the result of the multiply to the right by the number of bits
which need to be zero for the condition to be true, and shrink the compare RHS
by the same amount. Unless the target supports rotates, though, that
transformation probably isn't worthwhile.

The transformation can also easily be made to work with non-zero equality
comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0".
For unsigned n this also needs an "n != 0" check, because n-1 wraps to
0xFFFFFFFF when n is 0, and 0xFFFFFFFF is itself a multiple of 3.
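A small C sketch that checks the multiplicative-inverse form against plain %.
It assumes a 32-bit unsigned type (uint32_t) and also covers the "n % 3 == 1"
case. The helper names are made up for illustration:

```c
#include <stdint.h>

/* n % 3 == 0 without a division or a high multiply: 2863311531 * 3 == 1
   (mod 2^32), so multiplying by it maps the multiples of 3 one-to-one onto
   [0, 1431655765] and every other n above that range. */
int divisible_by_3(uint32_t n) {
  return n * 2863311531u < 1431655766u;
}

/* n % 3 == 1 via the (n - 1) % 3 == 0 rewrite. The n != 0 guard is needed
   because n - 1 wraps to 0xFFFFFFFF, which is itself a multiple of 3. */
int remainder_1_mod_3(uint32_t n) {
  return n != 0 && divisible_by_3(n - 1);
}
```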

//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6       ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}
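The two IR functions, transliterated to C so the equivalence can be checked
exhaustively. Widening to unsigned plays the role of the zext to i9; the
function names are illustrative:

```c
#include <stdint.h>

/* @select: 0 if x < 250, else 1. */
uint8_t via_select(uint8_t x) {
  return x < 250 ? 0 : 1;
}

/* @addshr: widen, add 256 - 250 = 6, and take bit 8, which is set exactly
   when x + 6 >= 256, i.e. when x >= 250. */
uint8_t via_add_shift(uint8_t x) {
  return (uint8_t)(((unsigned)x + 6) >> 8);
}
```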

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0. Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".
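A C sketch that checks the combined form against the original on small
inputs. The names are made up, since the bug report reuses "f" for both
variants:

```c
/* The form from the bug report (the || variant; | behaves the same here). */
int either_masked_nonzero(unsigned long a, unsigned long b, unsigned long c) {
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}

/* Combined form: (a & m) | (b & m) == (a | b) & m, so the OR of the two
   tests is a single test on (a | b) & m. */
int combined_masked_nonzero(unsigned long a, unsigned long b, unsigned long c) {
  return ((a | b) & (c - 1)) != 0;
}
```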

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
  if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
    f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK)". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 3756:
int
pn (int n)
{
  return (n >= 0 ? 1 : -1);
}
Should combine to (n >> 31) | 1. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".
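A C sketch of the combined form. It assumes a 32-bit int and that
right-shifting a negative int is an arithmetic shift. C leaves that
implementation-defined (GCC and Clang do shift arithmetically), whereas at the
IR level an ashr is exact. The pn_branchless name is illustrative:

```c
/* Reference version from the bug report. */
int pn(int n) {
  return (n >= 0 ? 1 : -1);
}

/* n >> 31 is 0 for n >= 0 and -1 (all ones) for n < 0, so OR-ing in 1
   gives 1 or -1 without a branch. */
int pn_branchless(int n) {
  return (n >> 31) | 1;
}
```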

//===---------------------------------------------------------------------===//

void a(int variable)
{
  if (variable == 4 || variable == 6)
    bar();
}
This should optimize to "if ((variable | 2) == 6)". Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts | llc".
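Why the single compare works, as a C sketch with illustrative names: 4 and 6
differ only in bit 1, so forcing that bit on and comparing against 6 accepts
exactly those two values.

```c
/* Original: two compares. */
int is_4_or_6(int v) {
  return v == 4 || v == 6;
}

/* (v | 2) == 6 requires every bit of v other than bit 1 to match 6 (0b110)
   and leaves bit 1 free, so it accepts exactly 4 and 6. */
int is_4_or_6_one_compare(int v) {
  return (v | 2) == 6;
}
```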

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing. Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
  return (abs(x)) >= 0;
}
This should optimize to x != INT_MIN. (With -fwrapv.) Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
  a = (a << 10) | (a >> 22);
  if (a == 123)
    bar ();
}
void
minus_cst (unsigned int a)
{
  unsigned int tem;

  tem = 20 - a;
  if (tem == 5)
    bar ();
}
void
mask_gt (unsigned int a)
{
  /* This is equivalent to a > 15. */
  if ((a & ~7) > 8)
    bar ();
}
void
rshift_gt (unsigned int a)
{
  /* This is equivalent to a > 23. */
  if ((a >> 2) > 5)
    bar ();
}
All should simplify to a single comparison. All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".
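A C sketch of the single-comparison form for each of the four tests, assuming
32-bit unsigned arithmetic (uint32_t). The *_orig/*_one names are made up for
illustration:

```c
#include <stdint.h>

/* Each *_orig is the condition from the testcase; each *_one is a single
   comparison against a constant. */
int rotate_cst_orig(uint32_t a) { return ((a << 10) | (a >> 22)) == 123; }
/* A rotate is a bijection, so rotate the constant right by 10 instead. */
int rotate_cst_one(uint32_t a) { return a == ((123u >> 10) | (123u << 22)); }

int minus_cst_orig(uint32_t a) { return 20u - a == 5; }
/* Subtraction mod 2^32 is also a bijection: 20 - a == 5 iff a == 15. */
int minus_cst_one(uint32_t a) { return a == 15; }

int mask_gt_orig(uint32_t a) { return (a & ~7u) > 8; }
/* a & ~7 is a multiple of 8, so "> 8" means ">= 16", i.e. a > 15. */
int mask_gt_one(uint32_t a) { return a > 15; }

int rshift_gt_orig(uint32_t a) { return (a >> 2) > 5; }
/* (a >> 2) >= 6 iff a >= 24, i.e. a > 23. */
int rshift_gt_one(uint32_t a) { return a > 23; }
```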

//===---------------------------------------------------------------------===//

From GCC Bug 32605:
int c(int* x) {return (char)x+2 == (char)x;}
Should combine to 0. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1". Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)". Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a != 0". Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b != 0 : c != 0", or at least something sane. Currently
not optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c). Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -1 : -9;}
Should combine to (x | -9) ^ 8. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".
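A C sketch showing why the bitwise form matches the select. The names are
illustrative:

```c
/* Select form from the testcase. */
int select_form(int x) {
  return (x & 8) == 0 ? -1 : -9;
}

/* -9 is all ones except bit 3, so x | -9 is -9 when bit 3 of x is clear and
   -1 when it is set; ^ 8 then swaps those two values, matching the select. */
int bitwise_form(int x) {
  return (x | -9) ^ 8;
}
```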

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -9 : -1;}
Should combine to x | -9. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31". Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)". Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

  %tmp = icmp eq i32 %decl_context, 4
  %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
  %tmp1 = icmp eq i32 %decl_context_addr.0, 1
  %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp && decl_context == 1)
which is just (decl_context == 1), since decl_context == 1 already implies
!tmp.

This would enable further simplifications, since tmp1 is used all over the
place in the function, e.g. later by:

  %tmp23 = icmp eq i32 %decl_context_addr.1, 0    ; <i1> [#uses=1]
  %tmp24 = xor i1 %tmp1, true    ; <i1> [#uses=1]
  %or.cond8 = and i1 %tmp23, %tmp24    ; <i1> [#uses=1]
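The entry-block IR transliterated to C, as a sketch for checking the
simplification. Here dc stands for %decl_context, and the function names are
made up:

```c
/* Mirrors the IR: addr0 = (dc == 4) ? 3 : dc; tmp1 = (addr0 == 1). */
int tmp1_as_written(int dc) {
  int tmp = (dc == 4);
  int addr0 = tmp ? 3 : dc;  /* %decl_context_addr.0 */
  return addr0 == 1;         /* %tmp1 */
}

/* On the tmp path the select yields 3, never 1, so tmp1 is
   (!tmp && dc == 1); and dc == 1 already implies dc != 4. */
int tmp1_simplified(int dc) {
  return dc == 1;
}
```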

//===---------------------------------------------------------------------===//

[STORE SINKING]

Store sinking: This code:

void f (int n, int *cond, int *res) {
  int i;
  *res = 0;
  for (i = 0; i < n; i++)
    if (*cond)
      *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out. This gives us this code:

bb:    ; preds = %bb2, %entry
  %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
  %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
  %1 = load i32* %cond, align 4
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %bb2, label %bb1

bb1:    ; preds = %bb
  %3 = xor i32 %.rle, 234
  store i32 %3, i32* %res, align 4
  br label %bb2

bb2:    ; preds = %bb, %bb1
  %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
  %indvar.next = add i32 %i.05, 1
  %exitcond = icmp eq i32 %indvar.next, %n
  br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partial dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395
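A hand-written C sketch of the result we would like for the first store
sinking testcase: *res is kept in a local across the loop and stored once at
the end. This is only valid when res and cond do not alias; if they did, the
store in the loop would change *cond. The function names are illustrative:

```c
/* Original form from the testcase. */
void f_orig(int n, int *cond, int *res) {
  int i;
  *res = 0;
  for (i = 0; i < n; i++)
    if (*cond)
      *res ^= 234;
}

/* After sinking the partially dead store out of the loop (assumes res and
   cond point to different objects). */
void f_sunk(int n, int *cond, int *res) {
  int tmp = 0;
  for (int i = 0; i < n; i++)
    if (*cond)
      tmp ^= 234;
  *res = tmp;
}
```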

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once, above the if, to reduce code size.
This is GCC PR38204.
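A C sketch of the shape we would prefer for the scalar PRE testcase: a single
multiply above the if, reused on both paths. The function names are
illustrative:

```c
/* As written: b * c is computed on the 'then' path and again after the join. */
int test_orig(int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

/* Desired shape: one multiply above the branch. */
int test_hoisted(int a, int b, int c, int g) {
  int bc = b * c;
  int d = a ? bc : b - c;
  int e = bc + g;
  return d + e;
}
```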

//===---------------------------------------------------------------------===//

[STORE SINKING]

GCC PR37810 is an interesting case where we should sink the load/store reload
into the if block and out of the loop, so we don't reload and store *P on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store. We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[LOAD PRE CRIT EDGE SPLITTING]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack,
leading to excess stack traffic. This could be handled by GVN with some crazy
symbolic phi translation. The code we get looks like (g is on the stack):

bb2:    ; preds = %bb1
  ..
  %9 = getelementptr %struct.f* %g, i32 0, i32 0
  store i32 %8, i32* %9, align 4
  br label %bb3

bb3:    ; preds = %bb1, %bb2, %bb
  %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
  %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
  %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
  %11 = load i32* %10, align 4

%11 is partially redundant, and in bb2 it should have the value %8.

GCC PR33344 and PR35287 are similar cases.

//===---------------------------------------------------------------------===//

[LOAD PRE]

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite; the ones we don't get yet are (checked through loadpre25):

[CRIT EDGE BREAKING]
loadpre3.c predcom-4.c

[PRE OF READONLY CALL]
loadpre5.c

[TURN SELECT INTO BRANCH]
loadpre14.c loadpre15.c

actually a conditional increment: loadpre18.c loadpre19.c

//===---------------------------------------------------------------------===//

[SCALAR PRE]
There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the
GCC testsuite.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite. For example, we get the first example in predcom-1.c, but
miss the second one:

unsigned fib[1000];
unsigned avg[1000];

__attribute__ ((noinline))
void count_averages(int n) {
  int i;
  for (i = 1; i < n; i++)
    avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;
}

which compiles into two loads instead of one in the loop.
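A hand-written C sketch of what predictive commoning could do with
count_averages. Earlier iterations have already loaded fib[i - 1] and fib[i],
so those values can be carried in locals and only fib[i + 1] loaded per
iteration. The count_averages_pc name is made up for illustration:

```c
unsigned fib[1000];
unsigned avg[1000];

/* Reference: three loads from fib per iteration. */
void count_averages(int n) {
  int i;
  for (i = 1; i < n; i++)
    avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;
}

/* Predictive commoning by hand: carry the two previously loaded elements in
   locals, so each iteration loads only fib[i + 1]. Like the original, this
   needs n <= 999 so fib[i + 1] stays in bounds. */
void count_averages_pc(int n) {
  if (n < 2)
    return;
  unsigned prev = fib[0], cur = fib[1];
  for (int i = 1; i < n; i++) {
    unsigned next = fib[i + 1];
    avg[i] = (((unsigned long) prev + cur + next) / 3) & 0xffff;
    prev = cur;
    cur = next;
  }
}
```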

predcom-2.c is the same as predcom-1.c.

predcom-3.c is very similar but needs loads feeding each other instead of
store->load.

//===---------------------------------------------------------------------===//

[ALIAS ANALYSIS]

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

We should do better analysis of posix_memalign. At the least, it should mark
its pointer argument no-capture; at best, we should know that the out-value
result doesn't point to anything (like malloc). One example of this is in
SingleSource/Benchmarks/Misc/dt.c

//===---------------------------------------------------------------------===//

The structs a and b get pinned to the stack because we turn an if/then into a
select instead of PRE'ing the load/store. This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

struct X { int i; };
int foo (int x) {
  struct X a;
  struct X b;
  struct X *p;
  a.i = 1;
  b.i = 2;
  if (x)
    p = &a;
  else
    p = &b;
  return p->i;
}

//===---------------------------------------------------------------------===//

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
      opt -mem2reg -gvn -instcombine | llvm-dis
we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, since loading from null is
undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "") -> strlen(x)
strcspn("", x) -> 0
strspn("", x) -> 0
strspn(x, "") -> 0
strcspn(x, "a") -> strchr(x, 'a')-x   (if x contains 'a'; otherwise strlen(x))

strcspn(x, "abc") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character. See PR3253 for some notes on
codegen.

456.hmmer apparently uses strcspn and strspn a lot. 471.omnetpp uses strspn.
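A C sketch of what the inlined loop could look like once it becomes a switch
on the character. The reject set "+-*" and the function name are made up for
illustration. A switch needs constant case labels, so this form only applies
when the set argument is a constant string:

```c
#include <stddef.h>

/* Hypothetical inlined strcspn(s, "+-*"): count leading bytes until a nul
   or one of the reject characters is found. */
size_t strcspn_plus_minus_star(const char *s) {
  size_t n = 0;
  for (;; n++) {
    switch (s[n]) {
    case '\0':
    case '+':
    case '-':
    case '*':
      return n;
    default:
      break; /* not a stop character: keep scanning */
    }
  }
}
```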

//===---------------------------------------------------------------------===//

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
  ..
  else if (strchr ("<>", *intel_parser.op_string))

Those should be turned into a switch.
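A C sketch of the switch form of the first gas test. One subtlety is that
strchr also matches the terminating nul of the set string, so a faithful
switch needs a case for '\0'. The function name is illustrative:

```c
// Switch form of strchr("+-/*%|&^:[]()~", c) != NULL. The '\0' case is
// there because strchr treats the terminating nul as part of the string.
int is_gas_operator_char(char c) {
  switch (c) {
  case '\0':
  case '+': case '-': case '/': case '*': case '%':
  case '|': case '&': case '^': case ':':
  case '[': case ']': case '(': case ')': case '~':
    return 1;
  default:
    return 0;
  }
}
```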

//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

  %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
  %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
  %strlen = call i32 @strlen(i8* %3072)    ; uses = 1
  %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
  call void @llvm.memcpy.i32(i8* %endptr,
        i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
  %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple of reasons. First, in this:

  %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
  %strlen = call i32 @strlen(i8* %3072)

The strlen could be replaced with a subtraction if the strcpy were turned into
a stpcpy, which returns a pointer to the terminating nul of the destination
(strcpy itself just returns its destination, %3072): %strlen becomes the
stpcpy result minus %3072. The endptr GEP then becomes equal to the stpcpy
result, which eliminates a strlen call and a GEP.

Second, the strlen after the memcpy can be replaced with:

  %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

because the constant string was just copied into the buffer at %endptr. This,
in turn, can be constant folded to "4".

In other code, it contains:

  %endptr6978 = bitcast i8* %endptr69 to i32*
  store i32 7107374, i32* %endptr6978, align 1
  %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded. Whatever is producing this should probably
be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:    ; preds = %_ZN18eonImageCalculatorC1Ev.exit
  %682 = getelementptr i8** %argv, i32 6    ; <i8**> [#uses=2]
  %683 = load i8** %682, align 4    ; <i8*> [#uses=4]
  %684 = load i8* %683, align 1    ; <i8> [#uses=1]
  %685 = icmp eq i8 %684, 0    ; <i1> [#uses=1]
  br i1 %685, label %bb10, label %bb9

bb9:    ; preds = %bb8
  %686 = call i32 @strlen(i8* %683) nounwind readonly
  %687 = icmp ugt i32 %686, 254    ; <i1> [#uses=1]
  br i1 %687, label %bb10, label %bb11

bb10:    ; preds = %bb9, %bb8
  %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

//===---------------------------------------------------------------------===//

I see an interesting redundant call to strlen left in 186.crafty:InputMove
which looks like:
  %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0

bb62:    ; preds = %bb55, %bb53
  %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
  %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
  %172 = add i32 %171, -1    ; <i32> [#uses=1]
  %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172

  ... no stores ...
  br i1 %or.cond, label %bb65, label %bb72

bb65:    ; preds = %bb62
  store i8 0, i8* %173, align 1
  br label %bb72

bb72:    ; preds = %bb65, %bb62
  %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
  %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1

Note that on the bb62->bb72 path the %177 strlen call is redundant with the
%171 call, so %177 is partially redundant. At worst, we could shove the %177
strlen call up into the bb65 block, moving it out of the bb62->bb72 path.
However, note that bb65 stores to the string, zeroing out the last byte. This
means that on that path the value of %177 is actually just %171-1. A sub is
cheaper than a strlen!

This pattern repeats several times, basically doing:

  A = strlen(P);
  P[A-1] = 0;
  B = strlen(P);

where it is "obvious" that B = A-1.
				1429
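Here is a minimal C sketch of the fold, with hypothetical helper names. It relies on every byte before the first NUL being nonzero. After P[A-1] is cleared, the first NUL sits at index A-1, so the fold is only meaningful when A > 0:

```c
#include <string.h>

/* Hypothetical helpers modelling the crafty pattern. Both require
   strlen(p) > 0, as the original code does (otherwise p[A-1] is out of bounds). */

/* As compiled today: the second strlen rescans the string. */
size_t chop_last_rescan(char *p) {
    size_t a = strlen(p);   /* A = strlen(P), like %171 */
    p[a - 1] = '\0';        /* P[A-1] = 0, the store in bb65 */
    return strlen(p);       /* B = strlen(P), like %177 */
}

/* Folded: p[0..a-2] are all nonzero (they precede the first NUL) and
   p[a-1] is now NUL, so B is exactly A-1. */
size_t chop_last_folded(char *p) {
    size_t a = strlen(p);
    p[a - 1] = '\0';
    return a - 1;           /* a sub instead of a strlen */
}
```
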
//===---------------------------------------------------------------------===//

186.crafty contains this interesting pattern:

%77 = call i8* @strstr(i8* getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0),
i8* %30)
%phitmp648 = icmp eq i8* %77, getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0)
br i1 %phitmp648, label %bb70, label %bb76

bb70: ; preds = %OptionMatch.exit91, %bb69
%78 = call i32 @strlen(i8* %30) nounwind readonly align 1 ; <i32> [#uses=1]

This is basically:

cststr = "abcdef";
if (strstr(cststr, P) == cststr) {
x = strlen(P);
...

The comparison succeeds exactly when P is a prefix of cststr, so the strstr
call would be significantly cheaper written as:

cststr = "abcdef";
if (strlen(P) <= strlen(cststr) && memcmp(P, cststr, strlen(P)) == 0)
x = strlen(P);

This is memcmp+strlen instead of strstr (strlen(cststr) is a constant). This
also makes the strlen fully redundant.

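Here is a sketch of the rewrite in C, with hypothetical names. strstr(cststr, P) == cststr holds exactly when P is a prefix of cststr, because strstr returns the first match and an empty P matches at the start. The length check keeps memcmp from reading past the end of cststr when P is longer:

```c
#include <stdbool.h>
#include <string.h>

/* Original form: true when strstr finds p at the very start of cststr. */
bool prefix_via_strstr(const char *cststr, const char *p) {
    return strstr(cststr, p) == cststr;
}

/* Rewritten form: one strlen plus one memcmp. The length check stops memcmp
   from reading past the end of cststr when p is longer. */
bool prefix_via_memcmp(const char *cststr, const char *p) {
    size_t n = strlen(p);                  /* the strlen the caller needs anyway */
    return n <= strlen(cststr) && memcmp(p, cststr, n) == 0;
}
```
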
//===---------------------------------------------------------------------===//

186.crafty also contains this code:

%1906 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1907 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1906
%1908 = call i8* @strcpy(i8* %1907, i8* %1905) nounwind align 1
%1909 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1910 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1909

The last strlen is computable from the first one: strcpy returns its
destination (so %1908 is %1907), the new length %1909 is %1906 + strlen(%1905),
and %1910 is %1907 + strlen(%1905), the address of the new terminator.

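Here is a small C check of the strlen arithmetic, with hypothetical function names. After strcpy(dst + strlen(dst), src), the new length is the old length plus strlen(src), so dst does not need to be rescanned. Whether strlen(src) is itself already available depends on the surrounding code:

```c
#include <string.h>

/* Hypothetical helpers modelling %1906-%1909: append src at the end of dst
   and return the new length. dst must have room for the result. */
size_t append_rescan(char *dst, const char *src) {
    size_t old_len = strlen(dst);          /* %1906 */
    strcpy(dst + old_len, src);            /* %1907, %1908 (strcpy returns dst + old_len) */
    return strlen(dst);                    /* %1909: rescans the whole buffer */
}

size_t append_folded(char *dst, const char *src) {
    size_t old_len = strlen(dst);
    strcpy(dst + old_len, src);
    return old_len + strlen(src);          /* no rescan of dst */
}
```
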
//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call @printf(i8* ... @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

memcpy(globalarray, "string", 7);
printf(..., globalarray);

Since printf only reads that memory, the string could be forward substituted
directly into the printf call, eliminating the reads from globalarray. This
pattern occurs frequently in crafty (due to the "DisplayTime" and other similar
functions), so there are many stores to "out". Once all the printfs stop using
"out", all that is left is the memcpys into it. This should allow globalopt to
remove the "stored only" global.

//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
%tmp0 = load i8* %p
%tmp1 = ashr i8 %tmp0, 5
%tmp2 = sext i8 %tmp1 to i32
ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load followed by a shift.
For example, on x86 this currently compiles to:

movb (%eax), %al
sarb $5, %al
movsbl %al, %eax

while it could compile to:

movsbl (%eax), %eax
sarl $5, %eax

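The combine is legal because an i8 arithmetic shift right by 5 yields a value in [-4, 3]. That value fits in i8, so sign-extending after the shift gives the same result as shifting after a sign-extending load. This brute-force C check covers all 256 byte values. The helper names are hypothetical, and a portable arithmetic shift is used because C leaves >> of negative values implementation-defined:

```c
#include <stdint.h>

/* Portable arithmetic shift right: floor(v / 2^k). For negative v, ~v is
   non-negative, so the inner >> is well defined. */
static int32_t ashr32(int32_t v, int k) {
    return v >= 0 ? v >> k : ~(~v >> k);
}

/* Current codegen: shift in 8 bits (sarb), then sign-extend (movsbl).
   The shifted value is in [-4, 3], so narrowing it to int8_t is lossless. */
int32_t shift_then_sext(int8_t byte) {
    int8_t shifted = (int8_t)ashr32(byte, 5);  /* ashr i8 %tmp0, 5 */
    return (int32_t)shifted;                   /* sext i8 to i32 */
}

/* Proposed codegen: sign-extending load (movsbl), then a 32-bit shift (sarl). */
int32_t sext_then_shift(int8_t byte) {
    return ashr32((int32_t)byte, 5);
}

/* Hypothetical checker: the two forms agree for every byte value. */
int sext_shift_forms_agree(void) {
    for (int v = INT8_MIN; v <= INT8_MAX; ++v)
        if (shift_then_sext((int8_t)v) != sext_then_shift((int8_t)v))
            return 0;
    return 1;
}
```
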
//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; } // --> return false
int test2(int x) { return 2-x == x; } // --> return x == 1 ?

Always foldable for odd constants; what is the rule for even ones?

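One way to approach the question, assuming two's-complement wrapping arithmetic: C-x == x means 2*x == C modulo 2^n, which has no solution when C is odd. When C is even there are exactly two solutions, x = (C lshr 1) and x = (C lshr 1) + 2^(n-1). The compare therefore becomes (x & (2^(n-1) - 1)) == (C lshr 1), which for i32 is (x & 0x7fffffff) == (C lshr 1). In the C source, signed overflow in 2-x is undefined, so with nsw the second solution can be ignored and test2 reduces to x == 1. The sketch below checks the wrapping rule exhaustively at 8 bits (names hypothetical):

```c
#include <stdint.h>

/* Does C - x == x hold in 8-bit wrapping arithmetic? */
static int eq_wrapping(uint8_t c, uint8_t x) {
    return (uint8_t)(c - x) == x;
}

/* Candidate fold: never true for odd C; for even C, true exactly when the
   low 7 bits of x equal C/2 (solutions C/2 and C/2 + 128). */
static int eq_folded(uint8_t c, uint8_t x) {
    return (c & 1) == 0 && (x & 0x7F) == (c >> 1);
}

/* Hypothetical checker: compares both for every 8-bit C and x. */
int pr31029_rule_holds(void) {
    for (unsigned c = 0; c < 256; ++c)
        for (unsigned x = 0; x < 256; ++x)
            if (eq_wrapping((uint8_t)c, (uint8_t)x) != eq_folded((uint8_t)c, (uint8_t)x))
                return 0;
    return 1;
}
```
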
//===---------------------------------------------------------------------===//

PR 3381: a GEP to a field of size 0 inside a struct could be turned into a GEP
to the next field in the struct (which is at the same address).

For example, a store of a float into { {{}}, float } could be turned into a
store to the float field directly.

//===---------------------------------------------------------------------===//

#include <math.h>
double foo(double a) { return sin(a); }

This compiles into this on x86-64 Linux:

foo:
subq $8, %rsp
call sin
addq $8, %rsp
ret

when it could simply be a tail call:

foo:
jmp sin

//===---------------------------------------------------------------------===//

The arg promotion pass should make use of nocapture to make its alias analysis
stuff much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int int_char(char m) {if(m>7) return 0; return m;}

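The select computes the same value as the branch. One branch-free way to express it in C is to AND with a mask. This is a sketch of the idea, not what LLVM would emit, and the `_branchless` and checker names are hypothetical:

```c
#include <limits.h>

/* The two functions from PR40072, as written. */
char char_int(int m) { if (m > 7) return 0; return m; }
int int_char(char m) { if (m > 7) return 0; return m; }

/* Branch-free versions: -(m <= 7) is all ones when m <= 7 and zero
   otherwise, so the AND either keeps m or produces 0. */
char char_int_branchless(int m) { return (char)(m & -(m <= 7)); }
int int_char_branchless(char m) { return m & -(m <= 7); }

/* Hypothetical checker comparing each pair over a range of inputs. */
int pr40072_forms_agree(void) {
    for (int m = -1000; m <= 1000; ++m)
        if (char_int(m) != char_int_branchless(m))
            return 0;
    for (int c = CHAR_MIN; c <= CHAR_MAX; ++c)
        if (int_char((char)c) != int_char_branchless((char)c))
            return 0;
    return 1;
}
```
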
//===---------------------------------------------------------------------===//

int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }

This generates:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
%0 = and i32 %a, 128 ; <i32> [#uses=1]
%1 = icmp eq i32 %0, 0 ; <i1> [#uses=1]
%2 = or i32 %b, 128 ; <i32> [#uses=1]
%3 = and i32 %b, -129 ; <i32> [#uses=1]
%b_addr.0 = select i1 %1, i32 %3, i32 %2 ; <i32> [#uses=1]
ret i32 %b_addr.0
}

However, it's functionally equivalent to:

b = (b & ~0x80) | (a & 0x80);

which generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
%0 = and i32 %b, -129 ; <i32> [#uses=1]
%1 = and i32 %a, 128 ; <i32> [#uses=1]
%2 = or i32 %0, %1 ; <i32> [#uses=1]
ret i32 %2
}

This can be generalized to other forms:

b = (b & ~0x80) | (a & 0x40) << 1;

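This brute-force C check confirms that the branchy code and the bit-insert form agree. It uses unsigned operands to sidestep sign questions and reads the generalized form as "copy bit 6 of a into bit 7 of b", which is an assumption. All names are hypothetical:

```c
/* Original: set or clear bit 7 of b depending on bit 7 of a. */
unsigned copy_bit7_branchy(unsigned a, unsigned b) {
    if (a & 0x80) b |= 0x80; else b &= ~0x80u;
    return b;
}

/* Bit-insert form: clear bit 7 of b, then OR in bit 7 of a. */
unsigned copy_bit7_insert(unsigned a, unsigned b) {
    return (b & ~0x80u) | (a & 0x80u);
}

/* Assumed reading of the generalized form: bit 6 of a goes to bit 7 of b. */
unsigned move_bit6_to_bit7_branchy(unsigned a, unsigned b) {
    if (a & 0x40) b |= 0x80; else b &= ~0x80u;
    return b;
}

unsigned move_bit6_to_bit7_insert(unsigned a, unsigned b) {
    return (b & ~0x80u) | ((a & 0x40u) << 1);
}

/* Hypothetical checker over all a, b in [0, 512). */
int bit_insert_forms_agree(void) {
    for (unsigned a = 0; a < 512; ++a)
        for (unsigned b = 0; b < 512; ++b)
            if (copy_bit7_branchy(a, b) != copy_bit7_insert(a, b) ||
                move_bit6_to_bit7_branchy(a, b) != move_bit6_to_bit7_insert(a, b))
                return 0;
    return 1;
}
```
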
//===---------------------------------------------------------------------===//

These two functions produce different code. They shouldn't:

#include <stdint.h>

uint8_t p1(uint8_t b, uint8_t a) {
b = (b & ~0xc0) | (a & 0xc0);
return (b);
}

uint8_t p2(uint8_t b, uint8_t a) {
b = (b & ~0x40) | (a & 0x40);
b = (b & ~0x80) | (a & 0x80);
return (b);
}

define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
%0 = and i8 %b, 63 ; <i8> [#uses=1]
%1 = and i8 %a, -64 ; <i8> [#uses=1]
%2 = or i8 %1, %0 ; <i8> [#uses=1]
ret i8 %2
}

define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
%0 = and i8 %b, 63 ; <i8> [#uses=1]
%.masked = and i8 %a, 64 ; <i8> [#uses=1]
%1 = and i8 %a, -128 ; <i8> [#uses=1]
%2 = or i8 %1, %0 ; <i8> [#uses=1]
%3 = or i8 %2, %.masked ; <i8> [#uses=1]
ret i8 %3
}

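The two assignments in p2 update disjoint bit fields (masks 0x40 and 0x80), so they merge into a single insert with mask 0xc0, which is exactly p1. This exhaustive C check runs over all byte pairs. p1 and p2 are copied from above, and the checker name is hypothetical:

```c
#include <stdint.h>

uint8_t p1(uint8_t b, uint8_t a) {
    b = (b & ~0xc0) | (a & 0xc0);
    return (b);
}

uint8_t p2(uint8_t b, uint8_t a) {
    b = (b & ~0x40) | (a & 0x40);
    b = (b & ~0x80) | (a & 0x80);
    return (b);
}

/* Hypothetical checker: compares p1 and p2 on every (b, a) pair. */
int p1_p2_agree(void) {
    for (unsigned b = 0; b < 256; ++b)
        for (unsigned a = 0; a < 256; ++a)
            if (p1((uint8_t)b, (uint8_t)a) != p2((uint8_t)b, (uint8_t)a))
                return 0;
    return 1;
}
```
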
//===---------------------------------------------------------------------===//

IPSCCP does not currently propagate argument-dependent constants through
functions where it does not know all of the callers. This includes functions
with normal external linkage as well as templates, C99 inline functions, etc.
Specifically, it does nothing to:

define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
entry:
%0 = add nsw i32 %y, %z
%1 = mul i32 %0, %x
%2 = mul i32 %y, %z
%3 = add nsw i32 %1, %2
ret i32 %3
}

define i32 @test2() nounwind {
entry:
%0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
ret i32 %0
}

It would be interesting to extend IPSCCP to handle simple cases like this,
where all of the arguments to a call are constant. Because IPSCCP runs before
inlining, trivial templates and inline functions are not yet inlined. The
results for a function + set of constant arguments should be memoized in a
map.

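For the example above, the call folds to (2 + 4) * 1 + 2 * 4 = 14, so @test2 could become `ret i32 14`. The memoization idea can be sketched in C as a table keyed on (callee, constant arguments). This is only an illustration. The table layout and names are hypothetical, and the direct C call stands in for whatever constant evaluation IPSCCP would actually perform on the IR:

```c
#include <stddef.h>

/* @test from the IR, written in C. */
int test_fn(int x, int y, int z) { return (y + z) * x + y * z; }

typedef int (*fn3)(int, int, int);

/* Hypothetical memo table: (callee, constant args) -> folded result. */
struct memo_entry { fn3 fn; int args[3]; int result; };

static struct memo_entry memo[64];
static size_t memo_count;
static unsigned evaluations;   /* how many times a call was actually evaluated */

int fold_call(fn3 fn, int x, int y, int z) {
    for (size_t i = 0; i < memo_count; ++i) {
        const struct memo_entry *e = &memo[i];
        if (e->fn == fn && e->args[0] == x && e->args[1] == y && e->args[2] == z)
            return e->result;              /* cache hit: no re-evaluation */
    }
    int r = fn(x, y, z);                   /* stand-in for constant evaluation */
    ++evaluations;
    if (memo_count < sizeof memo / sizeof memo[0])
        memo[memo_count++] = (struct memo_entry){ fn, { x, y, z }, r };
    return r;
}

unsigned fold_evaluations(void) { return evaluations; }
```

A second call with the same constant arguments is answered from the table, which matters once many call sites pass the same constants to a template or inline function.
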
//===---------------------------------------------------------------------===//

The libcall constant folding stuff should be moved out of SimplifyLibcalls into
libanalysis' constantfolding logic. This would allow IPSCCP to handle simple
things like this:

static int foo(const char *X) { return strlen(X); }
int bar() { return foo("abcd"); }

//===---------------------------------------------------------------------===//

InstCombine should use SimplifyDemandedBits to remove the or instruction:

define i1 @test(i8 %x, i8 %y) {
%A = or i8 %x, 1
%B = icmp ugt i8 %A, 3
ret i1 %B
}

Currently instcombine calls SimplifyDemandedBits with either all bits or just
the sign bit, if the comparison is obviously a sign test. In this case, we only
need all but the bottom two bits from %A, and if we gave that mask to SDB it
would delete the or instruction for us.

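Here is a brute-force C check of the demanded-bits argument, with hypothetical names. For an 8-bit value, x >u 3 is true exactly when some bit above bit 1 is set, so forcing bit 0 on with the or cannot change the result:

```c
#include <stdint.h>

/* icmp ugt i8 (x | 1), 3 versus the same compare with the or removed. */
int ugt3_with_or(uint8_t x)    { return (uint8_t)(x | 1) > 3; }
int ugt3_without_or(uint8_t x) { return x > 3; }

/* Hypothetical checker: also confirms x > 3 is (x & ~3) != 0. */
int demanded_bits_ignore_or(void) {
    for (unsigned x = 0; x < 256; ++x)
        if (ugt3_with_or((uint8_t)x) != ugt3_without_or((uint8_t)x) ||
            ugt3_without_or((uint8_t)x) != ((x & ~3u) != 0))
            return 0;
    return 1;
}
```
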
//===---------------------------------------------------------------------===//

functionattrs doesn't know much about memcpy/memset. This function should be
marked readnone rather than readonly, since it only twiddles local memory, but
functionattrs doesn't handle memset/memcpy/memmove aggressively:

struct X { int *p; int *q; };
int foo() {
int i = 0, j = 1;
struct X x, y;
int **p;
y.p = &i;
x.q = &j;
p = __builtin_memcpy (&x, &y, sizeof (int *));
return **p;
}

//===---------------------------------------------------------------------===//

Missed instcombine transformation:

define i1 @a(i32 %x) nounwind readnone {
entry:
%cmp = icmp eq i32 %x, 30
%sub = add i32 %x, -30
%cmp2 = icmp ugt i32 %sub, 9
%or = or i1 %cmp, %cmp2
ret i1 %or
}

This should be optimized to a single compare. Testcase derived from gcc.

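Working through the ranges: (x - 30) >u 9 is false only for x in [30, 39], and x == 30 adds that one value back. The OR is therefore false only for x in [31, 39], which one compare can express as (x - 31) >u 8. This C sketch checks the two forms against each other. The names are hypothetical, and the check covers small values and wraparound edges rather than all 2^32 inputs:

```c
#include <stddef.h>
#include <stdint.h>

/* The IR as written: icmp eq x, 30  OR  icmp ugt (x - 30), 9. */
int two_compares(uint32_t x) {
    return x == 30u || (uint32_t)(x - 30u) > 9u;
}

/* One possible single compare: false only for x in [31, 39]. */
int one_compare(uint32_t x) {
    return (uint32_t)(x - 31u) > 8u;
}

/* Hypothetical checker: small values plus wraparound extremes. */
int range_compares_agree(void) {
    for (uint32_t x = 0; x <= 1000; ++x)
        if (two_compares(x) != one_compare(x))
            return 0;
    const uint32_t edges[] = { 0x7FFFFFFFu, 0x80000000u, 0xFFFFFFE1u, 0xFFFFFFFFu };
    for (size_t i = 0; i < sizeof edges / sizeof edges[0]; ++i)
        if (two_compares(edges[i]) != one_compare(edges[i]))
            return 0;
    return 1;
}
```
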
//===---------------------------------------------------------------------===//

Missed instcombine transformation:

void b();
void a(int x) { if (((1<<x)&8)==0) b(); }

The shift should be optimized out. Testcase derived from gcc.

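Since 1<<x has only bit x set, ((1<<x)&8)==0 reduces to x != 3 for any in-range shift amount. An out-of-range shift is undefined in C and poison in IR, so it can be ignored. This C check uses hypothetical names and is limited to 0..30 because 1<<31 overflows int:

```c
/* Original condition from the testcase. */
int shift_cond(int x) { return ((1 << x) & 8) == 0; }

/* Shift-free form: bit 3 of (1 << x) is clear unless x == 3. */
int shift_free_cond(int x) { return x != 3; }

/* Hypothetical checker over the shift amounts that are defined for int. */
int shift_forms_agree(void) {
    for (int x = 0; x <= 30; ++x)
        if (shift_cond(x) != shift_free_cond(x))
            return 0;
    return 1;
}
```
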
//===---------------------------------------------------------------------===//

Missed instcombine or reassociate transformation:

int a(int a, int b) { return (a==12)&(b>47)&(b<58); }

The sgt and slt should be combined into a single comparison. Testcase derived
from gcc.

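The usual technique here is an unsigned range check. (unsigned)b - 48 < 10 is true exactly for 48 <= b <= 57 (the ASCII digits '0'..'9'), because values below 48 wrap around to large unsigned numbers. Here is a C sketch with hypothetical names:

```c
#include <limits.h>

/* The testcase expression, with two signed compares on b. */
int two_range_compares(int a, int b) { return (a == 12) & (b > 47) & (b < 58); }

/* One unsigned compare: subtracting 48 maps [48, 57] onto [0, 9]. */
int one_range_compare(int a, int b) { return (a == 12) & ((unsigned)b - 48u < 10u); }

/* Hypothetical checker over a range of b plus the extremes. */
int range_forms_agree(void) {
    for (int b = -1000; b <= 1000; ++b)
        if (two_range_compares(12, b) != one_range_compare(12, b))
            return 0;
    return two_range_compares(12, INT_MIN) == one_range_compare(12, INT_MIN)
        && two_range_compares(12, INT_MAX) == one_range_compare(12, INT_MAX)
        && one_range_compare(11, 50) == 0;
}
```
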
//===---------------------------------------------------------------------===//

Missed instcombine transformation:

define i32 @a(i32 %x) nounwind readnone {
entry:
%shr = lshr i32 %x, 5 ; <i32> [#uses=1]
%xor = xor i32 %shr, 67108864 ; <i32> [#uses=1]
%sub = add i32 %xor, -67108864 ; <i32> [#uses=1]
ret i32 %sub
}

This function is equivalent to "ashr i32 %x, 5". Testcase derived from gcc.

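The equivalence holds because, after the lshr, bit 26 holds the original sign bit. The xor with 2^26 (67108864) followed by adding -2^26 is the standard idiom for sign-extending a 27-bit field, which turns the logical shift back into an arithmetic one. This C check works on the bit patterns. It uses hypothetical names and samples the inputs rather than testing all of them:

```c
#include <stdint.h>

/* The IR sequence on 32-bit patterns: lshr 5, xor 2^26, add -2^26. */
uint32_t lshr_xor_add(uint32_t x) {
    uint32_t shr = x >> 5;                        /* bit 26 = old sign bit */
    uint32_t flipped = shr ^ 0x04000000u;         /* xor i32 %shr, 67108864 */
    return (uint32_t)(flipped - 0x04000000u);     /* add i32 %xor, -67108864 */
}

/* Reference "ashr i32 %x, 5": shift, then copy the sign into bits 31..27. */
uint32_t ashr5(uint32_t x) {
    uint32_t shr = x >> 5;
    return (x & 0x80000000u) ? (shr | 0xF8000000u) : shr;
}

/* Hypothetical checker: edge cases plus a strided sample of inputs. */
int lshr_xor_add_is_ashr(void) {
    const uint32_t edges[] = { 0u, 1u, 31u, 32u, 0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof edges / sizeof edges[0]; ++i)
        if (lshr_xor_add(edges[i]) != ashr5(edges[i]))
            return 0;
    for (uint64_t v = 0; v <= UINT32_MAX; v += 65521)
        if (lshr_xor_add((uint32_t)v) != ashr5((uint32_t)v))
            return 0;
    return 1;
}
```
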
//===---------------------------------------------------------------------===//

isSafeToLoadUnconditionally should allow a GEP of a global/alloca with constant
indices within the bounds of the allocated object. Reduced example:

const int a[] = {3,6};
int b(int y) { int* x = y ? &a[0] : &a[1]; return *x; }

All the loads should be eliminated. Testcase derived from gcc.

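The sketch below shows the intended result at the source level. The array is renamed and the function names are hypothetical. Both &a[0] and &a[1] are in bounds, so both loads can run unconditionally and the pointer select becomes a select on the loaded values. Because the array is constant, the loads then fold away:

```c
/* Stand-in for "const int a[] = {3,6};" (renamed for clarity). */
static const int a_table[] = {3, 6};

/* The testcase: a select between two pointers, then one load. */
int load_through_select(int y) {
    const int *x = y ? &a_table[0] : &a_table[1];
    return *x;
}

/* Both addresses are in bounds, so both loads can be executed
   unconditionally and the select moves to the loaded values. */
int speculated_loads(int y) {
    int v0 = a_table[0];
    int v1 = a_table[1];
    return y ? v0 : v1;
}

/* The array is constant, so the loads fold to the initializers. */
int folded_loads(int y) {
    return y ? 3 : 6;
}
```
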
//===---------------------------------------------------------------------===//