Target Independent Opportunities:

//===---------------------------------------------------------------------===//

Dead argument elimination should be enhanced to handle cases when an argument is
dead to an externally visible function. Though the argument can't be removed
from the externally visible function, the caller doesn't need to pass it in.
For example in this testcase:

  void foo(int X) __attribute__((noinline));
  void foo(int X) { sideeffect(); }
  void bar(int A) { foo(A+1); }

We compile bar to:

define void @bar(i32 %A) nounwind ssp {
  %0 = add nsw i32 %A, 1                ; <i32> [#uses=1]
  tail call void @foo(i32 %0) nounwind noinline ssp
  ret void
}

The add is dead, we could pass in 'i32 undef' instead. This occurs for C++
templates etc, which usually have linkonce_odr/weak_odr linkage, not internal
linkage.

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs. Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent.

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math). Misc/mandel will like this. :) This isn't
safe in general, even on darwin. See the libm implementation of hypot for
examples (which special case when x/y are exactly zero to get signed zeros etc
right).

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
    movl Y, %eax
    shll $3, %eax
    orl X, %eax
    movl %eax, X
    ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack. But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
   x = 1ULL << i;

into:
 long long tmp = 1;
 for (i = ...; ++i, tmp+=tmp)
   x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

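For example, (loadi32 P) < 0 depends only on the byte holding the sign bit, so
the i32 load could shrink to an i8 load. A hand-written sketch in C (function
names made up; the offset 3 assumes a little-endian target):

int sign_test(int *P) { return *P < 0; }

int sign_test_shrunk(int *P) {
  const signed char *B = (const signed char *)P;
  return B[3] < 0;   /* loads one byte instead of four */
}
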
//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

First, the intrinsic needs to be extended to support integers, and second the
code generator needs to be enhanced to lower these to multiplication trees.

//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

This is blocked on not handling X*X*X -> powi(X, 3) (see note above). The issue
is that we end up getting t = 2*X  s = t*t  and don't turn this into 4*X*X,
which is the same number of multiplies and is canonical, because the 2*X has
multiple uses. Here's a simple example:

define i32 @test15(i32 %X1) {
  %B = mul i32 %X1, 47   ; X1*47
  %C = mul i32 %B, %B
  ret i32 %C
}


//===---------------------------------------------------------------------===//

Reassociate should handle the example in GCC PR16157:

extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;
void f () {  /* this can be optimized to four additions... */
  b4 = a4 + a3 + a2 + a1 + a0;
  b3 = a3 + a2 + a1 + a0;
  b2 = a2 + a1 + a0;
  b1 = a1 + a0;
}

This requires reassociating to forms of expressions that are already available,
something that reassoc doesn't think about yet.


//===---------------------------------------------------------------------===//

This function: (derived from GCC PR19988)
double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x + -0.1234 * y));
}

compiles to:
_foo:
    movapd %xmm1, %xmm2
    mulsd LCPI1_1(%rip), %xmm1
    mulsd LCPI1_0(%rip), %xmm2
    addsd %xmm0, %xmm1
    addsd %xmm0, %xmm2
    movapd %xmm1, %xmm0
    mulsd %xmm2, %xmm0
    ret

Reassociate should be able to turn it into:

double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x - 0.1234 * y));
}

Which allows the multiply by constant to be CSE'd, producing:

_foo:
    mulsd LCPI1_0(%rip), %xmm1
    movapd %xmm1, %xmm2
    addsd %xmm0, %xmm2
    subsd %xmm1, %xmm0
    mulsd %xmm2, %xmm0
    ret

This doesn't need -ffast-math support at all. This is particularly bad because
the llvm-gcc frontend is canonicalizing the latter into the former, but clang
doesn't have this problem.
197
Chris Lattner74cfb7d2006-03-11 20:20:40 +0000198//===---------------------------------------------------------------------===//
199
Chris Lattner82c78b22006-03-09 20:13:21 +0000200These two functions should generate the same code on big-endian systems:
201
202int g(int *j,int *l) { return memcmp(j,l,4); }
203int h(int *j, int *l) { return *j - *l; }
204
205this could be done in SelectionDAGISel.cpp, along with other special cases,
206for 1,2,4,8 bytes.
207
208//===---------------------------------------------------------------------===//
209
Chris Lattnerc04b4232006-03-22 07:33:46 +0000210It would be nice to revert this patch:
211http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html
212
213And teach the dag combiner enough to simplify the code expanded before
214legalize. It seems plausible that this knowledge would let it simplify other
215stuff too.
216
Chris Lattnere6cd96d2006-03-24 19:59:17 +0000217//===---------------------------------------------------------------------===//
218
For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size. It works but can be overly conservative as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns. Instead
of:

    movl 136(%esp), %eax
    cmpl $0, %eax
    je LBB16_2      # cond_next
LBB16_1:            # cond_true
    incl _foo
LBB16_2:            # cond_next

emit:
    movl _foo, %eax
    cmpl $1, %edi
    sbbl $-1, %eax
    movl %eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers. See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs. We could even make an intrinsic for this
if anyone cared enough about sincos.

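A sketch of the combine at the source level (hand-written; assumes the target's
libm provides the sincos entry point declared above):

  double a = sin(x);
  double b = cos(x);

would become:

  double a, b;
  sincos(x, &a, &b);
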
//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

define void @test(i32* %P) {
  %tmp = load i32* %P
  %tmp14 = or i32 %tmp, 3305111552
  %tmp15 = and i32 %tmp14, 3321888767
  store i32 %tmp15, i32* %P
  ret void
}

//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
        {
          /* Flip the target bit of each basis state */
          reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }

Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be just
so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias reg->node[i],
but this requires TBAA.

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
  unsigned t;
  t = v ^ ((v << 16) | (v >> 16));
  t &= ~0xff0000;
  v = (v << 24) | (v >> 8);
  return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

BITBOARD = unsigned long long
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

This is a form of idiom recognition for loops, the same thing that could be
useful for recognizing memset/memcpy.

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

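Roughly, read_16_le should become the equivalent of this hand-written sketch
(name made up; assumes <string.h> and cheap unaligned 16-bit loads on the
target), and read_16_be the same load plus a byte swap on little-endian
hardware:

unsigned short read_16_le_opt(const unsigned char *adr) {
  unsigned short r;
  memcpy(&r, adr, 2);   /* lowers to a single 16-bit load */
  return r;
}
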
//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned. Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match. See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

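For instance (a hand-worked example; assumes the scaled constant doesn't
overflow): in the all-unsigned case,

  (x / 10) < 5

can fold to

  x < 50

because x/10 < 5 holds exactly when x < 50.
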
//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level. We need a "loops to memcpy"
pass.

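That is, recognize copy loops of this shape (sketch; only valid when source and
destination don't overlap and the trip count is loop-invariant):

  for (i = 0; i < n; i++)
    dst[i] = src[i];

and rewrite them as:

  memcpy(dst, src, n * sizeof(dst[0]));
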
//===---------------------------------------------------------------------===//

[LOOP OPTIMIZATION]

SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization
opportunities in its double_array_divs_variable function: it needs loop
interchange, memory promotion (which LICM already does), vectorization and
variable trip count loop unrolling (since it has a constant trip count). ICC
apparently produces this very nice code with -ffast-math:

..B1.70:                # Preds ..B1.70 ..B1.69
    mulpd %xmm0, %xmm1          #108.2
    mulpd %xmm0, %xmm1          #108.2
    mulpd %xmm0, %xmm1          #108.2
    mulpd %xmm0, %xmm1          #108.2
    addl $8, %edx               #
    cmpl $131072, %edx          #108.2
    jb ..B1.70                  # Prob 99%  #108.2

It would be better to count down to zero, but this is a lot better than what we
do.

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
  U64 effective_addr2;
  U32 temp = *inst;
  int r1 = (temp >> 20) & 0xf;
  int b2 = (temp >> 16) & 0xf;
  effective_addr2 = temp & 0xfff;
  if (b2) effective_addr2 += regs[b2];
  b2 = (temp >> 12) & 0xf;
  if (b2) effective_addr2 += regs[b2];
  effective_addr2 &= regs[4];
  if ((effective_addr2 & 3) == 0)
    return 1;
  return 0;
}

Note that only the low 2 bits of effective_addr2 are used. On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags. On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

PHI Slicing could be extended to do this.

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has from TargetData. This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two near identical IV's (after promotion) on PPC/ARM:

LBB1_2:
    ldr r3, LCPI1_0
    ldr r3, [r3]
    strh r2, [r3]
    ldr r3, LCPI1_1
    ldr r3, [r3]
    strh r1, [r3]
    add r1, r1, #4
    add r2, r2, #1   <- [0,+,1]
    sub r0, r0, #1   <- [0,-,1]
    cmp r0, #0
    bne LBB1_2

LSR should reuse the "+" IV for the exit test.

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
  %tmp.1 = and i32 %a, 1                ; <i32> [#uses=1]
  %tmp.2 = icmp ne i32 %tmp.1, 0        ; <i1> [#uses=1]
  br i1 %tmp.2, label %then.0, label %else.0

then.0:         ; preds = %entry
  %tmp.5 = add i32 %a, -1               ; <i32> [#uses=1]
  %tmp.3 = call i32 @t4( i32 %tmp.5 )   ; <i32> [#uses=1]
  br label %return

else.0:         ; preds = %entry
  %tmp.7 = icmp ne i32 %a, 0            ; <i1> [#uses=1]
  br i1 %tmp.7, label %then.1, label %return

then.1:         ; preds = %else.0
  %tmp.11 = add i32 %a, -2              ; <i32> [#uses=1]
  %tmp.9 = call i32 @t4( i32 %tmp.11 )  ; <i32> [#uses=1]
  br label %return

return:         ; preds = %then.1, %else.0, %then.0
  %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
              [ %tmp.9, %then.1 ]
  ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative. "return foo() << 1" can be tail recursion eliminated.

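Because f(n) = 2*f(n-1) + 1 has the form a*f(n-1) + b, it can be computed
iteratively with a pair of accumulators. A hand-written sketch of the
transformed function (not what any current pass produces):

int pow2m1(int n) {
  int mul_acc = 1, add_acc = 0;   /* result == mul_acc * f(0) + add_acc */
  while (n != 0) {
    add_acc = 2 * add_acc + 1;    /* fold one level of "*2 + 1" in */
    mul_acc = 2 * mul_acc;
    n = n - 1;
  }
  return mul_acc * 0 + add_acc;   /* base case: f(0) == 0 */
}
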
//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
  %tmp = load i32* %x                   ; <i32> [#uses=0]
  %tmp.foo = call i32 @foo( i32* %x )   ; <i32> [#uses=1]
  ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
  %tmp3 = call i32 @foo( i32* %x )      ; <i32> [#uses=1]
  ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass. Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
    subl $28, %esp
    call "L1$pb"
"L1$pb":
    popl %eax
    cmpl $0, 32(%esp)
    je LBB1_2       # cond_true
LBB1_1:             # return
    # ...
    addl $28, %esp
    ret
LBB1_2:             # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block. It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it. This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86. If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

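A sketch of the idea (constants hand-picked for illustration; a real lowering
would search for a collision-free hash function): for a switch on {1, 100,
10000}, x % 7 happens to be collision-free, so dispatch becomes one hash, one
table load, and one compare. Empty slots hold a key that cannot hash to them,
so no separate "occupied" flag is needed:

  /* switch (x) { case 1: ... case 100: ... case 10000: ... default: ... } */
  static const int keys[7] = { 1, 1, 100, 1, 10000, 1, 1 };
  unsigned h = (unsigned)x % 7;    /* 1 -> 1, 100 -> 2, 10000 -> 4 */
  if (keys[h] == x) {
    /* jump through a 7-entry table indexed by h */
  } else {
    /* default */
  }
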
//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations. On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar. On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable. For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:
    pushl %esi
    movl 8(%esp), %eax
    movb _m_HotKey+3, %cl
    movb _m_HotKey+4, %dl
    movb _m_HotKey+2, %ch
    movw _m_HotKey, %si
    movw %si, (%eax)
    movb %ch, 2(%eax)
    movb %cl, 3(%eax)
    movb %dl, 4(%eax)
    popl %esi
    ret $4

GCC produces:

__Z9GetHotKeyv:
    movl _m_HotKey, %edx
    movl 4(%esp), %eax
    movl %edx, (%eax)
    movzwl _m_HotKey+4, %edx
    movw %dx, 4(%eax)
    ret $4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

  %struct.THotKey = type { i16, i8, i8, i8 }
define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
...
  %tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
  %tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
  %tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
  %tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,1,1,1,1,1,1,1};
  foo(input);
}

We currently compile this into a memcpy from a global array since the
initializer is fairly large and not memset'able. This is good, but the memcpy
gets lowered to load/stores in the code generator. This is also ok, except
that the codegen lowering for memcpy doesn't handle the case when the source
is a constant global. This gives us atrocious code like this:

    call "L1$pb"
"L1$pb":
    popl %eax
    movl _C.0.1444-"L1$pb"+32(%eax), %ecx
    movl %ecx, 40(%esp)
    movl _C.0.1444-"L1$pb"+20(%eax), %ecx
    movl %ecx, 28(%esp)
    movl _C.0.1444-"L1$pb"+36(%eax), %ecx
    movl %ecx, 44(%esp)
    movl _C.0.1444-"L1$pb"+44(%eax), %ecx
    movl %ecx, 52(%esp)
    movl _C.0.1444-"L1$pb"+40(%eax), %ecx
    movl %ecx, 48(%esp)
    movl _C.0.1444-"L1$pb"+12(%eax), %ecx
    movl %ecx, 20(%esp)
    movl _C.0.1444-"L1$pb"+4(%eax), %ecx
...

instead of:
    movl $1, 16(%esp)
    movl $0, 20(%esp)
    movl $1, 24(%esp)
    movl $0, 28(%esp)
    movl $1, 32(%esp)
    movl $0, 36(%esp)
    ...

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef". Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop. One trivial example is:

#include <stdio.h>
int main() {
  int nRet = 17;
  int nLoop;
  for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
    if ( nLoop & 1 )
      nRet += 2;
    else
      nRet -= 1;
  }
  return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size. The resultant code would then also be suitable for
exit value computation.

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc, x86,
etc. On X86, we miss a bunch of 'rotate by variable' cases because the rotate
matching code in dag combine doesn't look through truncates aggressively
enough. Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y){
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x){
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x){
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f2/f3/f4 right. On x86-32, a few of these
generate truly horrible code, instead of using shld and friends. On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness. PPC64 misses f, f5 and f6. CellSPU aborts in isel.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together. For
example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy. This can only
be done safely if "b" isn't modified between the strlen and memcpy of course.

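For instance, including the extra byte for the nul terminator makes the two
forms byte-for-byte equivalent (sketch; assumes "b" is unmodified in between
and the buffers don't overlap):

  memcpy(a, b, strlen(b) + 1);   /* writes the same bytes as strcpy(a, b) */
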
//===---------------------------------------------------------------------===//

We compile this program: (from GCC PR11680)
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

Into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%    0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%    0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2). */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper). Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

This is equivalent to the following, where 2863311531 is the multiplicative
inverse of 3, and 1431655766 is ((2^32)-1)/3+1:
void bar(unsigned n) {
  if (n * 2863311531U < 1431655766U)
    true();
}

The same transformation can work with an even modulo with the addition of a
rotate: rotate the result of the multiply to the right by the number of bits
which need to be zero for the condition to be true, and shrink the compare RHS
by the same amount. Unless the target supports rotates, though, that
transformation probably isn't worthwhile.

The transformation can also easily be made to work with non-zero equality
comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0".

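A sketch for an even modulus (constants hand-computed for this example):
testing n % 6 == 0 multiplies by the inverse of the odd part (3), then rotates
right by one bit for the factor of 2:

void bar(unsigned n) {
  unsigned q = n * 2863311531U;        /* multiplicative inverse of 3 */
  unsigned r = (q >> 1) | (q << 31);   /* rotate right by 1 */
  if (r < 715827883U)                  /* ((2^32)-1)/6+1 */
    true();
}
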
//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6     ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0. Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK    (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
  if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
    f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK)". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 3756:
int
pn (int n)
{
  return (n >= 0 ? 1 : -1);
}
Should combine to (n >> 31) | 1. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

void a(int variable)
{
  if (variable == 4 || variable == 6)
    bar();
}
This should optimize to "if ((variable | 2) == 6)". Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing. Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
  return (abs(x)) >= 0;
}
This should optimize to x != INT_MIN. (With -fwrapv.) Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
  a = (a << 10) | (a >> 22);
  if (a == 123)
    bar ();
}
void
minus_cst (unsigned int a)
{
  unsigned int tem;

  tem = 20 - a;
  if (tem == 5)
    bar ();
}
void
mask_gt (unsigned int a)
{
  /* This is equivalent to a > 15. */
  if ((a & ~7) > 8)
    bar ();
}
void
rshift_gt (unsigned int a)
{
  /* This is equivalent to a > 23. */
  if ((a >> 2) > 5)
    bar ();
}
All should simplify to a single comparison. All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1". Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)". Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a". Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane. Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c). Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -1 : -9;}
Should combine to (x | -9) ^ 8. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -9 : -1;}
Should combine to x | -9. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31". Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)". Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

  %tmp = icmp eq i32 %decl_context, 4
  %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
  %tmp1 = icmp eq i32 %decl_context_addr.0, 1
  %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp || decl_context == 1)

This allows recursive simplifications, tmp1 is used all over the place in
the function, e.g. by:

  %tmp23 = icmp eq i32 %decl_context_addr.1, 0          ; <i1> [#uses=1]
  %tmp24 = xor i1 %tmp1, true                           ; <i1> [#uses=1]
  %or.cond8 = and i1 %tmp23, %tmp24                     ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

[STORE SINKING]

Store sinking: This code:

void f (int n, int *cond, int *res) {
  int i;
  *res = 0;
  for (i = 0; i < n; i++)
    if (*cond)
      *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out. This gives us this code:

bb:             ; preds = %bb2, %entry
  %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
  %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
  %1 = load i32* %cond, align 4
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %bb2, label %bb1

bb1:            ; preds = %bb
  %3 = xor i32 %.rle, 234
  store i32 %3, i32* %res, align 4
  br label %bb2

bb2:            ; preds = %bb, %bb1
  %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
  %indvar.next = add i32 %i.05, 1
  %exitcond = icmp eq i32 %indvar.next, %n
  br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partial dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.

//===---------------------------------------------------------------------===//

[STORE SINKING]

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store. We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[LOAD PRE CRIT EDGE SPLITTING]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic. This could be handled by GVN with some crazy
symbolic phi translation. The code we get looks like (g is on the stack):

bb2:            ; preds = %bb1
..
  %9 = getelementptr %struct.f* %g, i32 0, i32 0
  store i32 %8, i32* %9, align 4
  br label %bb3

bb3:            ; preds = %bb1, %bb2, %bb
  %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
  %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
  %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
  %11 = load i32* %10, align 4

%11 is partially redundant, and in BB2 it should have the value %8.

GCC PR33344 and PR35287 are similar cases.


//===---------------------------------------------------------------------===//

[LOAD PRE]

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite, ones we don't get yet are (checked through loadpre25):

[CRIT EDGE BREAKING]
loadpre3.c predcom-4.c

[PRE OF READONLY CALL]
loadpre5.c

[TURN SELECT INTO BRANCH]
loadpre14.c loadpre15.c

actually a conditional increment: loadpre18.c loadpre19.c


//===---------------------------------------------------------------------===//

[SCALAR PRE]
There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the
GCC testsuite.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite. For example, we get the first example in predcom-1.c, but
miss the second one:

unsigned fib[1000];
unsigned avg[1000];

__attribute__ ((noinline))
void count_averages(int n) {
  int i;
  for (i = 1; i < n; i++)
    avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;
}

which compiles into two loads instead of one in the loop.

predcom-2.c is the same as predcom-1.c

predcom-3.c is very similar but needs loads feeding each other instead of
store->load.


//===---------------------------------------------------------------------===//

[ALIAS ANALYSIS]

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

We should do better analysis of posix_memalign. At the least it should
no-capture its pointer argument, at best, we should know that the out-value
result doesn't point to anything (like malloc). One example of this is in
SingleSource/Benchmarks/Misc/dt.c

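A sketch of the first piece as an IR declaration (hand-written; i32 stands in
for a 32-bit target's int and size_t):

  declare i32 @posix_memalign(i8** nocapture, i32, i32)
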
//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store. This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

struct X { int i; };
int foo (int x) {
  struct X a;
  struct X b;
  struct X *p;
  a.i = 1;
  b.i = 2;
  if (x)
    p = &a;
  else
    p = &b;
  return p->i;
}

//===---------------------------------------------------------------------===//

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
      opt -mem2reg -gvn -instcombine | llvm-dis
we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, loading from null is undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "") -> strlen(x)
strcspn("", x) -> 0
strspn("", x) -> 0
strspn(x, "") -> strlen(x)
strspn(x, "a") -> strchr(x, 'a')-x

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character. See PR3253 for some notes on
codegen.

456.hmmer apparently uses strcspn and strspn a lot. 471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
  else if (strchr ("<>", *intel_parser.op_string)

Those should be turned into a switch.

//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

  %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
  %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
  %strlen = call i32 @strlen(i8* %3072)    ; uses = 1
  %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
  call void @llvm.memcpy.i32(i8* %endptr,
    i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
  %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple reasons. First, in this:

  %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
  %strlen = call i32 @strlen(i8* %3072)

The strlen could be replaced with: %strlen = sub %3072, %3073, because the
strcpy call returns a pointer to the end of the string. Based on that, the
endptr GEP just becomes equal to 3073, which eliminates a strlen call and GEP.

Second, the memcpy+strlen strlen can be replaced with:

  %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

Because the destination was just copied into the specified memory buffer. This,
in turn, can be constant folded to "4".

In other code, it contains:

  %endptr6978 = bitcast i8* %endptr69 to i32*
  store i32 7107374, i32* %endptr6978, align 1
  %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded. Whatever is producing this should probably
be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit
  %682 = getelementptr i8** %argv, i32 6        ; <i8**> [#uses=2]
  %683 = load i8** %682, align 4                ; <i8*> [#uses=4]
  %684 = load i8* %683, align 1                 ; <i8> [#uses=1]
  %685 = icmp eq i8 %684, 0                     ; <i1> [#uses=1]
  br i1 %685, label %bb10, label %bb9

bb9:            ; preds = %bb8
  %686 = call i32 @strlen(i8* %683) nounwind readonly
  %687 = icmp ugt i32 %686, 254                 ; <i1> [#uses=1]
  br i1 %687, label %bb10, label %bb11

bb10:           ; preds = %bb9, %bb8
  %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

//===---------------------------------------------------------------------===//

I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
which looks like:
  %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0


bb62:           ; preds = %bb55, %bb53
  %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
  %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
  %172 = add i32 %171, -1                       ; <i32> [#uses=1]
  %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172

... no stores ...
  br i1 %or.cond, label %bb65, label %bb72

bb65:           ; preds = %bb62
  store i8 0, i8* %173, align 1
  br label %bb72

bb72:           ; preds = %bb65, %bb62
  %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
  %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1

Note that on the bb62->bb72 path, the %177 strlen call is partially
1440redundant with the %171 call. At worst, we could shove the %177 strlen call
1441up into the bb65 block moving it out of the bb62->bb72 path. However, note
1442that bb65 stores to the string, zeroing out the last byte. This means that on
1443that path the value of %177 is actually just %171-1. A sub is cheaper than a
1444strlen!
1445
1446This pattern repeats several times, basically doing:
1447
1448 A = strlen(P);
1449 P[A-1] = 0;
1450 B = strlen(P);
1451 where it is "obvious" that B = A-1.

//===---------------------------------------------------------------------===//

186.crafty contains this interesting pattern:

%77 = call i8* @strstr(i8* getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0),
                       i8* %30)
%phitmp648 = icmp eq i8* %77, getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0)
br i1 %phitmp648, label %bb70, label %bb76

bb70:           ; preds = %OptionMatch.exit91, %bb69
  %78 = call i32 @strlen(i8* %30) nounwind readonly align 1 ; <i32> [#uses=1]

This is basically:
  cststr = "abcdef";
  if (strstr(cststr, P) == cststr) {
    x = strlen(P);
    ...

The strstr call would be significantly cheaper written as:

  cststr = "abcdef";
  if (memcmp(cststr, P, strlen(P)) == 0)
    x = strlen(P);

This is memcmp+strlen instead of strstr. This also makes the strlen fully
redundant.

//===---------------------------------------------------------------------===//

186.crafty also contains this code:

%1906 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1907 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1906
%1908 = call i8* @strcpy(i8* %1907, i8* %1905) nounwind align 1
%1909 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1910 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1909

The last strlen is computable as %1908-@pgn_event, which means %1910 = %1908.
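
In C terms the pattern is an append (a sketch; the helper and s are invented):

#include <string.h>

static char pgn_event[32];

void append(const char *s) {          /* s plays the role of %1905 */
  size_t len = strlen(pgn_event);     /* %1906 */
  strcpy(pgn_event + len, s);         /* %1907, %1908 */
  size_t end = strlen(pgn_event);     /* %1909: computable as len + strlen(s),
                                         no need to rescan the buffer */
  (void)end;
}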

//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
  i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
  i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call i32 @printf(i8* ... @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

  memcpy(globalarray, "string");
  printf(..., globalarray);

Anyway, since printf just reads the memory, forward substituting the string
directly into the printf call eliminates the reads from globalarray. Since
this pattern occurs frequently in crafty (due to the "DisplayTime" and other
similar functions) there are many stores to "out". Once all the printfs stop
using "out", all that is left is the memcpy's into it. This should allow
globalopt to remove the "stored only" global.
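
The effect, in a hedged C sketch (names invented):

#include <stdio.h>
#include <string.h>

static char out[10];                 /* stands in for @out.4543 */

void before(void) {
  memcpy(out, "string", 7);
  printf("%s", out);                 /* reads the global */
}

void after(void) {
  memcpy(out, "string", 7);          /* dead once no readers remain */
  printf("%s", "string");            /* literal forward-substituted */
}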

//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
  %tmp0 = load i8* %p
  %tmp1 = ashr i8 %tmp0, 5
  %tmp2 = sext i8 %tmp1 to i32
  ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load with a shift.
For example, on x86 this currently gets this:

        movb    (%eax), %al
        sarb    $5, %al
        movsbl  %al, %eax

while it could get this:

        movsbl  (%eax), %eax
        sarl    $5, %eax

//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; }     // --> return false
int test2(int x) { return 2-x == x; }    // --> return x == 1 ?

Always foldable to false for odd constants: C-x == x requires 2*x == C, which
has no solution mod 2^32 when C is odd. For even C, 2*x == C has two solutions
mod 2^32 (C/2 and C/2 + 2^31), so a single equality isn't enough; what is the
right expansion?

//===---------------------------------------------------------------------===//

PR 3381: GEP to field of size 0 inside a struct could be turned into a GEP
for the next field in the struct (which is at the same address).

For example: a store of float into { {{}}, float } could be turned into a
store to the float directly.

//===---------------------------------------------------------------------===//

#include <math.h>
double foo(double a) { return sin(a); }

On x86-64 Linux, this currently compiles to:

foo:
        subq    $8, %rsp
        call    sin
        addq    $8, %rsp
        ret
vs:

foo:
        jmp     sin

//===---------------------------------------------------------------------===//

The arg promotion pass should make use of nocapture to make its alias analysis
stuff much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int int_char(char m) {if(m>7) return 0; return m;}
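
In source terms, the selectable form is simply (a sketch):

char char_int_sel(int m) { return m > 7 ? 0 : m; }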

//===---------------------------------------------------------------------===//

int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }

Generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %a, 128                            ; <i32> [#uses=1]
  %1 = icmp eq i32 %0, 0                          ; <i1> [#uses=1]
  %2 = or i32 %b, 128                             ; <i32> [#uses=1]
  %3 = and i32 %b, -129                           ; <i32> [#uses=1]
  %b_addr.0 = select i1 %1, i32 %3, i32 %2        ; <i32> [#uses=1]
  ret i32 %b_addr.0
}

However, it's functionally equivalent to:

  b = (b & ~0x80) | (a & 0x80);

Which generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %b, -129                           ; <i32> [#uses=1]
  %1 = and i32 %a, 128                            ; <i32> [#uses=1]
  %2 = or i32 %0, %1                              ; <i32> [#uses=1]
  ret i32 %2
}

This can be generalized for other forms:

  b = (b & ~0x80) | (a & 0x40) << 1;

//===---------------------------------------------------------------------===//

These two functions produce different code. They shouldn't:

#include <stdint.h>

uint8_t p1(uint8_t b, uint8_t a) {
  b = (b & ~0xc0) | (a & 0xc0);
  return (b);
}

uint8_t p2(uint8_t b, uint8_t a) {
  b = (b & ~0x40) | (a & 0x40);
  b = (b & ~0x80) | (a & 0x80);
  return (b);
}

define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63                              ; <i8> [#uses=1]
  %1 = and i8 %a, -64                             ; <i8> [#uses=1]
  %2 = or i8 %1, %0                               ; <i8> [#uses=1]
  ret i8 %2
}

define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63                              ; <i8> [#uses=1]
  %.masked = and i8 %a, 64                        ; <i8> [#uses=1]
  %1 = and i8 %a, -128                            ; <i8> [#uses=1]
  %2 = or i8 %1, %0                               ; <i8> [#uses=1]
  %3 = or i8 %2, %.masked                         ; <i8> [#uses=1]
  ret i8 %3
}

//===---------------------------------------------------------------------===//

IPSCCP does not currently propagate argument-dependent constants through
functions where it cannot see all of the callers. This includes functions
with normal external linkage as well as templates, C99 inline functions etc.
Specifically, it does nothing to:

define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
entry:
  %0 = add nsw i32 %y, %z
  %1 = mul i32 %0, %x
  %2 = mul i32 %y, %z
  %3 = add nsw i32 %1, %2
  ret i32 %3
}

define i32 @test2() nounwind {
entry:
  %0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
  ret i32 %0
}

It would be interesting to extend IPSCCP to handle simple cases like this,
where all of the arguments to a call are constant. Because IPSCCP runs before
inlining, trivial templates and inline functions are not yet inlined. The
results for a function + set of constant arguments should be memoized in a
map. (Here, @test2 would then fold to "ret i32 14".)

//===---------------------------------------------------------------------===//

The libcall constant folding stuff should be moved out of SimplifyLibcalls into
libanalysis' constantfolding logic. This would allow IPSCCP to be able to
handle simple things like this:

static int foo(const char *X) { return strlen(X); }
int bar() { return foo("abcd"); }

//===---------------------------------------------------------------------===//

InstCombine should use SimplifyDemandedBits to remove the or instruction:

define i1 @test(i8 %x, i8 %y) {
  %A = or i8 %x, 1
  %B = icmp ugt i8 %A, 3
  ret i1 %B
}

Currently instcombine calls SimplifyDemandedBits with either all bits or just
the sign bit, if the comparison is obviously a sign test. In this case, we only
need all but the bottom two bits from %A, and if we gave that mask to SDB it
would delete the or instruction for us.
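
In C terms (a sketch): the or only touches bit 0, which an unsigned comparison
against 3 never demands, so the two functions below agree for all inputs:

#include <stdint.h>

int test(uint8_t x) { return (uint8_t)(x | 1) > 3; }
int want(uint8_t x) { return x > 3; }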

//===---------------------------------------------------------------------===//

functionattrs doesn't know much about memcpy/memset. This function should be
marked readnone rather than readonly, since it only twiddles local memory, but
functionattrs doesn't handle memset/memcpy/memmove aggressively:

struct X { int *p; int *q; };
int foo() {
  int i = 0, j = 1;
  struct X x, y;
  int **p;
  y.p = &i;
  x.q = &j;
  p = __builtin_memcpy (&x, &y, sizeof (int *));
  return **p;
}

//===---------------------------------------------------------------------===//

Missed instcombine transformation:
define i1 @a(i32 %x) nounwind readnone {
entry:
  %cmp = icmp eq i32 %x, 30
  %sub = add i32 %x, -30
  %cmp2 = icmp ugt i32 %sub, 9
  %or = or i1 %cmp, %cmp2
  ret i1 %or
}
This should be optimized to a single compare. Testcase derived from gcc.
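
The or is true for every x except 31..39, so a single-compare form is
(a sketch; unsigned arithmetic used to avoid signed overflow):

int a(int x) { return ((unsigned)x - 31u) > 8u; }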

//===---------------------------------------------------------------------===//

Missed instcombine transformation:
void b();
void a(int x) { if (((1<<x)&8)==0) b(); }

The shift should be optimized out. Testcase derived from gcc.
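
Since (1<<x)&8 is nonzero exactly when x == 3 (for in-range shift amounts),
the expected result is (a sketch):

void b(void);
void a(int x) { if (x != 3) b(); }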

//===---------------------------------------------------------------------===//

Missed instcombine or reassociate transformation:
int a(int a, int b) { return (a==12)&(b>47)&(b<58); }

The sgt and slt should be combined into a single comparison. Testcase derived
from gcc.
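
The b range test is "b in [48,57]", which folds to one unsigned compare
(a sketch; unsigned arithmetic used to avoid signed overflow):

int a(int a, int b) { return (a == 12) & (((unsigned)b - 48u) < 10u); }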

//===---------------------------------------------------------------------===//

Missed instcombine transformation:
define i32 @a(i32 %x) nounwind readnone {
entry:
  %shr = lshr i32 %x, 5                           ; <i32> [#uses=1]
  %xor = xor i32 %shr, 67108864                   ; <i32> [#uses=1]
  %sub = add i32 %xor, -67108864                  ; <i32> [#uses=1]
  ret i32 %sub
}

This function is equivalent to "ashr i32 %x, 5". Testcase derived from gcc.
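
This is the (v ^ m) - m sign-extension idiom with m = 1<<26, applied to the
27-bit value produced by the lshr. A C sketch of the equivalence (assuming
two's complement and arithmetic >> on signed int):

#include <stdint.h>

int32_t a(int32_t x) {
  uint32_t v = (uint32_t)x >> 5;                   /* %shr */
  return (int32_t)((v ^ 0x4000000u) - 0x4000000u); /* %xor, %sub */
}
int32_t want(int32_t x) { return x >> 5; }         /* same for all x */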

//===---------------------------------------------------------------------===//