Target Independent Opportunities:

//===---------------------------------------------------------------------===//

We should recognize idioms for add-with-carry and turn it into the appropriate
intrinsics.  This example:

unsigned add32carry(unsigned sum, unsigned x) {
  unsigned z = sum + x;
  if (sum + x < x)
    z++;
  return z;
}

Compiles to (clang t.c -S -o - -O3 -fomit-frame-pointer -m64 -mkernel):

_add32carry:                            ## @add32carry
        addl    %esi, %edi
        cmpl    %esi, %edi
        sbbl    %eax, %eax
        andl    $1, %eax
        addl    %edi, %eax
        ret

with clang, but to:

_add32carry:
        leal    (%rsi,%rdi), %eax
        cmpl    %esi, %eax
        adcl    $0, %eax
        ret

with gcc.
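
A sketch of the IR we would presumably want to form, using the existing
unsigned-overflow intrinsic (assuming it is the right vehicle for the carry
bit):

declare { i32, i1 } @llvm.uadd.with.overflow.i32(i32, i32)

define i32 @add32carry(i32 %sum, i32 %x) nounwind readnone {
  %uadd = call { i32, i1 } @llvm.uadd.with.overflow.i32(i32 %sum, i32 %x)
  %z = extractvalue { i32, i1 } %uadd, 0
  %carry = extractvalue { i32, i1 } %uadd, 1
  %c = zext i1 %carry to i32
  %r = add i32 %z, %c
  ret i32 %r
}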

//===---------------------------------------------------------------------===//

Dead argument elimination should be enhanced to handle cases when an argument
is dead to an externally visible function.  Though the argument can't be
removed from the externally visible function, the caller doesn't need to pass
it in.  For example, in this testcase:

  void foo(int X) __attribute__((noinline));
  void foo(int X) { sideeffect(); }
  void bar(int A) { foo(A+1); }

We compile bar to:

define void @bar(i32 %A) nounwind ssp {
  %0 = add nsw i32 %A, 1                          ; <i32> [#uses=1]
  tail call void @foo(i32 %0) nounwind noinline ssp
  ret void
}

The add is dead; we could pass in 'i32 undef' instead.  This occurs for C++
templates etc, which usually have linkonce_odr/weak_odr linkage, not internal
linkage.
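
With that enhancement, the caller would compile to something like:

define void @bar(i32 %A) nounwind ssp {
  tail call void @foo(i32 undef) nounwind noinline ssp
  ret void
}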

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call'
instructions so that the .td files don't list all the call-clobbered registers
as implicit defs.  Instead, these should be added by the code generator (e.g.
on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call
   instructions for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber
   sets of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent.

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)  This
isn't safe in general, even on darwin.  See the libm implementation of hypot
for examples (which special case when x/y are exactly zero to get signed zeros
etc right).

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack.  But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
  x = 1ULL << i;
}

into:

long long tmp = 1;
for (i = ...; ++i, tmp += tmp)
  x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

First, the intrinsic needs to be extended to support integers, and second the
code generator needs to be enhanced to lower these to multiplication trees.
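
For reference, the existing floating-point form of the intrinsic handles the
analogous FP case like this (the integer variant proposed above does not
exist yet):

declare double @llvm.powi.f64(double, i32)

define double @factorial_fp(double %X) {
  %r = call double @llvm.powi.f64(double %X, i32 8)
  ret double %r
}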

//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

This is blocked on not handling X*X*X -> powi(X, 3) (see note above).  The
issue is that we end up getting t = 2*X, s = t*t, and don't turn this into
4*X*X, which is the same number of multiplies and is canonical, because the
2*X has multiple uses.  Here's a simple example:

define i32 @test15(i32 %X1) {
  %B = mul i32 %X1, 47   ; X1*47
  %C = mul i32 %B, %B
  ret i32 %C
}

//===---------------------------------------------------------------------===//

Reassociate should handle the example in GCC PR16157:

extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;
void f () {  /* this can be optimized to four additions... */
  b4 = a4 + a3 + a2 + a1 + a0;
  b3 = a3 + a2 + a1 + a0;
  b2 = a2 + a1 + a0;
  b1 = a1 + a0;
}

This requires reassociating to forms of expressions that are already available,
something that reassoc doesn't think about yet.
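
That is, reusing each partial sum, something like:

void f () {
  b1 = a1 + a0;
  b2 = b1 + a2;
  b3 = b2 + a3;
  b4 = b3 + a4;
}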

//===---------------------------------------------------------------------===//

This function: (derived from GCC PR19988)
double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x + -0.1234 * y));
}

compiles to:
_foo:
        movapd  %xmm1, %xmm2
        mulsd   LCPI1_1(%rip), %xmm1
        mulsd   LCPI1_0(%rip), %xmm2
        addsd   %xmm0, %xmm1
        addsd   %xmm0, %xmm2
        movapd  %xmm1, %xmm0
        mulsd   %xmm2, %xmm0
        ret

Reassociate should be able to turn it into:

double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x - 0.1234 * y));
}

which allows the multiply by constant to be CSE'd, producing:

_foo:
        mulsd   LCPI1_0(%rip), %xmm1
        movapd  %xmm1, %xmm2
        addsd   %xmm0, %xmm2
        subsd   %xmm1, %xmm0
        mulsd   %xmm2, %xmm0
        ret

This doesn't need -ffast-math support at all.  This is particularly bad
because the llvm-gcc frontend is canonicalizing the latter into the former,
but clang doesn't have this problem.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) { return memcmp(j,l,4); }
int h(int *j, int *l) { return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is
equal to the type size.  It works but can be overly conservative as the
alignment of specific vector types is target dependent.

//===---------------------------------------------------------------------===//
256
Dan Gohman1f3be1a2009-05-11 18:51:16 +0000257We should produce an unaligned load from code like this:
Chris Lattnereaa7c062006-04-01 04:08:29 +0000258
259v4sf example(float *P) {
260 return (v4sf){P[0], P[1], P[2], P[3] };
261}
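
i.e. an IR sketch of the desired result (assuming v4sf is a plain
<4 x float> vector type):

define <4 x float> @example(float* %P) {
  %p = bitcast float* %P to <4 x float>*
  %v = load <4 x float>* %p, align 4
  ret <4 x float> %v
}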

//===---------------------------------------------------------------------===//

Add support for conditional increments and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2 #cond_next
LBB16_1: #cond_true
        incl _foo
LBB16_2: #cond_next

emit:
        movl _foo, %eax
        cmpl $1, %edi
        sbbl $-1, %eax
        movl %eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs.  We could even make an intrinsic for this
if anyone cared enough about sincos.

//===---------------------------------------------------------------------===//
Chris Lattnerf00f68a2006-05-19 21:01:38 +0000297
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000298quantum_sigma_x in 462.libquantum contains the following loop:
299
300 for(i=0; i<reg->size; i++)
301 {
302 /* Flip the target bit of each basis state */
303 reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
304 }
305
306Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be just
307so cool to turn it into something like:
308
Chris Lattnerb33a42a2006-09-18 04:54:35 +0000309 long long Res = ((MAX_UNSIGNED) 1 << target);
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000310 if (target < 32) {
311 for(i=0; i<reg->size; i++)
Chris Lattnerb33a42a2006-09-18 04:54:35 +0000312 reg->node[i].state ^= Res & 0xFFFFFFFFULL;
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000313 } else {
314 for(i=0; i<reg->size; i++)
Chris Lattnerb33a42a2006-09-18 04:54:35 +0000315 reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL
Chris Lattner7ed96ab2006-09-16 23:57:51 +0000316 }
317
318... which would only do one 32-bit XOR per loop iteration instead of two.
319
320It would also be nice to recognize the reg->size doesn't alias reg->node[i], but
Chris Lattner9c6a0dc2009-11-26 01:51:18 +0000321this requires TBAA.
Chris Lattnerfaa6adf2009-09-21 06:04:07 +0000322
323//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
  unsigned t;
  t = v ^ ((v << 16) | (v >> 16));
  t &= ~0xff0000;
  v = (v << 24) | (v >> 8);
  return v ^ (t >> 8);
}

Neither is this (very standard idiom):

int f(int n)
{
  return (((n) << 24) | (((n) & 0xff00) << 8)
       | (((n) >> 8) & 0xff00) | ((n) >> 24));
}
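
Both should reduce to the llvm.bswap.i32 intrinsic, i.e. the moral equivalent
of (GCC/Clang builtin shown for illustration only):

unsigned bswap32(unsigned n) {
  return __builtin_bswap32(n);
}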

//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

BITBOARD = unsigned long long
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

This is a form of idiom recognition for loops, the same thing that could be
useful for recognizing memset/memcpy.

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}
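
For example, on a little-endian target read_16_le could become something like
(IR sketch; the ABI zero-extension of the i16 return is omitted):

define i16 @read_16_le(i8* %adr) {
  %p = bitcast i8* %adr to i16*
  %v = load i16* %p, align 1
  ret i16 %v
}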

//===---------------------------------------------------------------------===//

instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level.  We need a "loops to memcpy"
pass.

//===---------------------------------------------------------------------===//

[LOOP OPTIMIZATION]

SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization
opportunities in its double_array_divs_variable function: it needs loop
interchange, memory promotion (which LICM already does), vectorization and
variable trip count loop unrolling (since it has a constant trip count). ICC
apparently produces this very nice code with -ffast-math:

..B1.70:                        # Preds ..B1.70 ..B1.69
        mulpd     %xmm0, %xmm1                            #108.2
        mulpd     %xmm0, %xmm1                            #108.2
        mulpd     %xmm0, %xmm1                            #108.2
        mulpd     %xmm0, %xmm1                            #108.2
        addl      $8, %edx                                #
        cmpl      $131072, %edx                           #108.2
        jb        ..B1.70       # Prob 99%                #108.2

It would be better to count down to zero, but this is a lot better than what
we do.

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
  U64 effective_addr2;
  U32 temp = *inst;
  int r1 = (temp >> 20) & 0xf;
  int b2 = (temp >> 16) & 0xf;
  effective_addr2 = temp & 0xfff;
  if (b2) effective_addr2 += regs[b2];
  b2 = (temp >> 12) & 0xf;
  if (b2) effective_addr2 += regs[b2];
  effective_addr2 &= regs[4];
  if ((effective_addr2 & 3) == 0)
    return 1;
  return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

PHI Slicing could be extended to do this.

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has from TargetData.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two near-identical IVs (after promotion) on PPC/ARM:

LBB1_2:
        ldr r3, LCPI1_0
        ldr r3, [r3]
        strh r2, [r3]
        ldr r3, LCPI1_1
        ldr r3, [r3]
        strh r1, [r3]
        add r1, r1, #4
        add r2, r2, #1   <- [0,+,1]
        sub r0, r0, #1   <- [0,-,1]
        cmp r0, #0
        bne LBB1_2

LSR should reuse the "+" IV for the exit test.

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
        %tmp.1 = and i32 %a, 1          ; <i32> [#uses=1]
        %tmp.2 = icmp ne i32 %tmp.1, 0          ; <i1> [#uses=1]
        br i1 %tmp.2, label %then.0, label %else.0

then.0:         ; preds = %entry
        %tmp.5 = add i32 %a, -1         ; <i32> [#uses=1]
        %tmp.3 = call i32 @t4( i32 %tmp.5 )             ; <i32> [#uses=1]
        br label %return

else.0:         ; preds = %entry
        %tmp.7 = icmp ne i32 %a, 0              ; <i1> [#uses=1]
        br i1 %tmp.7, label %then.1, label %return

then.1:         ; preds = %else.0
        %tmp.11 = add i32 %a, -2                ; <i32> [#uses=1]
        %tmp.9 = call i32 @t4( i32 %tmp.11 )            ; <i32> [#uses=1]
        br label %return

return:         ; preds = %then.1, %else.0, %then.0
        %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                            [ %tmp.9, %then.1 ]
        ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative.  "return foo() << 1" can be tail recursion eliminated.
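
One way TRE could handle pow2m1 is with an accumulator; a hand-derived sketch
of the iterative form (not what any current pass produces):

int pow2m1(int n) {
  int acc = 0;
  for (; n != 0; --n)
    acc = 2 * acc + 1;   /* applies 2*f(n-1) + 1 iteratively */
  return acc;
}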

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
        %tmp = load i32* %x             ; <i32> [#uses=0]
        %tmp.foo = call i32 @foo( i32* %x )             ; <i32> [#uses=1]
        ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
        %tmp3 = call i32 @foo( i32* %x )                ; <i32> [#uses=1]
        ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
        subl    $28, %esp
        call    "L1$pb"
"L1$pb":
        popl    %eax
        cmpl    $0, 32(%esp)
        je      LBB1_2  # cond_true
LBB1_1: # return
        # ...
        addl    $28, %esp
        ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations.  On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar.  On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable.  For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:
        pushl %esi
        movl 8(%esp), %eax
        movb _m_HotKey+3, %cl
        movb _m_HotKey+4, %dl
        movb _m_HotKey+2, %ch
        movw _m_HotKey, %si
        movw %si, (%eax)
        movb %ch, 2(%eax)
        movb %cl, 3(%eax)
        movb %dl, 4(%eax)
        popl %esi
        ret $4

GCC produces:

__Z9GetHotKeyv:
        movl _m_HotKey, %edx
        movl 4(%esp), %eax
        movl %edx, (%eax)
        movzwl _m_HotKey+4, %edx
        movw %dx, 4(%eax)
        ret $4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

  %struct.THotKey = type { i16, i8, i8, i8 }
define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
...
  %tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
  %tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
  %tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
  %tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,1,1,1,1,1,1,1};
  foo(input);
}

We currently compile this into a memcpy from a global array since the
initializer is fairly large and not memset'able.  This is good, but the memcpy
gets lowered to load/stores in the code generator.  This is also ok, except
that the codegen lowering for memcpy doesn't handle the case when the source
is a constant global.  This gives us atrocious code like this:

        call "L1$pb"
"L1$pb":
        popl %eax
        movl _C.0.1444-"L1$pb"+32(%eax), %ecx
        movl %ecx, 40(%esp)
        movl _C.0.1444-"L1$pb"+20(%eax), %ecx
        movl %ecx, 28(%esp)
        movl _C.0.1444-"L1$pb"+36(%eax), %ecx
        movl %ecx, 44(%esp)
        movl _C.0.1444-"L1$pb"+44(%eax), %ecx
        movl %ecx, 52(%esp)
        movl _C.0.1444-"L1$pb"+40(%eax), %ecx
        movl %ecx, 48(%esp)
        movl _C.0.1444-"L1$pb"+12(%eax), %ecx
        movl %ecx, 20(%esp)
        movl _C.0.1444-"L1$pb"+4(%eax), %ecx
...

instead of:
        movl $1, 16(%esp)
        movl $0, 20(%esp)
        movl $1, 24(%esp)
        movl $0, 28(%esp)
        movl $1, 32(%esp)
        movl $0, 36(%esp)
        ...

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef".  Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop.  One trivial example is:

#include <stdio.h>
int main() {
  int nRet = 17;
  int nLoop;
  for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
    if ( nLoop & 1 )
      nRet += 2;
    else
      nRet -= 1;
  }
  return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size.  The resultant code would then also be suitable for
exit value computation.
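
Roughly, after unrolling by 2 and folding the parity test, the loop would
become:

  for ( nLoop = 0; nLoop < 1000; nLoop += 2 ) {
    nRet -= 1;   /* even iteration */
    nRet += 2;   /* odd iteration */
  }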

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc,
x86, etc.  On X86, we miss a bunch of 'rotate by variable' cases because the
rotate matching code in dag combine doesn't look through truncates
aggressively enough.  Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y){
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x){
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x){
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f2/f3/f4 right.  On x86-32, a few of these
generate truly horrible code, instead of using shld and friends.  On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness.  PPC64 misses f, f5 and f6.  CellSPU aborts in isel.

//===---------------------------------------------------------------------===//

This (and similar related idioms):

unsigned int foo(unsigned char i) {
  return i | (i<<8) | (i<<16) | (i<<24);
}

compiles into:

define i32 @foo(i8 zeroext %i) nounwind readnone ssp noredzone {
entry:
  %conv = zext i8 %i to i32
  %shl = shl i32 %conv, 8
  %shl5 = shl i32 %conv, 16
  %shl9 = shl i32 %conv, 24
  %or = or i32 %shl9, %conv
  %or6 = or i32 %or, %shl5
  %or10 = or i32 %or6, %shl
  ret i32 %or10
}

it would be better as:

unsigned int bar(unsigned char i) {
  unsigned int j=i | (i << 8);
  return j | (j<<16);
}

aka:

define i32 @bar(i8 zeroext %i) nounwind readnone ssp noredzone {
entry:
  %conv = zext i8 %i to i32
  %shl = shl i32 %conv, 8
  %or = or i32 %shl, %conv
  %shl5 = shl i32 %or, 16
  %or6 = or i32 %shl5, %or
  ret i32 %or6
}

or even i*0x01010101, depending on the speed of the multiplier.  The best way
to handle this is to canonicalize it to a multiply in IR and have codegen
handle lowering multiplies to shifts on cpus where shifts are faster.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together.  For
example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy.  This can only
be done safely if "b" isn't modified between the strlen and memcpy of course.

//===---------------------------------------------------------------------===//

We compile this program (from GCC PR11680):
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%    0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%    0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2).  */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper).  Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

This is equivalent to the following, where 2863311531 is the multiplicative
inverse of 3, and 1431655766 is ((2^32)-1)/3+1:
void bar(unsigned n) {
  if (n * 2863311531U < 1431655766U)
    true();
}

The same transformation can work with an even modulo with the addition of a
rotate: rotate the result of the multiply to the right by the number of bits
which need to be zero for the condition to be true, and shrink the compare
RHS by the same amount.  Unless the target supports rotates, though, that
transformation probably isn't worthwhile.
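
For example, "n % 6 == 0" might become (a sketch following the same recipe;
constants hand-derived, not taken from an implementation):

void bar6(unsigned n) {
  unsigned q = n * 2863311531U;      /* multiply by inverse of 3 */
  q = (q >> 1) | (q << 31);          /* rotate right by 1 */
  if (q < 715827883U)                /* ((2^32)-1)/6+1 */
    true();
}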

The transformation can also easily be made to work with non-zero equality
comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0".

//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6       ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0.  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
  if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
    f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing.  Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
  return (abs(x)) >= 0;
}
This should optimize to x == INT_MIN.  (With -fwrapv.)  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
  a = (a << 10) | (a >> 22);
  if (a == 123)
    bar ();
}
void
minus_cst (unsigned int a)
{
  unsigned int tem;

  tem = 20 - a;
  if (tem == 5)
    bar ();
}
void
mask_gt (unsigned int a)
{
  /* This is equivalent to a > 15.  */
  if ((a & ~7) > 8)
    bar ();
}
void
rshift_gt (unsigned int a)
{
  /* This is equivalent to a > 23.  */
  if ((a >> 2) > 5)
    bar ();
}
All should simplify to a single comparison.  All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)".  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a".  Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane.  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c).  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

        %tmp = icmp eq i32 %decl_context, 4
        %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
        %tmp1 = icmp eq i32 %decl_context_addr.0, 1
        %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
        (!tmp || decl_context == 1)

This allows recursive simplifications; tmp1 is used all over the place in
the function, e.g. by:

        %tmp23 = icmp eq i32 %decl_context_addr.1, 0    ; <i1> [#uses=1]
        %tmp24 = xor i1 %tmp1, true                     ; <i1> [#uses=1]
        %or.cond8 = and i1 %tmp23, %tmp24               ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

[STORE SINKING]

Store sinking: This code:

void f (int n, int *cond, int *res) {
  int i;
  *res = 0;
  for (i = 0; i < n; i++)
    if (*cond)
      *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out.  This gives us this code:

bb:             ; preds = %bb2, %entry
        %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
        %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
        %1 = load i32* %cond, align 4
        %2 = icmp eq i32 %1, 0
        br i1 %2, label %bb2, label %bb1

bb1:            ; preds = %bb
        %3 = xor i32 %.rle, 234
        store i32 %3, i32* %res, align 4
        br label %bb2

bb2:            ; preds = %bb, %bb1
        %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
        %indvar.next = add i32 %i.05, 1
        %exitcond = icmp eq i32 %indvar.next, %n
        br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.
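
At the source level, the sunk form would be roughly (valid assuming cond and
res don't alias, which alias analysis would have to establish):

void f (int n, int *cond, int *res) {
  int i, tmp = 0;
  for (i = 0; i < n; i++)
    if (*cond)
      tmp ^= 234;
  *res = tmp;
}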

Here's another partial dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.
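
i.e. something like:

int test (int a, int b, int c, int g) {
  int d, e, t = b * c;
  if (a)
    d = t;
  else
    d = b - c;
  e = t + g;
  return d + e;
}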

//===---------------------------------------------------------------------===//

[STORE SINKING]

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
}
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store.  We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[LOAD PRE CRIT EDGE SPLITTING]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic.  This could be handled by GVN with some crazy
symbolic phi translation.  The code we get looks like (g is on the stack):

bb2:            ; preds = %bb1
..
        %9 = getelementptr %struct.f* %g, i32 0, i32 0
        store i32 %8, i32* %9, align 4
        br label %bb3

bb3:            ; preds = %bb1, %bb2, %bb
        %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
        %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
        %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
        %11 = load i32* %10, align 4

%11 is partially redundant, and in BB2 it should have the value %8.

GCC PR33344 and PR35287 are similar cases.

//===---------------------------------------------------------------------===//

[LOAD PRE]

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite, ones we don't get yet are (checked through loadpre25):

[CRIT EDGE BREAKING]
loadpre3.c predcom-4.c

[PRE OF READONLY CALL]
loadpre5.c

[TURN SELECT INTO BRANCH]
loadpre14.c loadpre15.c

actually a conditional increment: loadpre18.c loadpre19.c

//===---------------------------------------------------------------------===//

[LOAD PRE / STORE SINKING / SPEC HACK]

This is a chunk of code from 456.hmmer:

int f(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int *dpp,
      int *tpdm, int xmb, int *bp, int *ms) {
  int k, sc;
  for (k = 1; k <= M; k++) {
    mc[k] = mpp[k-1] + tpmm[k-1];
    if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
    mc[k] += ms[k];
  }
}

It is very profitable for this benchmark to turn the conditional stores to
mc[k] into a conditional move (select instr in IR) and allow the final store
to do the store.  See GCC PR27313 for more details.  Note that this is valid
to xform even with the new C++ memory model, since mc[k] is previously loaded
and later stored.

//===---------------------------------------------------------------------===//

[SCALAR PRE]
There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the
GCC testsuite.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in
the GCC testsuite.  For example, we get the first example in predcom-1.c, but
miss the second one:

unsigned fib[1000];
unsigned avg[1000];

__attribute__ ((noinline))
void count_averages(int n) {
  int i;
  for (i = 1; i < n; i++)
    avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;
}

which compiles into two loads instead of one in the loop.

predcom-2.c is the same as predcom-1.c

predcom-3.c is very similar but needs loads feeding each other instead of
store->load.

//===---------------------------------------------------------------------===//

[ALIAS ANALYSIS]

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

We should do better analysis of posix_memalign.  At the least it should mark
its pointer argument no-capture; at best, we should know that the out-value
result doesn't point to anything (like malloc).  One example of this is in
SingleSource/Benchmarks/Misc/dt.c
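
i.e. the declaration could be annotated roughly like this (IR sketch; integer
widths assume a 64-bit target):

declare i32 @posix_memalign(i8** nocapture, i64, i64)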

//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store.  This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

struct X { int i; };
int foo (int x) {
  struct X a;
  struct X b;
  struct X *p;
  a.i = 1;
  b.i = 2;
  if (x)
    p = &a;
  else
    p = &b;
  return p->i;
}

//===---------------------------------------------------------------------===//

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629

With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
      opt -mem2reg -gvn -instcombine | llvm-dis

we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, since loading from null is
undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character.  See PR3253 for some notes on
codegen.
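
For the inlined constant-reject case, a sketch of the switch form
(hypothetical helper for strcspn(s, "ab")):

size_t strcspn_ab(const char *s) {
  size_t i = 0;
  for (;; ++i) {
    switch (s[i]) {        /* non-matching chars fall through and loop */
    case '\0': case 'a': case 'b':
      return i;
    }
  }
}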

456.hmmer apparently uses strcspn and strspn a lot.  471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
  else if (strchr ("<>", *intel_parser.op_string))

Those should be turned into a switch.

//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

        %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)    ; uses = 1
        %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
        call void @llvm.memcpy.i32(i8* %endptr,
          i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
        %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple reasons.  First, in this:

        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)

The strlen could be replaced with: %strlen = sub %3073, %3072, because the
strcpy call returns a pointer to the end of the string.  Based on that, the
endptr GEP just becomes equal to 3073, which eliminates a strlen call and GEP.

Second, the memcpy+strlen strlen can be replaced with:

        %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

Because the destination was just copied into the specified memory buffer.
This, in turn, can be constant folded to "4".

In other code, it contains:

        %endptr6978 = bitcast i8* %endptr69 to i32*
        store i32 7107374, i32* %endptr6978, align 1
        %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded.  Whatever is producing this should
probably be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit
        %682 = getelementptr i8** %argv, i32 6          ; <i8**> [#uses=2]
        %683 = load i8** %682, align 4                  ; <i8*> [#uses=4]
        %684 = load i8* %683, align 1                   ; <i8> [#uses=1]
        %685 = icmp eq i8 %684, 0                       ; <i1> [#uses=1]
        br i1 %685, label %bb10, label %bb9

bb9:            ; preds = %bb8
        %686 = call i32 @strlen(i8* %683) nounwind readonly
        %687 = icmp ugt i32 %686, 254                   ; <i1> [#uses=1]
        br i1 %687, label %bb10, label %bb11

bb10:           ; preds = %bb9, %bb8
        %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

//===---------------------------------------------------------------------===//
1473I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
1474which looks like:
1475 %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0
1476
1477
1478bb62: ; preds = %bb55, %bb53
1479 %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
1480 %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
1481 %172 = add i32 %171, -1 ; <i32> [#uses=1]
1482 %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172
1483
1484... no stores ...
1485 br i1 %or.cond, label %bb65, label %bb72
1486
1487bb65: ; preds = %bb62
1488 store i8 0, i8* %173, align 1
1489 br label %bb72
1490
1491bb72: ; preds = %bb65, %bb62
1492 %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
1493 %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
1494
1495Note that on the bb62->bb72 path, that the %177 strlen call is partially
1496redundant with the %171 call. At worst, we could shove the %177 strlen call
1497up into the bb65 block moving it out of the bb62->bb72 path. However, note
1498that bb65 stores to the string, zeroing out the last byte. This means that on
1499that path the value of %177 is actually just %171-1. A sub is cheaper than a
1500strlen!
1501
1502This pattern repeats several times, basically doing:
1503
1504 A = strlen(P);
1505 P[A-1] = 0;
1506 B = strlen(P);
1507 where it is "obvious" that B = A-1.
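
The desired rewrite, sketched in C (valid assuming nothing else writes the
string in between):

  A = strlen(P);
  P[A-1] = 0;
  B = A - 1;     /* reuse A instead of a second strlen(P) */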

//===---------------------------------------------------------------------===//

186.crafty also contains this code:

%1906 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1907 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1906
%1908 = call i8* @strcpy(i8* %1907, i8* %1905) nounwind align 1
%1909 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1910 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1909

With the strcpy converted to stpcpy (so %1908 points at the new terminating
NUL), the last strlen is computable as %1908-@pgn_event, which means
%1910 = %1908.

//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
  i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
  i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call @printf(i8* ...  @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

  memcpy(globalarray, "string");
  printf(...,  globalarray);

Anyway, since printf just reads the memory, forward substituting the string
directly into the printf call eliminates the reads from globalarray.  Since
this pattern occurs frequently in crafty (due to the "DisplayTime" and other
similar functions), there are many stores to "out".  Once all the printfs
stop using "out", all that is left is the memcpys into it.  This should allow
globalopt to remove the "stored only" global.

//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
  %tmp0 = load i8* %p
  %tmp1 = ashr i8 %tmp0, 5
  %tmp2 = sext i8 %tmp1 to i32
  ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load with a shift.
For example, on x86 this currently gets this:

        movb    (%eax), %al
        sarb    $5, %al
        movsbl  %al, %eax

while it could get this:

        movsbl  (%eax), %eax
        sarl    $5, %eax

//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; }     // --> return false
int test2(int x) { return 2-x == x; }    // --> return x == 1 ?

Always foldable for odd constants, what is the rule for even?
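
A hedged answer: in wrapping arithmetic, C-x == x iff 2*x == C (mod 2^32).
For odd C there is no solution, hence "return false".  For even C there are
two solutions, x = C/2 and x = C/2 + 0x80000000, so at the IR level (where
the sub wraps) the even case appears to fold to a masked compare, e.g.:

int test2(int x) { return (x & 0x7fffffff) == 1; }  // 2-x == x, wrapping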

//===---------------------------------------------------------------------===//

PR 3381: GEP to field of size 0 inside a struct could be turned into a GEP
for the next field in the struct (which is at the same address).

For example: a store of float into { {{}}, float } could be turned into a
store to the float directly.

//===---------------------------------------------------------------------===//

The arg promotion pass should make use of nocapture to make its alias analysis
much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int int_char(char m) {if(m>7) return 0; return m;}

//===---------------------------------------------------------------------===//

int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }

Generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %a, 128                            ; <i32> [#uses=1]
  %1 = icmp eq i32 %0, 0                          ; <i1> [#uses=1]
  %2 = or i32 %b, 128                             ; <i32> [#uses=1]
  %3 = and i32 %b, -129                           ; <i32> [#uses=1]
  %b_addr.0 = select i1 %1, i32 %3, i32 %2        ; <i32> [#uses=1]
  ret i32 %b_addr.0
}

However, it's functionally equivalent to:

  b = (b & ~0x80) | (a & 0x80);

Which generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %b, -129                           ; <i32> [#uses=1]
  %1 = and i32 %a, 128                            ; <i32> [#uses=1]
  %2 = or i32 %0, %1                              ; <i32> [#uses=1]
  ret i32 %2
}

This can be generalized for other forms:

  b = (b & ~0x80) | (a & 0x40) << 1;

//===---------------------------------------------------------------------===//

These two functions produce different code.  They shouldn't:

#include <stdint.h>

uint8_t p1(uint8_t b, uint8_t a) {
  b = (b & ~0xc0) | (a & 0xc0);
  return (b);
}

uint8_t p2(uint8_t b, uint8_t a) {
  b = (b & ~0x40) | (a & 0x40);
  b = (b & ~0x80) | (a & 0x80);
  return (b);
}

define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63                              ; <i8> [#uses=1]
  %1 = and i8 %a, -64                             ; <i8> [#uses=1]
  %2 = or i8 %1, %0                               ; <i8> [#uses=1]
  ret i8 %2
}

define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63                              ; <i8> [#uses=1]
  %.masked = and i8 %a, 64                        ; <i8> [#uses=1]
  %1 = and i8 %a, -128                            ; <i8> [#uses=1]
  %2 = or i8 %1, %0                               ; <i8> [#uses=1]
  %3 = or i8 %2, %.masked                         ; <i8> [#uses=1]
  ret i8 %3
}
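
The missed fold in p2 is merging the two single-bit copies from 'a' into one
mask; a hand-written equivalent (which is exactly what p1 computes) would be:

uint8_t p2_folded(uint8_t b, uint8_t a) {
  return (b & 0x3f) | (a & 0xc0);
}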

//===---------------------------------------------------------------------===//

IPSCCP does not currently propagate argument-dependent constants through
functions where it cannot see all of the callers.  This includes functions
with normal external linkage as well as templates, C99 inline functions, etc.
Specifically, it does nothing to:

define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
entry:
  %0 = add nsw i32 %y, %z
  %1 = mul i32 %0, %x
  %2 = mul i32 %y, %z
  %3 = add nsw i32 %1, %2
  ret i32 %3
}

define i32 @test2() nounwind {
entry:
  %0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
  ret i32 %0
}

It would be interesting to extend IPSCCP to handle simple cases like this,
where all of the arguments to a call are constant.  Because IPSCCP runs before
inlining, trivial templates and inline functions are not yet inlined.  The
results for a function + set of constant arguments should be memoized in a map.

//===---------------------------------------------------------------------===//

The libcall constant folding stuff should be moved out of SimplifyLibcalls into
libanalysis' constantfolding logic.  This would allow IPSCCP to be able to
handle simple things like this:

static int foo(const char *X) { return strlen(X); }
int bar() { return foo("abcd"); }
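
With that in place, IPSCCP plus the constant folder could reduce bar to the
equivalent of (a sketch of the desired end state, not current output):

int bar() { return 4; }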

//===---------------------------------------------------------------------===//

InstCombine should use SimplifyDemandedBits to remove the or instruction:

define i1 @test(i8 %x, i8 %y) {
  %A = or i8 %x, 1
  %B = icmp ugt i8 %A, 3
  ret i1 %B
}

Currently instcombine calls SimplifyDemandedBits with either all bits or just
the sign bit, if the comparison is obviously a sign test.  In this case, we
only need all but the bottom two bits from %A, and if we gave that mask to SDB
it would delete the or instruction for us.

//===---------------------------------------------------------------------===//

functionattrs doesn't know much about memcpy/memset.  This function should be
marked readnone rather than readonly, since it only twiddles local memory, but
functionattrs doesn't handle memset/memcpy/memmove aggressively:

struct X { int *p; int *q; };
int foo() {
  int i = 0, j = 1;
  struct X x, y;
  int **p;
  y.p = &i;
  x.q = &j;
  p = __builtin_memcpy (&x, &y, sizeof (int *));
  return **p;
}

//===---------------------------------------------------------------------===//

Missed instcombine transformation:
define i1 @a(i32 %x) nounwind readnone {
entry:
  %cmp = icmp eq i32 %x, 30
  %sub = add i32 %x, -30
  %cmp2 = icmp ugt i32 %sub, 9
  %or = or i1 %cmp, %cmp2
  ret i1 %or
}
This should be optimized to a single compare.  Testcase derived from gcc.
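
A hedged sketch of the folded form: the result is false exactly when x-30 is
in [1,9], so the whole thing collapses to one sub and one unsigned compare:

int a_folded(int x) { return (unsigned)(x - 31) > 8; }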

//===---------------------------------------------------------------------===//

Missed instcombine or reassociate transformation:
int a(int a, int b) { return (a==12)&(b>47)&(b<58); }

The sgt and slt should be combined into a single comparison.  Testcase derived
from gcc.
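
A sketch of the combined form: 47 < b && b < 58 is the half-open range
[48,58), i.e. a single unsigned compare after a sub:

int a_folded(int a, int b) { return (a == 12) & ((unsigned)(b - 48) < 10); }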

//===---------------------------------------------------------------------===//

Missed instcombine transformation:

  %382 = srem i32 %tmp14.i, 64                    ; [#uses=1]
  %383 = zext i32 %382 to i64                     ; [#uses=1]
  %384 = shl i64 %381, %383                       ; [#uses=1]
  %385 = icmp slt i32 %tmp14.i, 64                ; [#uses=1]

The srem can be transformed to an and because if %tmp14.i is negative, the
shift is undefined.  Testcase derived from 403.gcc.
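
In C terms (a sketch): since a shift amount of 64 or more is undefined, the
optimizer may assume %tmp14.i is non-negative here, and for non-negative x,
x % 64 == (x & 63):

unsigned long long shl_mod64(unsigned long long v, int x) {
  return v << (x & 63);  /* instead of v << (x % 64) */
}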

//===---------------------------------------------------------------------===//

This is a range comparison on a divided result (from 403.gcc):

  %1337 = sdiv i32 %1336, 8                       ; [#uses=1]
  %.off.i208 = add i32 %1336, 7                   ; [#uses=1]
  %1338 = icmp ult i32 %.off.i208, 15             ; [#uses=1]

We already catch this (removing the sdiv) if there isn't an add; we should
handle the 'add' as well.  This is a common idiom with its builtin_alloca
code.  C testcase:

int a(int x) { return (unsigned)(x/16+7) < 15; }
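
For truncating division, x/16+7 lands in [0,15) exactly when x is in
[-127,127], so a hand-folded form (a sketch) would be:

int a_folded(int x) { return (unsigned)(x + 127) < 255; }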

Another similar case involves truncations on 64-bit targets:

  %361 = sdiv i64 %.046, 8                        ; [#uses=1]
  %362 = trunc i64 %361 to i32                    ; [#uses=2]
...
  %367 = icmp eq i32 %362, 0                      ; [#uses=1]

//===---------------------------------------------------------------------===//

Missed instcombine/dagcombine transformation:
define void @lshift_lt(i8 zeroext %a) nounwind {
entry:
  %conv = zext i8 %a to i32
  %shl = shl i32 %conv, 3
  %cmp = icmp ult i32 %shl, 33
  br i1 %cmp, label %if.then, label %if.end

if.then:
  tail call void @bar() nounwind
  ret void

if.end:
  ret void
}
declare void @bar() nounwind

The shift should be eliminated: since %a is at most 255, "%shl ult 33" is
equivalent to "%conv ult 5".  Testcase derived from gcc.

//===---------------------------------------------------------------------===//

These compile into different code: one gets recognized as a switch and the
other doesn't, due to phase ordering issues (PR6212):

int test1(int mainType, int subType) {
  if (mainType == 7)
    subType = 4;
  else if (mainType == 9)
    subType = 6;
  else if (mainType == 11)
    subType = 9;
  return subType;
}

int test2(int mainType, int subType) {
  if (mainType == 7)
    subType = 4;
  if (mainType == 9)
    subType = 6;
  if (mainType == 11)
    subType = 9;
  return subType;
}

//===---------------------------------------------------------------------===//

The following test case (from PR6576):

define i32 @mul(i32 %a, i32 %b) nounwind readnone {
entry:
  %cond1 = icmp eq i32 %b, 0                      ; <i1> [#uses=1]
  br i1 %cond1, label %exit, label %bb.nph
bb.nph:                                           ; preds = %entry
  %tmp = mul i32 %b, %a                           ; <i32> [#uses=1]
  ret i32 %tmp
exit:                                             ; preds = %entry
  ret i32 0
}

could be reduced to:

define i32 @mul(i32 %a, i32 %b) nounwind readnone {
entry:
  %tmp = mul i32 %b, %a
  ret i32 %tmp
}

//===---------------------------------------------------------------------===//

We should use DSE + llvm.lifetime.end to delete dead vtable pointer updates.
See GCC PR34949.

Another interesting case: something similar could be used for variables that
become const after their ctor has finished.  In these cases, globalopt (which
can statically run the constructor) could mark the global const (so it gets
put in the readonly section).  A testcase would be:

#include <complex>
using namespace std;
const complex<char> should_be_in_rodata (42,-42);
complex<char> should_be_in_data (42,-42);
complex<char> should_be_in_bss;

Where we currently evaluate the ctors but the globals don't become const
because the optimizer doesn't know they "become const" after the ctor is done.
See GCC PR4131 for more examples.

//===---------------------------------------------------------------------===//

In this code:

long foo(long x) {
  return x > 1 ? x : 1;
}

LLVM emits a comparison with 1 instead of 0.  0 would be equivalent and
cheaper on most targets.

LLVM prefers comparisons with zero over non-zero in general, but in this case
it chooses instead to keep the max operation obvious.

//===---------------------------------------------------------------------===//

Take the following testcase on x86-64 (similar testcases exist for all targets
with addc/adde):

define void @a(i64* nocapture %s, i64* nocapture %t, i64 %a, i64 %b,
i64 %c) nounwind {
entry:
  %0 = zext i64 %a to i128                        ; <i128> [#uses=1]
  %1 = zext i64 %b to i128                        ; <i128> [#uses=1]
  %2 = add i128 %1, %0                            ; <i128> [#uses=2]
  %3 = zext i64 %c to i128                        ; <i128> [#uses=1]
  %4 = shl i128 %3, 64                            ; <i128> [#uses=1]
  %5 = add i128 %4, %2                            ; <i128> [#uses=1]
  %6 = lshr i128 %5, 64                           ; <i128> [#uses=1]
  %7 = trunc i128 %6 to i64                       ; <i64> [#uses=1]
  store i64 %7, i64* %s, align 8
  %8 = trunc i128 %2 to i64                       ; <i64> [#uses=1]
  store i64 %8, i64* %t, align 8
  ret void
}

Generated code:
        addq    %rcx, %rdx
        movl    $0, %eax
        adcq    $0, %rax
        addq    %r8, %rax
        movq    %rax, (%rdi)
        movq    %rdx, (%rsi)
        ret

Expected code:
        addq    %rcx, %rdx
        adcq    $0, %r8
        movq    %r8, (%rdi)
        movq    %rdx, (%rsi)
        ret

The generated SelectionDAG has an ADD of an ADDE, where both operands of the
ADDE are zero.  Replacing one of the operands of the ADDE with the other
operand of the ADD, and replacing the ADD with the ADDE, should give the
desired result.

(That said, we are doing a lot better than gcc on this testcase. :) )

//===---------------------------------------------------------------------===//

Switch lowering generates less than ideal code for the following switch:
define void @a(i32 %x) nounwind {
entry:
  switch i32 %x, label %if.end [
    i32 0, label %if.then
    i32 1, label %if.then
    i32 2, label %if.then
    i32 3, label %if.then
    i32 5, label %if.then
  ]
if.then:
  tail call void @foo() nounwind
  ret void
if.end:
  ret void
}
declare void @foo()

Generated code on x86-64 (other platforms give similar results):
a:
        cmpl    $5, %edi
        ja      .LBB0_2
        movl    %edi, %eax
        movl    $47, %ecx
        btq     %rax, %rcx
        jb      .LBB0_3
.LBB0_2:
        ret
.LBB0_3:
        jmp     foo  # TAILCALL

The movl+movl+btq+jb could be simplified to a cmpl+jne.

Or, if we wanted to be really clever, we could simplify the whole thing to
something like the following, which eliminates a branch:
        xorl    $1, %edi
        cmpl    $4, %edi
        ja      .LBB0_2
        jmp     foo  # TAILCALL
.LBB0_2:
        ret

//===---------------------------------------------------------------------===//

Given a branch where the two target blocks are identical ("ret i32 %b" in
both), simplifycfg will simplify them away.  But not so for a switch
statement:

define i32 @f(i32 %a, i32 %b) nounwind readnone {
entry:
  switch i32 %a, label %bb3 [
    i32 4, label %bb
    i32 6, label %bb
  ]

bb:             ; preds = %entry, %entry
  ret i32 %b

bb3:            ; preds = %entry
  ret i32 %b
}

//===---------------------------------------------------------------------===//

clang -O3 fails to devirtualize this virtual inheritance case (GCC PR45875);
looks related to PR3100:

struct c1 {};
struct c10 : c1{
  virtual void foo ();
};
struct c11 : c10, c1{
  virtual void f6 ();
};
struct c28 : virtual c11{
  void f6 ();
};
void check_c28 () {
  c28 obj;
  c11 *ptr = &obj;
  ptr->f6 ();
}

//===---------------------------------------------------------------------===//

We compile this:

int foo(int a) { return (a & (~15)) / 16; }

Into:

define i32 @foo(i32 %a) nounwind readnone ssp {
entry:
  %and = and i32 %a, -16
  %div = sdiv i32 %and, 16
  ret i32 %div
}

but (X & -A)/A is X >> log2(A) when A is a power of 2: the mask makes the
division exact, so the sdiv is just an arithmetic shift, and this case should
be instcombined into just "a >> 4".

We do get this at the codegen level, so something knows about it, but
instcombine should catch it earlier:

_foo:                                   ## @foo
## BB#0:                                ## %entry
        movl    %edi, %eax
        sarl    $4, %eax
        ret

//===---------------------------------------------------------------------===//

This code (from GCC PR28685):

int test(int a, int b) {
  int lt = a < b;
  int eq = a == b;
  if (lt)
    return 1;
  return eq;
}

Is compiled to:

define i32 @test(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %cmp = icmp slt i32 %a, %b
  br i1 %cmp, label %return, label %if.end

if.end:                                           ; preds = %entry
  %cmp5 = icmp eq i32 %a, %b
  %conv6 = zext i1 %cmp5 to i32
  ret i32 %conv6

return:                                           ; preds = %entry
  ret i32 1
}

it could be:

define i32 @test__(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = icmp sle i32 %a, %b
  %retval = zext i1 %0 to i32
  ret i32 %retval
}

//===---------------------------------------------------------------------===//