Target Independent Opportunities:

//===---------------------------------------------------------------------===//

Dead argument elimination should be enhanced to handle cases when an argument is
dead to an externally visible function. Though the argument can't be removed
from the externally visible function, the caller doesn't need to pass it in.
For example in this testcase:

  void foo(int X) __attribute__((noinline));
  void foo(int X) { sideeffect(); }
  void bar(int A) { foo(A+1); }

We compile bar to:

define void @bar(i32 %A) nounwind ssp {
  %0 = add nsw i32 %A, 1   ; <i32> [#uses=1]
  tail call void @foo(i32 %0) nounwind noinline ssp
  ret void
}

The add is dead, we could pass in 'i32 undef' instead. This occurs for C++
templates etc, which usually have linkonce_odr/weak_odr linkage, not internal
linkage.

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs. Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

We should recognize various "overflow detection" idioms and translate them into
llvm.uadd.with.overflow and similar intrinsics. Here is a multiply idiom:

unsigned int mul(unsigned int a, unsigned int b) {
  if ((unsigned long long)a*b > 0xffffffff)
    exit(0);
  return a*b;
}

The legalization code for mul-with-overflow needs to be made more robust before
this can be implemented though.

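For reference, a hedged sketch (function name made up) of the common C idiom
that should map onto llvm.uadd.with.overflow itself:

#include <stdlib.h>
unsigned int add(unsigned int a, unsigned int b) {
  if (a + b < a)      /* unsigned addition wrapped, i.e. overflowed */
    exit(0);
  return a + b;
}
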
//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math). Misc/mandel will like this. :) This isn't
safe in general, even on darwin. See the libm implementation of hypot for
examples (which special case when x/y are exactly zero to get signed zeros etc
right).

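A hedged example of the kind of caller that would benefit (hypot comes from
<math.h>; the expansion would only be legal under the conditions above):

#include <math.h>
double dist(double x, double y) {
  return hypot(x, y);   /* candidate for llvm.sqrt(x*x + y*y) */
}
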
//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
  x = 1ULL << i;

into:
  long long tmp = 1;
  for (i = ...; ++i, tmp+=tmp)
    x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

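A hedged C-level example of the pattern (only the sign bit of the loaded word
matters, so a one-byte load of the byte holding the sign bit would suffice):

int is_negative(int *P) {
  return *P < 0;
}
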
//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
   return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

First, the intrinsic needs to be extended to support integers, and second the
code generator needs to be enhanced to lower these to multiplication trees.

//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

This is blocked on not handling X*X*X -> powi(X, 3) (see note above). The issue
is that we end up getting t = 2*X, s = t*t and don't turn this into 4*X*X,
which is the same number of multiplies and is canonical, because the 2*X has
multiple uses. Here's a simple example:

define i32 @test15(i32 %X1) {
  %B = mul i32 %X1, 47   ; X1*47
  %C = mul i32 %B, %B
  ret i32 %C
}


//===---------------------------------------------------------------------===//

Reassociate should handle the example in GCC PR16157:

extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;
void f () {  /* this can be optimized to four additions... */
  b4 = a4 + a3 + a2 + a1 + a0;
  b3 = a3 + a2 + a1 + a0;
  b2 = a2 + a1 + a0;
  b1 = a1 + a0;
}

This requires reassociating to forms of expressions that are already available,
something that reassoc doesn't think about yet.


//===---------------------------------------------------------------------===//

This function: (derived from GCC PR19988)
double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x + -0.1234 * y));
}

compiles to:
_foo:
  movapd %xmm1, %xmm2
  mulsd  LCPI1_1(%rip), %xmm1
  mulsd  LCPI1_0(%rip), %xmm2
  addsd  %xmm0, %xmm1
  addsd  %xmm0, %xmm2
  movapd %xmm1, %xmm0
  mulsd  %xmm2, %xmm0
  ret

Reassociate should be able to turn it into:

double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x - 0.1234 * y));
}

Which allows the multiply by constant to be CSE'd, producing:

_foo:
  mulsd  LCPI1_0(%rip), %xmm1
  movapd %xmm1, %xmm2
  addsd  %xmm0, %xmm2
  subsd  %xmm1, %xmm0
  mulsd  %xmm2, %xmm0
  ret

This doesn't need -ffast-math support at all. This is particularly bad because
the llvm-gcc frontend is canonicalizing the latter into the former, but clang
doesn't have this problem.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j,int *l) { return memcmp(j,l,4); }
int h(int *j, int *l) { return *j - *l; }

this could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize. It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size. It works but can be overly conservative as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns. Instead
of:

  movl 136(%esp), %eax
  cmpl $0, %eax
  je LBB16_2  #cond_next
LBB16_1:  #cond_true
  incl _foo
LBB16_2:  #cond_next

emit:
  movl _foo, %eax
  cmpl $1, %edi
  sbbl $-1, %eax
  movl %eax, _foo

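For reference, a hedged sketch of the sort of C source that produces the
pattern above (names are made up):

int foo;
void test(int x) {
  if (x)
    foo++;
}
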
//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
  double sincos(double x, double *sin, double *cos);
  float sincosf(float x, float *sin, float *cos);
  long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers. See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs. We could even make an intrinsic for this
if anyone cared enough about sincos.

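A hedged example of the source pattern that should combine (sin/cos are the
usual <math.h> functions):

#include <math.h>
void test(double x, double *s, double *c) {
  *s = sin(x);   /* these two calls share an argument ...          */
  *c = cos(x);   /* ... and could become one sincos(x, s, c) call  */
}
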
//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

  for(i=0; i<reg->size; i++)
    {
      /* Flip the target bit of each basis state */
      reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
    }

Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be just
so cool to turn it into something like:

  long long Res = ((MAX_UNSIGNED) 1 << target);
  if (target < 32) {
    for(i=0; i<reg->size; i++)
      reg->node[i].state ^= Res & 0xFFFFFFFFULL;
  } else {
    for(i=0; i<reg->size; i++)
      reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL
  }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias reg->node[i], but
this requires TBAA.

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

BITBOARD = unsigned long long
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & i;
  return count;
}

This sort of thing should be added to the loop idiom pass.

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X / C1 ), C2
when X, C1, and C2 are unsigned. Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match. See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

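As a hedged illustration, the construct in question looks like:

unsigned f(unsigned x) {
  return (x / 10) == 2;   /* compare of a divide against a constant */
}
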
//===---------------------------------------------------------------------===//

[LOOP OPTIMIZATION]

SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization
opportunities in its double_array_divs_variable function: it needs loop
interchange, memory promotion (which LICM already does), vectorization and
variable trip count loop unrolling (since it has a constant trip count). ICC
apparently produces this very nice code with -ffast-math:

..B1.70:            # Preds ..B1.70 ..B1.69
  mulpd %xmm0, %xmm1    #108.2
  mulpd %xmm0, %xmm1    #108.2
  mulpd %xmm0, %xmm1    #108.2
  mulpd %xmm0, %xmm1    #108.2
  addl  $8, %edx        #
  cmpl  $131072, %edx   #108.2
  jb ..B1.70            # Prob 99%   #108.2

It would be better to count down to zero, but this is a lot better than what we
do.

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
  U64 effective_addr2;
  U32 temp = *inst;
  int r1 = (temp >> 20) & 0xf;
  int b2 = (temp >> 16) & 0xf;
  effective_addr2 = temp & 0xfff;
  if (b2) effective_addr2 += regs[b2];
  b2 = (temp >> 12) & 0xf;
  if (b2) effective_addr2 += regs[b2];
  effective_addr2 &= regs[4];
  if ((effective_addr2 & 3) == 0)
    return 1;
  return 0;
}

Note that only the low 2 bits of effective_addr2 are used. On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags. On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

PHI Slicing could be extended to do this.

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has from TargetData. This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two near identical IV's (after promotion) on PPC/ARM:

LBB1_2:
  ldr r3, LCPI1_0
  ldr r3, [r3]
  strh r2, [r3]
  ldr r3, LCPI1_1
  ldr r3, [r3]
  strh r1, [r3]
  add r1, r1, #4
  add r2, r2, #1   <- [0,+,1]
  sub r0, r0, #1   <- [0,-,1]
  cmp r0, #0
  bne LBB1_2

LSR should reuse the "+" IV for the exit test.

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
  %tmp.1 = and i32 %a, 1   ; <i32> [#uses=1]
  %tmp.2 = icmp ne i32 %tmp.1, 0   ; <i1> [#uses=1]
  br i1 %tmp.2, label %then.0, label %else.0

then.0:   ; preds = %entry
  %tmp.5 = add i32 %a, -1   ; <i32> [#uses=1]
  %tmp.3 = call i32 @t4( i32 %tmp.5 )   ; <i32> [#uses=1]
  br label %return

else.0:   ; preds = %entry
  %tmp.7 = icmp ne i32 %a, 0   ; <i1> [#uses=1]
  br i1 %tmp.7, label %then.1, label %return

then.1:   ; preds = %else.0
  %tmp.11 = add i32 %a, -2   ; <i32> [#uses=1]
  %tmp.9 = call i32 @t4( i32 %tmp.11 )   ; <i32> [#uses=1]
  br label %return

return:   ; preds = %then.1, %else.0, %then.0
  %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                      [ %tmp.9, %then.1 ]
  ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative. "return foo() << 1" can be tail recursion eliminated.

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
  %tmp = load i32* %x   ; <i32> [#uses=0]
  %tmp.foo = call i32 @foo( i32* %x )   ; <i32> [#uses=1]
  ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
  %tmp3 = call i32 @foo( i32* %x )   ; <i32> [#uses=1]
  ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass. Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
  subl $28, %esp
  call "L1$pb"
"L1$pb":
  popl %eax
  cmpl $0, 32(%esp)
  je LBB1_2  # cond_true
LBB1_1:  # return
  # ...
  addl $28, %esp
  ret
LBB1_2:  # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block. It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it. This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86. If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

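A hedged example of the kind of sparse switch this would apply to (the case
values are made up):

int classify(int x) {
  switch (x) {
  case 17:    return 1;
  case 400:   return 2;
  case 9001:  return 3;
  case 65536: return 4;
  default:    return 0;
  }
}
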
//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations. On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar. On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable. For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-m64 -O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:   ## @_Z9GetHotKeyv
  movq _m_HotKey@GOTPCREL(%rip), %rax
  movzwl (%rax), %ecx
  movzbl 2(%rax), %edx
  shlq $16, %rdx
  orq %rcx, %rdx
  movzbl 3(%rax), %ecx
  shlq $24, %rcx
  orq %rdx, %rcx
  movzbl 4(%rax), %eax
  shlq $32, %rax
  orq %rcx, %rax
  ret

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

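A hedged example of code that would map onto such a node (floor from <math.h>):

#include <math.h>
double round_down(double x) {
  return floor(x);   /* a single instruction on targets with a legal floor */
}
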
//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,0,1,0,1,0,1,0};
  foo(input);
}

Clang compiles this into:

  call void @llvm.memset.p0i8.i64(i8* %tmp, i8 0, i64 64, i32 16, i1 false)
  %0 = getelementptr [8 x i64]* %input, i64 0, i64 0
  store i64 1, i64* %0, align 16
  %1 = getelementptr [8 x i64]* %input, i64 0, i64 2
  store i64 1, i64* %1, align 16
  %2 = getelementptr [8 x i64]* %input, i64 0, i64 4
  store i64 1, i64* %2, align 16
  %3 = getelementptr [8 x i64]* %input, i64 0, i64 6
  store i64 1, i64* %3, align 16

Which gets codegen'd into:

  pxor %xmm0, %xmm0
  movaps %xmm0, -16(%rbp)
  movaps %xmm0, -32(%rbp)
  movaps %xmm0, -48(%rbp)
  movaps %xmm0, -64(%rbp)
  movq $1, -64(%rbp)
  movq $1, -48(%rbp)
  movq $1, -32(%rbp)
  movq $1, -16(%rbp)

It would be better to have 4 movq's of 0 instead of the movaps's.

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef". Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop. One trivial example is:

#include <stdio.h>
int main() {
  int nRet = 17;
  int nLoop;
  for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
    if ( nLoop & 1 )
      nRet += 2;
    else
      nRet -= 1;
  }
  return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size. The resultant code would then also be suitable for
exit value computation.

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc, x86,
etc. On X86, we miss a bunch of 'rotate by variable' cases because the rotate
matching code in dag combine doesn't look through truncates aggressively
enough. Here are some testcases reduced from GCC PR17886:

unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

//===---------------------------------------------------------------------===//

This (and similar related idioms):

unsigned int foo(unsigned char i) {
  return i | (i<<8) | (i<<16) | (i<<24);
}

compiles into:

define i32 @foo(i8 zeroext %i) nounwind readnone ssp noredzone {
entry:
  %conv = zext i8 %i to i32
  %shl = shl i32 %conv, 8
  %shl5 = shl i32 %conv, 16
  %shl9 = shl i32 %conv, 24
  %or = or i32 %shl9, %conv
  %or6 = or i32 %or, %shl5
  %or10 = or i32 %or6, %shl
  ret i32 %or10
}

it would be better as:

unsigned int bar(unsigned char i) {
  unsigned int j=i | (i << 8);
  return j | (j<<16);
}

aka:

define i32 @bar(i8 zeroext %i) nounwind readnone ssp noredzone {
entry:
  %conv = zext i8 %i to i32
  %shl = shl i32 %conv, 8
  %or = or i32 %shl, %conv
  %shl5 = shl i32 %or, 16
  %or6 = or i32 %shl5, %or
  ret i32 %or6
}

or even i*0x01010101, depending on the speed of the multiplier. The best way to
handle this is to canonicalize it to a multiply in IR and have codegen handle
lowering multiplies to shifts on cpus where shifts are faster.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together. For
example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy. This can only
be done safely if "b" isn't modified between the strlen and memcpy of course.

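A hedged sketch of the pattern (dst is assumed large enough, and src is not
modified in between):

#include <string.h>
void copy(char *dst, const char *src) {
  memcpy(dst, src, strlen(src));   /* candidate for the strcpy-style merge */
}
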
//===---------------------------------------------------------------------===//

We compile this program: (from GCC PR11680)
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

Into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0% 0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0% 0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2). */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper). Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

This is equivalent to the following, where 2863311531 is the multiplicative
inverse of 3, and 1431655766 is ((2^32)-1)/3+1:
void bar(unsigned n) {
  if (n * 2863311531U < 1431655766U)
    true();
}

The same transformation can work with an even modulo with the addition of a
rotate: rotate the result of the multiply to the right by the number of bits
which need to be zero for the condition to be true, and shrink the compare RHS
by the same amount. Unless the target supports rotates, though, that
transformation probably isn't worthwhile.

The transformation can also easily be made to work with non-zero equality
comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0".

//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6   ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0. Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
  if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
    f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK)". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing. Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
  return (abs(x)) >= 0;
}
This should optimize to x == INT_MIN. (With -fwrapv.) Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
  a = (a << 10) | (a >> 22);
  if (a == 123)
    bar ();
}
void
minus_cst (unsigned int a)
{
  unsigned int tem;

  tem = 20 - a;
  if (tem == 5)
    bar ();
}
void
mask_gt (unsigned int a)
{
  /* This is equivalent to a > 15. */
  if ((a & ~7) > 8)
    bar ();
}
void
rshift_gt (unsigned int a)
{
  /* This is equivalent to a > 23. */
  if ((a >> 2) > 5)
    bar ();
}
All should simplify to a single comparison. All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1". Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)". Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b". Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a". Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane. Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c). Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9. Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31". Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)". Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

  %tmp = icmp eq i32 %decl_context, 4
  %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
  %tmp1 = icmp eq i32 %decl_context_addr.0, 1
  %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp || decl_context == 1)

This allows recursive simplifications, tmp1 is used all over the place in
the function, e.g. by:

  %tmp23 = icmp eq i32 %decl_context_addr.1, 0   ; <i1> [#uses=1]
  %tmp24 = xor i1 %tmp1, true   ; <i1> [#uses=1]
  %or.cond8 = and i1 %tmp23, %tmp24   ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

[STORE SINKING]

Store sinking: This code:

void f (int n, int *cond, int *res) {
  int i;
  *res = 0;
  for (i = 0; i < n; i++)
    if (*cond)
      *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out. This gives us this code:

bb:   ; preds = %bb2, %entry
  %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
  %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
  %1 = load i32* %cond, align 4
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %bb2, label %bb1

bb1:   ; preds = %bb
  %3 = xor i32 %.rle, 234
  store i32 %3, i32* %res, align 4
  br label %bb2

bb2:   ; preds = %bb, %bb1
  %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
  %indvar.next = add i32 %i.05, 1
  %exitcond = icmp eq i32 %indvar.next, %n
  br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partial dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.


//===---------------------------------------------------------------------===//
This simple function from 179.art:

int winner, numf2s;
struct { double y; int reset; } *Y;

void find_match() {
  int i;
  winner = 0;
  for (i=0;i<numf2s;i++)
    if (Y[i].y > Y[winner].y)
      winner =i;
}

Compiles into (with clang TBAA):

for.body:   ; preds = %for.inc, %bb.nph
  %indvar = phi i64 [ 0, %bb.nph ], [ %indvar.next, %for.inc ]
  %i.01718 = phi i32 [ 0, %bb.nph ], [ %i.01719, %for.inc ]
  %tmp4 = getelementptr inbounds %struct.anon* %tmp3, i64 %indvar, i32 0
  %tmp5 = load double* %tmp4, align 8, !tbaa !4
  %idxprom7 = sext i32 %i.01718 to i64
  %tmp10 = getelementptr inbounds %struct.anon* %tmp3, i64 %idxprom7, i32 0
  %tmp11 = load double* %tmp10, align 8, !tbaa !4
  %cmp12 = fcmp ogt double %tmp5, %tmp11
  br i1 %cmp12, label %if.then, label %for.inc

if.then:   ; preds = %for.body
  %i.017 = trunc i64 %indvar to i32
  br label %for.inc

for.inc:   ; preds = %for.body, %if.then
  %i.01719 = phi i32 [ %i.01718, %for.body ], [ %i.017, %if.then ]
  %indvar.next = add i64 %indvar, 1
  %exitcond = icmp eq i64 %indvar.next, %tmp22
  br i1 %exitcond, label %for.cond.for.end_crit_edge, label %for.body


It is good that we hoisted the reloads of numf2s and Y out of the loop and
sunk the store to winner out.

However, this is awful on several levels: the conditional truncate in the loop
(-indvars at fault? why can't we completely promote the IV to i64?).

Beyond that, we have a partially redundant load in the loop: if "winner" (aka
%i.01718) isn't updated, we reload Y[winner].y the next time through the loop.
Similarly, the addressing that feeds it (including the sext) is redundant. In
the end we get this generated assembly:

LBB0_2:   ## %for.body
  ## =>This Inner Loop Header: Depth=1
  movsd (%rdi), %xmm0
  movslq %edx, %r8
  shlq $4, %r8
  ucomisd (%rcx,%r8), %xmm0
  jbe LBB0_4
  movl %esi, %edx
LBB0_4:   ## %for.inc
  addq $16, %rdi
  incq %rsi
  cmpq %rsi, %rax
  jne LBB0_2

All things considered this isn't too bad, but we shouldn't need the movslq or
the shlq instruction, or the load folded into ucomisd every time through the
loop.

On an x86-specific topic, if the loop can't be restructured, the movl should be a
cmov.

//===---------------------------------------------------------------------===//

[STORE SINKING]

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store. We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[LOAD PRE CRIT EDGE SPLITTING]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic. This could be handled by GVN with some crazy
symbolic phi translation. The code we get looks like (g is on the stack):

bb2:   ; preds = %bb1
..
  %9 = getelementptr %struct.f* %g, i32 0, i32 0
  store i32 %8, i32* %9, align bel %bb3

bb3:   ; preds = %bb1, %bb2, %bb
  %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
  %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
  %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
  %11 = load i32* %10, align 4

%11 is partially redundant, and in BB2 it should have the value %8.

GCC PR33344 and PR35287 are similar cases.


//===---------------------------------------------------------------------===//

[LOAD PRE]

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite, ones we don't get yet are (checked through loadpre25):

[CRIT EDGE BREAKING]
loadpre3.c predcom-4.c

[PRE OF READONLY CALL]
loadpre5.c

[TURN SELECT INTO BRANCH]
loadpre14.c loadpre15.c

actually a conditional increment: loadpre18.c loadpre19.c

//===---------------------------------------------------------------------===//

[LOAD PRE / STORE SINKING / SPEC HACK]

This is a chunk of code from 456.hmmer:

int f(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int *dpp,
      int *tpdm, int xmb, int *bp, int *ms) {
  int k, sc;
  for (k = 1; k <= M; k++) {
    mc[k] = mpp[k-1] + tpmm[k-1];
    if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
    mc[k] += ms[k];
  }
}

It is very profitable for this benchmark to turn the conditional stores to mc[k]
into a conditional move (select instr in IR) and allow the final store to do the
store. See GCC PR27313 for more details. Note that this is valid to xform even
with the new C++ memory model, since mc[k] is previously loaded and later
stored.

//===---------------------------------------------------------------------===//

[SCALAR PRE]
There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the
GCC testsuite.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite. For example, we get the first example in predcom-1.c, but
miss the second one:

unsigned fib[1000];
unsigned avg[1000];

__attribute__ ((noinline))
void count_averages(int n) {
  int i;
  for (i = 1; i < n; i++)
    avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;
}

which compiles into two loads instead of one in the loop.

predcom-2.c is the same as predcom-1.c

predcom-3.c is very similar but needs loads feeding each other instead of
store->load.


//===---------------------------------------------------------------------===//

[ALIAS ANALYSIS]

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

We should do better analysis of posix_memalign. At the least it should
no-capture its pointer argument, at best, we should know that the out-value
result doesn't point to anything (like malloc). One example of this is in
SingleSource/Benchmarks/Misc/dt.c

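A hedged example of the kind of use that better modeling of posix_memalign
would help with (function name made up):

#include <stdlib.h>
double *alloc_vec(size_t n) {
  void *p = 0;
  if (posix_memalign(&p, 16, n * sizeof(double)))
    return 0;
  return (double *)p;   /* fresh memory: shouldn't be treated as aliasing anything else */
}
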
//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store. This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

struct X { int i; };
int foo (int x) {
  struct X a;
  struct X b;
  struct X *p;
  a.i = 1;
  b.i = 2;
  if (x)
    p = &a;
  else
    p = &b;
  return p->i;
}

//===---------------------------------------------------------------------===//

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
  opt -mem2reg -gvn -instcombine | llvm-dis
we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, loading from null is undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character. See PR3253 for some notes on
codegen.

456.hmmer apparently uses strcspn and strspn a lot. 471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
  else if (strchr ("<>", *intel_parser.op_string)

Those should be turned into a switch.

//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

  %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
  %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
  %strlen = call i32 @strlen(i8* %3072)   ; uses = 1
  %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
  call void @llvm.memcpy.i32(i8* %endptr,
    i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
  %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple of reasons. First, in this code the
strlen that follows the memcpy can be replaced with:

  %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

Because the destination was just copied into the specified memory buffer. This,
in turn, can be constant folded to "4".

In other code, it contains:

  %endptr6978 = bitcast i8* %endptr69 to i32*
  store i32 7107374, i32* %endptr6978, align 1
  %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded. Whatever is producing this should probably
be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:   ; preds = %_ZN18eonImageCalculatorC1Ev.exit
  %682 = getelementptr i8** %argv, i32 6   ; <i8**> [#uses=2]
  %683 = load i8** %682, align 4   ; <i8*> [#uses=4]
  %684 = load i8* %683, align 1   ; <i8> [#uses=1]
  %685 = icmp eq i8 %684, 0   ; <i1> [#uses=1]
  br i1 %685, label %bb10, label %bb9

bb9:   ; preds = %bb8
  %686 = call i32 @strlen(i8* %683) nounwind readonly
  %687 = icmp ugt i32 %686, 254   ; <i1> [#uses=1]
  br i1 %687, label %bb10, label %bb11

bb10:   ; preds = %bb9, %bb8
  %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

1410//===---------------------------------------------------------------------===//
Chris Lattner9fee08f2009-01-08 07:34:55 +00001411
1412I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
1413which looks like:
1414 %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0
1415
1416
1417bb62: ; preds = %bb55, %bb53
1418 %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
1419 %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
1420 %172 = add i32 %171, -1 ; <i32> [#uses=1]
1421 %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172
1422
1423... no stores ...
1424 br i1 %or.cond, label %bb65, label %bb72
1425
1426bb65: ; preds = %bb62
1427 store i8 0, i8* %173, align 1
1428 br label %bb72
1429
1430bb72: ; preds = %bb65, %bb62
1431 %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
1432 %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
1433
1434Note that on the bb62->bb72 path, the %177 strlen call is partially
1435redundant with the %171 call. At worst, we could shove the %177 strlen call
1436up into the bb65 block moving it out of the bb62->bb72 path. However, note
1437that bb65 stores to the string, zeroing out the last byte. This means that on
1438that path the value of %177 is actually just %171-1. A sub is cheaper than a
1439strlen!
1440
1441This pattern repeats several times, basically doing:
1442
1443 A = strlen(P);
1444 P[A-1] = 0;
1445 B = strlen(P);
1446 where it is "obvious" that B = A-1.
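
In other words, the final strlen should become a subtraction, roughly:

 A = strlen(P);
 P[A-1] = 0;
 B = A-1;   // reuse the earlier result instead of a second strlen call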
1447
1448//===---------------------------------------------------------------------===//
1449
Chris Lattner9fee08f2009-01-08 07:34:55 +00001450186.crafty has this interesting pattern with the "out.4543" variable:
1451
1452call void @llvm.memcpy.i32(
1453 i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
1454 i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
1455%101 = call @printf(i8* ... @out.4543, i32 0, i32 0)) nounwind
1456
1457It is basically doing:
1458
1459 memcpy(globalarray, "string");
1460 printf(..., globalarray);
1461
1462Anyway, by knowing that printf just reads the memory and forward substituting
1463the string directly into the printf, this eliminates reads from globalarray.
1464Since this pattern occurs frequently in crafty (due to the "DisplayTime" and
1465other similar functions) there are many stores to "out". Once all the printfs
1466stop using "out", all that is left is the memcpy's into it. This should allow
1467globalopt to remove the "stored only" global.
1468
1469//===---------------------------------------------------------------------===//
1470
Dan Gohman8289b052009-01-20 01:07:33 +00001471This code:
1472
1473define inreg i32 @foo(i8* inreg %p) nounwind {
1474 %tmp0 = load i8* %p
1475 %tmp1 = ashr i8 %tmp0, 5
1476 %tmp2 = sext i8 %tmp1 to i32
1477 ret i32 %tmp2
1478}
1479
1480could be dagcombine'd to a sign-extending load with a shift.
1481For example, on x86 this currently gets this:
1482
1483 movb (%eax), %al
1484 sarb $5, %al
1485 movsbl %al, %eax
1486
1487while it could get this:
1488
1489 movsbl (%eax), %eax
1490 sarl $5, %eax
1491
1492//===---------------------------------------------------------------------===//
Chris Lattner256baa42009-01-22 07:16:03 +00001493
1494GCC PR31029:
1495
1496int test(int x) { return 1-x == x; } // --> return false
1497int test2(int x) { return 2-x == x; } // --> return x == 1 ?
1498
1499Always foldable for odd constants: C-x == x would require x+x == C, and x+x is
always even.  What is the rule for even constants?
1500
1501//===---------------------------------------------------------------------===//
1502
Torok Edwine46a6862009-01-24 19:30:25 +00001503PR 3381: GEP to field of size 0 inside a struct could be turned into GEP
1504for next field in struct (which is at same address).
1505
1506For example: store of float into { {{}}, float } could be turned into a store to
1507the float directly.
1508
Torok Edwin474479f2009-02-20 18:42:06 +00001509//===---------------------------------------------------------------------===//
Nick Lewycky20babb12009-02-25 06:52:48 +00001510
Chris Lattner32c5f172009-05-11 17:41:40 +00001511The arg promotion pass should make use of nocapture to make its alias analysis
1512stuff much more precise.
1513
1514//===---------------------------------------------------------------------===//
1515
1516The following functions should be optimized to use a select instead of a
1517branch (from gcc PR40072):
1518
1519char char_int(int m) {if(m>7) return 0; return m;}
1520int int_char(char m) {if(m>7) return 0; return m;}
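
For char_int, the branch-free form would look roughly like this (hand-written
sketch; the exact parameter/return attributes are guesses):

define signext i8 @char_int(i32 %m) nounwind readnone {
entry:
  %cmp = icmp sgt i32 %m, 7
  %conv = trunc i32 %m to i8
  %retval = select i1 %cmp, i8 0, i8 %conv
  ret i8 %retval
}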
1521
1522//===---------------------------------------------------------------------===//
1523
Bill Wendling5a569272009-10-27 22:48:31 +00001524int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }
1525
1526Generates this:
1527
1528define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
1529entry:
1530 %0 = and i32 %a, 128 ; <i32> [#uses=1]
1531 %1 = icmp eq i32 %0, 0 ; <i1> [#uses=1]
1532 %2 = or i32 %b, 128 ; <i32> [#uses=1]
1533 %3 = and i32 %b, -129 ; <i32> [#uses=1]
1534 %b_addr.0 = select i1 %1, i32 %3, i32 %2 ; <i32> [#uses=1]
1535 ret i32 %b_addr.0
1536}
1537
1538However, it's functionally equivalent to:
1539
1540 b = (b & ~0x80) | (a & 0x80);
1541
1542Which generates this:
1543
1544define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
1545entry:
1546 %0 = and i32 %b, -129 ; <i32> [#uses=1]
1547 %1 = and i32 %a, 128 ; <i32> [#uses=1]
1548 %2 = or i32 %0, %1 ; <i32> [#uses=1]
1549 ret i32 %2
1550}
1551
1552This can be generalized for other forms:
1553
1554 b = (b & ~0x80) | (a & 0x40) << 1;
1555
1556//===---------------------------------------------------------------------===//
Bill Wendlingc872e9c2009-10-27 23:30:07 +00001557
1558These two functions produce different code. They shouldn't:
1559
1560#include <stdint.h>
1561
1562uint8_t p1(uint8_t b, uint8_t a) {
1563 b = (b & ~0xc0) | (a & 0xc0);
1564 return (b);
1565}
1566
1567uint8_t p2(uint8_t b, uint8_t a) {
1568 b = (b & ~0x40) | (a & 0x40);
1569 b = (b & ~0x80) | (a & 0x80);
1570 return (b);
1571}
1572
1573define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
1574entry:
1575 %0 = and i8 %b, 63 ; <i8> [#uses=1]
1576 %1 = and i8 %a, -64 ; <i8> [#uses=1]
1577 %2 = or i8 %1, %0 ; <i8> [#uses=1]
1578 ret i8 %2
1579}
1580
1581define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
1582entry:
1583 %0 = and i8 %b, 63 ; <i8> [#uses=1]
1584 %.masked = and i8 %a, 64 ; <i8> [#uses=1]
1585 %1 = and i8 %a, -128 ; <i8> [#uses=1]
1586 %2 = or i8 %1, %0 ; <i8> [#uses=1]
1587 %3 = or i8 %2, %.masked ; <i8> [#uses=1]
1588 ret i8 %3
1589}
1590
1591//===---------------------------------------------------------------------===//
Chris Lattner6fdfc9c2009-11-11 17:51:27 +00001592
1593IPSCCP does not currently propagate argument-dependent constants through
1594functions where it does not see all of the callers. This includes functions
1595with normal external linkage as well as templates, C99 inline functions etc.
1596Specifically, it does nothing to:
1597
1598define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
1599entry:
1600 %0 = add nsw i32 %y, %z
1601 %1 = mul i32 %0, %x
1602 %2 = mul i32 %y, %z
1603 %3 = add nsw i32 %1, %2
1604 ret i32 %3
1605}
1606
1607define i32 @test2() nounwind {
1608entry:
1609 %0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
1610 ret i32 %0
1611}
1612
1613It would be interesting to extend IPSCCP to be able to handle simple cases like
1614this, where all of the arguments to a call are constant (here @test2 should
1615simplify to 'ret i32 14').  Because IPSCCP runs before inlining, trivial
templates and inline functions are not yet inlined.
1616The results for a function + set of constant arguments should be memoized in a
1617map.
1618
1619//===---------------------------------------------------------------------===//
Chris Lattnerfc926c22009-11-11 17:54:02 +00001620
1621The libcall constant folding stuff should be moved out of SimplifyLibcalls into
1622libanalysis' constantfolding logic. This would allow IPSCCP to be able to
1623handle simple things like this:
1624
1625static int foo(const char *X) { return strlen(X); }
1626int bar() { return foo("abcd"); }
1627
1628//===---------------------------------------------------------------------===//
Nick Lewycky93f9f7a2009-11-15 17:51:23 +00001629
Duncan Sandse10920d2010-01-06 15:37:47 +00001630functionattrs doesn't know much about memcpy/memset. This function should be
Duncan Sands7c422ac2010-01-06 08:45:52 +00001631marked readnone rather than readonly, since it only twiddles local memory, but
1632functionattrs doesn't handle memset/memcpy/memmove aggressively:
Chris Lattner89742c22009-12-03 07:43:46 +00001633
1634struct X { int *p; int *q; };
1635int foo() {
1636 int i = 0, j = 1;
1637 struct X x, y;
1638 int **p;
1639 y.p = &i;
1640 x.q = &j;
1641 p = __builtin_memcpy (&x, &y, sizeof (int *));
1642 return **p;
1643}
1644
Chris Lattner9c8fb9e2011-01-01 22:52:11 +00001645This can be seen at:
1646$ clang t.c -S -o - -mkernel -O0 -emit-llvm | opt -functionattrs -S
1647
1648
Chris Lattner05332172009-12-03 07:41:54 +00001649//===---------------------------------------------------------------------===//
1650
Eli Friedman9cfb3ad2010-01-18 22:36:59 +00001651Missed instcombine transformation:
1652define i1 @a(i32 %x) nounwind readnone {
1653entry:
1654 %cmp = icmp eq i32 %x, 30
1655 %sub = add i32 %x, -30
1656 %cmp2 = icmp ugt i32 %sub, 9
1657 %or = or i1 %cmp, %cmp2
1658 ret i1 %or
1659}
1660This should be optimized to a single compare. Testcase derived from gcc.
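
One possible folded form (hand-derived, not instcombine's verified output):

  %sub = add i32 %x, -31
  %or = icmp ugt i32 %sub, 8
  ret i1 %or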
1661
1662//===---------------------------------------------------------------------===//
1663
Eli Friedman9cfb3ad2010-01-18 22:36:59 +00001664Missed instcombine or reassociate transformation:
1665int a(int a, int b) { return (a==12)&(b>47)&(b<58); }
1666
1667The sgt and slt should be combined into a single comparison. Testcase derived
1668from gcc.
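
The combined form would be roughly this (hand-written, using the usual
unsigned range-check trick):

int a(int a, int b) { return (a==12) & ((unsigned)b - 48 < 10); }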
1669
1670//===---------------------------------------------------------------------===//
1671
1672Missed instcombine transformation:
Chris Lattner3e411062010-11-21 07:05:31 +00001673
1674 %382 = srem i32 %tmp14.i, 64 ; [#uses=1]
1675 %383 = zext i32 %382 to i64 ; [#uses=1]
1676 %384 = shl i64 %381, %383 ; [#uses=1]
1677 %385 = icmp slt i32 %tmp14.i, 64 ; [#uses=1]
1678
Benjamin Kramerc21a8212010-11-23 20:33:57 +00001679The srem can be transformed to an and because if %tmp14.i is negative, the
1680shift is undefined. Testcase derived from 403.gcc.
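
i.e. the snippet could become (sketch):

 %382 = and i32 %tmp14.i, 63 ; [#uses=1]
 %383 = zext i32 %382 to i64 ; [#uses=1]
 %384 = shl i64 %381, %383 ; [#uses=1]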
Chris Lattner3e411062010-11-21 07:05:31 +00001681
1682//===---------------------------------------------------------------------===//
1683
1684This is a range comparison on a divided result (from 403.gcc):
1685
1686 %1337 = sdiv i32 %1336, 8 ; [#uses=1]
1687 %.off.i208 = add i32 %1336, 7 ; [#uses=1]
1688 %1338 = icmp ult i32 %.off.i208, 15 ; [#uses=1]
1689
1690We already catch this (removing the sdiv) when there isn't an add; we should
1691handle the 'add' as well. This is a common idiom in its builtin_alloca code.
1692C testcase:
1693
1694int a(int x) { return (unsigned)(x/16+7) < 15; }
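
which (hand-derived; the boundary values are worth double-checking) should
fold to:

int a(int x) { return ((unsigned)x + 127) < 255; }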
1695
1696Another similar case involves truncations on 64-bit targets:
1697
1698 %361 = sdiv i64 %.046, 8 ; [#uses=1]
1699 %362 = trunc i64 %361 to i32 ; [#uses=2]
1700...
1701 %367 = icmp eq i32 %362, 0 ; [#uses=1]
1702
Eli Friedman1144d7e2010-01-31 04:55:32 +00001703//===---------------------------------------------------------------------===//
1704
1705Missed instcombine/dagcombine transformation:
1706define void @lshift_lt(i8 zeroext %a) nounwind {
1707entry:
1708 %conv = zext i8 %a to i32
1709 %shl = shl i32 %conv, 3
1710 %cmp = icmp ult i32 %shl, 33
1711 br i1 %cmp, label %if.then, label %if.end
1712
1713if.then:
1714 tail call void @bar() nounwind
1715 ret void
1716
1717if.end:
1718 ret void
1719}
1720declare void @bar() nounwind
1721
1722The shift should be eliminated. Testcase derived from gcc.
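
One form the folded compare could take (sketch):

 %cmp = icmp ult i8 %a, 5
 br i1 %cmp, label %if.then, label %if.end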
Eli Friedman9cfb3ad2010-01-18 22:36:59 +00001723
1724//===---------------------------------------------------------------------===//
Chris Lattnercf031f62010-02-09 00:11:10 +00001725
1726These compile into different code, one gets recognized as a switch and the
1727other doesn't due to phase ordering issues (PR6212):
1728
1729int test1(int mainType, int subType) {
1730 if (mainType == 7)
1731 subType = 4;
1732 else if (mainType == 9)
1733 subType = 6;
1734 else if (mainType == 11)
1735 subType = 9;
1736 return subType;
1737}
1738
1739int test2(int mainType, int subType) {
1740 if (mainType == 7)
1741 subType = 4;
1742 if (mainType == 9)
1743 subType = 6;
1744 if (mainType == 11)
1745 subType = 9;
1746 return subType;
1747}
1748
1749//===---------------------------------------------------------------------===//
Chris Lattner66636702010-03-10 21:42:42 +00001750
1751The following test case (from PR6576):
1752
1753define i32 @mul(i32 %a, i32 %b) nounwind readnone {
1754entry:
1755 %cond1 = icmp eq i32 %b, 0 ; <i1> [#uses=1]
1756 br i1 %cond1, label %exit, label %bb.nph
1757bb.nph: ; preds = %entry
1758 %tmp = mul i32 %b, %a ; <i32> [#uses=1]
1759 ret i32 %tmp
1760exit: ; preds = %entry
1761 ret i32 0
1762}
1763
1764could be reduced to:
1765
1766define i32 @mul(i32 %a, i32 %b) nounwind readnone {
1767entry:
1768 %tmp = mul i32 %b, %a
1769 ret i32 %tmp
1770}
1771
1772//===---------------------------------------------------------------------===//
1773
Chris Lattner94846892010-04-16 23:52:30 +00001774We should use DSE + llvm.lifetime.end to delete dead vtable pointer updates.
1775See GCC PR34949
1776
Chris Lattnerc2685a92010-05-21 23:16:21 +00001777Another interesting case is that something related could be used for variables
1778that go const after their ctor has finished. In these cases, globalopt (which
1779can statically run the constructor) could mark the global const (so it gets put
1780in the readonly section). A testcase would be:
1781
1782#include <complex>
1783using namespace std;
1784const complex<char> should_be_in_rodata (42,-42);
1785complex<char> should_be_in_data (42,-42);
1786complex<char> should_be_in_bss;
1787
1788Here we currently evaluate the ctors, but the globals don't become const because
1789the optimizer doesn't know they "become const" after the ctor is done. See
1790GCC PR4131 for more examples.
1791
Chris Lattner94846892010-04-16 23:52:30 +00001792//===---------------------------------------------------------------------===//
1793
Dan Gohman3a2a4842010-05-03 14:31:00 +00001794In this code:
1795
1796long foo(long x) {
1797 return x > 1 ? x : 1;
1798}
1799
1800LLVM emits a comparison with 1 instead of 0. 0 would be equivalent
1801and cheaper on most targets.
1802
1803LLVM prefers comparisons with zero over non-zero in general, but in this
1804case it chooses instead to keep the max operation obvious.
1805
1806//===---------------------------------------------------------------------===//
Eli Friedman8c47d3b2010-06-12 05:54:27 +00001807
1808Take the following testcase on x86-64 (similar testcases exist for all targets
1809with addc/adde):
1810
1811define void @a(i64* nocapture %s, i64* nocapture %t, i64 %a, i64 %b,
1812i64 %c) nounwind {
1813entry:
1814 %0 = zext i64 %a to i128 ; <i128> [#uses=1]
1815 %1 = zext i64 %b to i128 ; <i128> [#uses=1]
1816 %2 = add i128 %1, %0 ; <i128> [#uses=2]
1817 %3 = zext i64 %c to i128 ; <i128> [#uses=1]
1818 %4 = shl i128 %3, 64 ; <i128> [#uses=1]
1819 %5 = add i128 %4, %2 ; <i128> [#uses=1]
1820 %6 = lshr i128 %5, 64 ; <i128> [#uses=1]
1821 %7 = trunc i128 %6 to i64 ; <i64> [#uses=1]
1822 store i64 %7, i64* %s, align 8
1823 %8 = trunc i128 %2 to i64 ; <i64> [#uses=1]
1824 store i64 %8, i64* %t, align 8
1825 ret void
1826}
1827
1828Generated code:
1829 addq %rcx, %rdx
1830 movl $0, %eax
1831 adcq $0, %rax
1832 addq %r8, %rax
1833 movq %rax, (%rdi)
1834 movq %rdx, (%rsi)
1835 ret
1836
1837Expected code:
1838 addq %rcx, %rdx
1839 adcq $0, %r8
1840 movq %r8, (%rdi)
1841 movq %rdx, (%rsi)
1842 ret
1843
1844The generated SelectionDAG has an ADD of an ADDE, where both operands of the
1845ADDE are zero. Replacing one of the operands of the ADDE with the other operand
1846of the ADD, and replacing the ADD with the ADDE, should give the desired result.
1847
1848(That said, we are doing a lot better than gcc on this testcase. :) )
1849
1850//===---------------------------------------------------------------------===//
Eli Friedmanb4a74c12010-07-03 07:38:12 +00001851
1852Switch lowering generates less than ideal code for the following switch:
1853define void @a(i32 %x) nounwind {
1854entry:
1855 switch i32 %x, label %if.end [
1856 i32 0, label %if.then
1857 i32 1, label %if.then
1858 i32 2, label %if.then
1859 i32 3, label %if.then
1860 i32 5, label %if.then
1861 ]
1862if.then:
1863 tail call void @foo() nounwind
1864 ret void
1865if.end:
1866 ret void
1867}
1868declare void @foo()
1869
1870Generated code on x86-64 (other platforms give similar results):
1871a:
1872 cmpl $5, %edi
1873 ja .LBB0_2
1874 movl %edi, %eax
1875 movl $47, %ecx
1876 btq %rax, %rcx
1877 jb .LBB0_3
1878.LBB0_2:
1879 ret
1880.LBB0_3:
Eli Friedmanb4828292010-07-03 08:43:32 +00001881 jmp foo # TAILCALL
Eli Friedmanb4a74c12010-07-03 07:38:12 +00001882
1883The movl+movl+btq+jb could be simplified to a cmpl+jne.
1884
Eli Friedmanb4828292010-07-03 08:43:32 +00001885Or, if we wanted to be really clever, we could simplify the whole thing to
1886something like the following, which eliminates a branch:
1887 xorl $1, %edi
1888 cmpl $4, %edi
1889 ja .LBB0_2
1890 ret
1891.LBB0_2:
1892 jmp foo # TAILCALL
Nick Lewyckyb1e4eeb2010-08-08 07:04:25 +00001893//===---------------------------------------------------------------------===//
1894Given a branch where the two target blocks are identical ("ret i32 %b" in
1895both), simplifycfg will simplify them away. But not so for a switch statement:
Eli Friedmanb4828292010-07-03 08:43:32 +00001896
Nick Lewyckyb1e4eeb2010-08-08 07:04:25 +00001897define i32 @f(i32 %a, i32 %b) nounwind readnone {
1898entry:
1899 switch i32 %a, label %bb3 [
1900 i32 4, label %bb
1901 i32 6, label %bb
1902 ]
1903
1904bb: ; preds = %entry, %entry
1905 ret i32 %b
1906
1907bb3: ; preds = %entry
1908 ret i32 %b
1909}
Eli Friedmanb4a74c12010-07-03 07:38:12 +00001910//===---------------------------------------------------------------------===//
Chris Lattner274191f2010-11-09 19:37:28 +00001911
1912clang -O3 fails to devirtualize this virtual inheritance case (GCC PR45875);
Chris Lattner1e68fdb2010-11-11 17:17:56 +00001913it looks related to PR3100:
Chris Lattner274191f2010-11-09 19:37:28 +00001914
1915struct c1 {};
1916struct c10 : c1{
1917 virtual void foo ();
1918};
1919struct c11 : c10, c1{
1920 virtual void f6 ();
1921};
1922struct c28 : virtual c11{
1923 void f6 ();
1924};
1925void check_c28 () {
1926 c28 obj;
1927 c11 *ptr = &obj;
1928 ptr->f6 ();
1929}
1930
1931//===---------------------------------------------------------------------===//
Chris Lattneraf510f12010-11-11 18:23:57 +00001932
1933We compile this:
1934
1935int foo(int a) { return (a & (~15)) / 16; }
1936
1937Into:
1938
1939define i32 @foo(i32 %a) nounwind readnone ssp {
1940entry:
1941 %and = and i32 %a, -16
1942 %div = sdiv i32 %and, 16
1943 ret i32 %div
1944}
1945
1946but this pattern, (X & -A)/A, is X >> log2(A) when A is a power of 2, so this case
1947should be instcombined into just "a >> 4".
1948
1949We do get this at the codegen level, so something knows about it, but
1950instcombine should catch it earlier:
1951
1952_foo: ## @foo
1953## BB#0: ## %entry
1954 movl %edi, %eax
1955 sarl $4, %eax
1956 ret
1957
1958//===---------------------------------------------------------------------===//
1959
Chris Lattnera97c91f2010-12-13 00:15:25 +00001960This code (from GCC PR28685):
1961
1962int test(int a, int b) {
1963 int lt = a < b;
1964 int eq = a == b;
1965 if (lt)
1966 return 1;
1967 return eq;
1968}
1969
1970Is compiled to:
1971
1972define i32 @test(i32 %a, i32 %b) nounwind readnone ssp {
1973entry:
1974 %cmp = icmp slt i32 %a, %b
1975 br i1 %cmp, label %return, label %if.end
1976
1977if.end: ; preds = %entry
1978 %cmp5 = icmp eq i32 %a, %b
1979 %conv6 = zext i1 %cmp5 to i32
1980 ret i32 %conv6
1981
1982return: ; preds = %entry
1983 ret i32 1
1984}
1985
1986it could be:
1987
1988define i32 @test__(i32 %a, i32 %b) nounwind readnone ssp {
1989entry:
1990 %0 = icmp sle i32 %a, %b
1991 %retval = zext i1 %0 to i32
1992 ret i32 %retval
1993}
1994
1995//===---------------------------------------------------------------------===//
Duncan Sands124708d2011-01-01 20:08:02 +00001996
Benjamin Kramereaff66a2011-01-07 20:42:20 +00001997This code can be seen in viterbi:
1998
1999 %64 = call noalias i8* @malloc(i64 %62) nounwind
2000...
2001 %67 = call i64 @llvm.objectsize.i64(i8* %64, i1 false) nounwind
2002 %68 = call i8* @__memset_chk(i8* %64, i32 0, i64 %62, i64 %67) nounwind
2003
2004llvm.objectsize.i64 should be taught about malloc/calloc, allowing it to
2005fold to %62. This is a security win (overflows of malloc will get caught)
2006and also a performance win by exposing more memsets to the optimizer.
2007
2008This occurs several times in viterbi.
2009
2010Note that this would change the semantics of @llvm.objectsize which by its
2011current definition always folds to a constant. We also should make sure that
2012we remove checking in code like
2013
2014 char *p = malloc(strlen(s)+1);
2015 __strcpy_chk(p, s, __builtin_objectsize(p, 0));
2016
2017//===---------------------------------------------------------------------===//
2018
Chris Lattnerc1853e42011-01-06 07:09:23 +00002019This code (from Benchmarks/Dhrystone/dry.c):
2020
2021define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
2022entry:
2023 %sext = shl i32 %0, 24
2024 %conv = ashr i32 %sext, 24
2025 %sext6 = shl i32 %1, 24
2026 %conv4 = ashr i32 %sext6, 24
2027 %cmp = icmp eq i32 %conv, %conv4
2028 %. = select i1 %cmp, i32 10000, i32 0
2029 ret i32 %.
2030}
2031
2032Should be simplified into something like:
2033
2034define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
2035entry:
2036 %sext = shl i32 %0, 24
2037 %conv = and i32 %sext, 0xFF000000
2038 %sext6 = shl i32 %1, 24
2039 %conv4 = and i32 %sext6, 0xFF000000
2040 %cmp = icmp eq i32 %conv, %conv4
2041 %. = select i1 %cmp, i32 10000, i32 0
2042 ret i32 %.
2043}
2044
2045and then to:
2046
2047define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
2048entry:
2049 %conv = and i32 %0, 0xFF
2050 %conv4 = and i32 %1, 0xFF
2051 %cmp = icmp eq i32 %conv, %conv4
2052 %. = select i1 %cmp, i32 10000, i32 0
2053 ret i32 %.
2054}
2055//===---------------------------------------------------------------------===//
Chris Lattner15df0442011-01-01 22:57:31 +00002056
Benjamin Kramerfa366802011-01-06 17:35:50 +00002057clang -O3 currently compiles this code
2058
2059int g(unsigned int a) {
2060 unsigned int c[100];
2061 c[10] = a;
2062 c[11] = a;
2063 unsigned int b = c[10] + c[11];
2064 if(b > a*2) a = 4;
2065 else a = 8;
2066 return a + 7;
2067}
2068
2069into
2070
2071define i32 @g(i32 a) nounwind readnone {
2072 %add = shl i32 %a, 1
2073 %mul = shl i32 %a, 1
2074 %cmp = icmp ugt i32 %add, %mul
2075 %a.addr.0 = select i1 %cmp, i32 11, i32 15
2076 ret i32 %a.addr.0
2077}
2078
2079The icmp should fold to false. This CSE opportunity is only available
2080after GVN and InstCombine have run.
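
Once the icmp is known false, the select and the shifts fold away and the
whole function should reduce to (sketch; %a becomes unused):

define i32 @g(i32 %a) nounwind readnone {
  ret i32 15
}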
2081
2082//===---------------------------------------------------------------------===//
Chris Lattner01cdc202011-01-06 22:25:00 +00002083
2084memcpyopt should turn this:
2085
2086define i8* @test10(i32 %x) {
2087 %alloc = call noalias i8* @malloc(i32 %x) nounwind
2088 call void @llvm.memset.p0i8.i32(i8* %alloc, i8 0, i32 %x, i32 1, i1 false)
2089 ret i8* %alloc
2090}
2091
2092into a call to calloc. We should make sure that we analyze calloc as
2093aggressively as malloc though.
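
i.e. roughly (sketch; the exact calloc prototype depends on the target's
size_t):

define i8* @test10(i32 %x) {
  %alloc = call noalias i8* @calloc(i32 1, i32 %x) nounwind
  ret i8* %alloc
}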
2094
2095//===---------------------------------------------------------------------===//
Chandler Carruth75fbd372011-01-09 01:32:55 +00002096
Chris Lattner4a6fb942011-01-10 21:01:17 +00002097clang -O3 doesn't optimize this:
Chandler Carruth75fbd372011-01-09 01:32:55 +00002098
2099void f1(int* begin, int* end) {
2100 std::fill(begin, end, 0);
2101}
2102
Chris Lattner66d7a572011-01-09 23:48:41 +00002103into a memset. This is PR8942.
Chandler Carruth75fbd372011-01-09 01:32:55 +00002104
2105//===---------------------------------------------------------------------===//
Chandler Carruthd8723a92011-01-09 09:58:33 +00002106
2107clang -O3 -fno-exceptions currently compiles this code:
2108
2109void f(int N) {
2110 std::vector<int> v(N);
Chandler Carruth27a2a132011-01-09 10:10:59 +00002111
2112 extern void sink(void*); sink(&v);
Chandler Carruthd8723a92011-01-09 09:58:33 +00002113}
2114
2115into
2116
2117define void @_Z1fi(i32 %N) nounwind {
2118entry:
2119 %v2 = alloca [3 x i32*], align 8
2120 %v2.sub = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 0
2121 %tmpcast = bitcast [3 x i32*]* %v2 to %"class.std::vector"*
2122 %conv = sext i32 %N to i64
2123 store i32* null, i32** %v2.sub, align 8, !tbaa !0
2124 %tmp3.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 1
2125 store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
2126 %tmp4.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 2
2127 store i32* null, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
2128 %cmp.i.i.i.i = icmp eq i32 %N, 0
2129 br i1 %cmp.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i, label %cond.true.i.i.i.i
2130
2131_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i: ; preds = %entry
2132 store i32* null, i32** %v2.sub, align 8, !tbaa !0
2133 store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
2134 %add.ptr.i5.i.i = getelementptr inbounds i32* null, i64 %conv
2135 store i32* %add.ptr.i5.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
2136 br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit
2137
2138cond.true.i.i.i.i: ; preds = %entry
2139 %cmp.i.i.i.i.i = icmp slt i32 %N, 0
2140 br i1 %cmp.i.i.i.i.i, label %if.then.i.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i
2141
2142if.then.i.i.i.i.i: ; preds = %cond.true.i.i.i.i
2143 call void @_ZSt17__throw_bad_allocv() noreturn nounwind
2144 unreachable
2145
2146_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i: ; preds = %cond.true.i.i.i.i
2147 %mul.i.i.i.i.i = shl i64 %conv, 2
2148 %call3.i.i.i.i.i = call noalias i8* @_Znwm(i64 %mul.i.i.i.i.i) nounwind
2149 %0 = bitcast i8* %call3.i.i.i.i.i to i32*
2150 store i32* %0, i32** %v2.sub, align 8, !tbaa !0
2151 store i32* %0, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
2152 %add.ptr.i.i.i = getelementptr inbounds i32* %0, i64 %conv
2153 store i32* %add.ptr.i.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
2154 call void @llvm.memset.p0i8.i64(i8* %call3.i.i.i.i.i, i8 0, i64 %mul.i.i.i.i.i, i32 4, i1 false)
2155 br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit
2156
2157This is just the code handling the construction of the vector. Most surprising here
2158is the fact that all three null stores in %entry are dead, but not eliminated.
2159Also surprising is that %conv isn't simplified to 0 in %....exit.thread.i.i.
2160
2161//===---------------------------------------------------------------------===//
Chandler Carruthe5ca4942011-01-09 09:58:36 +00002162
2163clang -O3 -fno-exceptions currently compiles this code:
2164
2165void f(int N) {
2166 std::vector<int> v(N);
Chandler Carruth27a2a132011-01-09 10:10:59 +00002167 for (int k = 0; k < N; ++k)
2168 v[k] = 0;
2169
2170 extern void sink(void*); sink(&v);
Chandler Carruthe5ca4942011-01-09 09:58:36 +00002171}
2172
2173into almost the same as the previous note, but replace its final BB with:
2174
2175for.body.lr.ph: ; preds = %cond.true.i.i.i.i
2176 %mul.i.i.i.i.i = shl i64 %conv, 2
2177 %call3.i.i.i.i.i = call noalias i8* @_Znwm(i64 %mul.i.i.i.i.i) nounwind
2178 %0 = bitcast i8* %call3.i.i.i.i.i to i32*
2179 store i32* %0, i32** %v8.sub, align 8, !tbaa !0
2180 %add.ptr.i.i.i = getelementptr inbounds i32* %0, i64 %conv
2181 store i32* %add.ptr.i.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
2182 call void @llvm.memset.p0i8.i64(i8* %call3.i.i.i.i.i, i8 0, i64 %mul.i.i.i.i.i, i32 4, i1 false)
2183 store i32* %add.ptr.i.i.i, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
2184 %tmp18 = add i32 %N, -1
2185 %tmp19 = zext i32 %tmp18 to i64
2186 %tmp20 = shl i64 %tmp19, 2
2187 %tmp21 = add i64 %tmp20, 4
2188 call void @llvm.memset.p0i8.i64(i8* %call3.i.i.i.i.i, i8 0, i64 %tmp21, i32 4, i1 false)
2189 br label %for.end
2190
2191First off, why (((zext %N - 1) << 2) + 4) instead of the ((sext %N) << 2) done
2192previously? (or better yet, re-use that one?)
2193
2194Then, the really painful one is the second memset, of the same memory, to the
2195same value.
2196
2197//===---------------------------------------------------------------------===//
Chandler Carruth694d7532011-01-09 11:29:57 +00002198
2199clang -O3 -fno-exceptions currently compiles this code:
2200
2201struct S {
2202 unsigned short m1, m2;
2203 unsigned char m3, m4;
2204};
2205
2206void f(int N) {
2207 std::vector<S> v(N);
2208 extern void sink(void*); sink(&v);
2209}
2210
2211into poor code for zero-initializing 'v' when N is >0. The problem is that
2212S is only 6 bytes, but each element is 8 byte-aligned. We generate a loop and
22134 stores on each iteration. If the struct were 8 bytes, this gets turned into
2214a memset.
2215
2216//===---------------------------------------------------------------------===//
Chandler Carruth96b1b6c2011-01-09 21:00:19 +00002217
2218clang -O3 currently compiles this code:
2219
2220extern const int magic;
2221double f() { return 0.0 * magic; }
2222
2223into
2224
2225@magic = external constant i32
2226
2227define double @_Z1fv() nounwind readnone {
2228entry:
2229 %tmp = load i32* @magic, align 4, !tbaa !0
2230 %conv = sitofp i32 %tmp to double
2231 %mul = fmul double %conv, 0.000000e+00
2232 ret double %mul
2233}
2234
Chris Lattner00a35d02011-01-10 00:33:01 +00002235We should be able to fold away this fmul to 0.0. More generally, fmul(x,0.0)
2236can be folded to 0.0 if we can prove that the LHS is not -0.0, not a NaN, and
2237not an INF. The CannotBeNegativeZero predicate in value tracking should be
2238extended to support general "fpclassify" operations that can return
2239yes/no/unknown for each of these predicates.
2240
Chris Lattner4a6fb942011-01-10 21:01:17 +00002241In this predicate, we know that uitofp is trivially never NaN or -0.0, and
Chris Lattner00a35d02011-01-10 00:33:01 +00002242we know that it isn't +/-Inf if the floating point type has enough exponent bits
2243to represent the largest integer value as < inf.
Chandler Carruth96b1b6c2011-01-09 21:00:19 +00002244
2245//===---------------------------------------------------------------------===//
Chandler Carruthfb00e272011-01-09 22:36:18 +00002246
Chris Lattner4a6fb942011-01-10 21:01:17 +00002247When optimizing a transformation that can change the sign of 0.0 (such as the
22480.0*val -> 0.0 transformation above), it might be provable that the sign of the
2249expression doesn't matter. For example, by the above rules, we can't transform
2250fmul(sitofp(x), 0.0) into 0.0, because x might be -1 and the result of the
2251expression is defined to be -0.0.
2252
2253If we look at the uses of the fmul for example, we might be able to prove that
2254all uses don't care about the sign of zero. For example, if we have:
2255
2256 fadd(fmul(sitofp(x), 0.0), 2.0)
2257
2258Since we know that x+2.0 doesn't care about the sign of any zeros in X, we can
2259transform the fmul to 0.0, and then the fadd to 2.0.
2260
2261//===---------------------------------------------------------------------===//