Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

We should recognize various "overflow detection" idioms and translate them into
llvm.uadd.with.overflow and similar intrinsics.  Here is a multiply idiom:

unsigned int mul(unsigned int a,unsigned int b) {
  if ((unsigned long long)a*b>0xffffffff)
    exit(0);
  return a*b;
}

The legalization code for mul-with-overflow needs to be made more robust before
this can be implemented though.
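
For comparison, here is a sketch of the unsigned add idiom that would map onto
llvm.uadd.with.overflow, next to the widening multiply check written as a
predicate.  The helper names (add_overflows, mul_overflows, demo_overflow) are
made up for illustration and do not exist in LLVM; fixed-width types are used
so the 32-bit assumption is explicit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: an unsigned sum wrapped around if and only if the
   result is smaller than one of the operands. */
int add_overflows(uint32_t a, uint32_t b) {
  uint32_t sum = a + b;   /* modular arithmetic, well defined for unsigned */
  return sum < a;
}

/* Hypothetical helper: the widening multiply check from the idiom above. */
int mul_overflows(uint32_t a, uint32_t b) {
  return (uint64_t)a * b > 0xffffffffu;
}

/* Small usage example. */
void demo_overflow(void) {
  assert(add_overflows(0xffffffffu, 1u));
  assert(!add_overflows(1u, 2u));
  assert(mul_overflows(0x10000u, 0x10000u));
  assert(!mul_overflows(0xffffu, 0x10001u));
}
```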

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)  This
isn't safe in general, even on darwin.  See the libm implementation of hypot
for examples (which special case when x/y are exactly zero to get signed zeros
etc right).

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
   x = 1ULL << i;
}

into:
 long long tmp = 1;
 for (i = ...; ++i, tmp+=tmp)
   x = tmp;

This would be a win on ppc32, but not x86 or ppc64.
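
A runnable sketch of the two loop forms, assuming the loop stores each value
into an array so the results can be compared.  The function names are
hypothetical, and an unsigned 64-bit type is used instead of long long so the
final doubling wraps instead of overflowing a signed value.

```c
#include <assert.h>
#include <stdint.h>

/* Direct form: one 64-bit variable shift per iteration. */
void fill_shift(uint64_t *out, unsigned n) {
  for (unsigned i = 0; i < n; ++i)
    out[i] = 1ULL << i;
}

/* Strength-reduced form: the shift is replaced by a running value that
   doubles on every iteration. */
void fill_doubling(uint64_t *out, unsigned n) {
  uint64_t tmp = 1;
  for (unsigned i = 0; i < n; ++i, tmp += tmp)
    out[i] = tmp;
}

/* Small usage example: both forms agree for every valid shift amount. */
void demo_shift_lsr(void) {
  uint64_t a[64], b[64];
  fill_shift(a, 64);
  fill_doubling(b, 64);
  for (unsigned i = 0; i < 64; ++i)
    assert(a[i] == b[i]);
}
```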

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

First, the intrinsic needs to be extended to support integers, and second the
code generator needs to be enhanced to lower these to multiplication trees.
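
To make "balanced multiplication tree" concrete, here is a sketch for the
eighth power.  The names are illustrative, and unsigned arithmetic is used so
that large inputs wrap instead of hitting signed overflow.

```c
#include <assert.h>

/* Linear chain: seven dependent multiplies. */
unsigned pow8_chain(unsigned x) {
  return x * x * x * x * x * x * x * x;
}

/* Balanced tree by repeated squaring: three multiplies. */
unsigned pow8_tree(unsigned x) {
  unsigned x2 = x * x;
  unsigned x4 = x2 * x2;
  return x4 * x4;
}

/* Small usage example. */
void demo_pow8(void) {
  assert(pow8_chain(3u) == 6561u);
  assert(pow8_tree(3u) == 6561u);
}
```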

//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

This is blocked on not handling X*X*X -> powi(X, 3) (see note above).  The issue
is that we end up getting t = 2*X and s = t*t, and don't turn this into 4*X*X,
which is the same number of multiplies and is canonical, because the 2*X has
multiple uses.  Here's a simple example:

define i32 @test15(i32 %X1) {
  %B = mul i32 %X1, 47   ; X1*47
  %C = mul i32 %B, %B
  ret i32 %C
}

//===---------------------------------------------------------------------===//

Reassociate should handle the example in GCC PR16157:

extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;
void f () {  /* this can be optimized to four additions... */
        b4 = a4 + a3 + a2 + a1 + a0;
        b3 = a3 + a2 + a1 + a0;
        b2 = a2 + a1 + a0;
        b1 = a1 + a0;
}

This requires reassociating to forms of expressions that are already available,
something that reassoc doesn't think about yet.
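
The four-addition form that the comment alludes to could look like the sketch
below.  The globals are defined locally (rather than declared extern) only to
keep the example self-contained, and the function names are hypothetical.

```c
#include <assert.h>

int a0, a1, a2, a3, a4;
int b1, b2, b3, b4;

/* Direct form from the PR: ten additions. */
void f_direct(void) {
  b4 = a4 + a3 + a2 + a1 + a0;
  b3 = a3 + a2 + a1 + a0;
  b2 = a2 + a1 + a0;
  b1 = a1 + a0;
}

/* Reassociated form: each sum reuses the previous one, so four additions
   are enough. */
void f_reassociated(void) {
  b1 = a1 + a0;
  b2 = a2 + b1;
  b3 = a3 + b2;
  b4 = a4 + b3;
}

/* Small usage example. */
void demo_pr16157(void) {
  a0 = 1; a1 = 2; a2 = 3; a3 = 4; a4 = 5;
  f_direct();
  assert(b1 == 3 && b2 == 6 && b3 == 10 && b4 == 15);
  b1 = b2 = b3 = b4 = 0;
  f_reassociated();
  assert(b1 == 3 && b2 == 6 && b3 == 10 && b4 == 15);
}
```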

//===---------------------------------------------------------------------===//

This function: (derived from GCC PR19988)
double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x + -0.1234 * y));
}

compiles to:
_foo:
        movapd  %xmm1, %xmm2
        mulsd   LCPI1_1(%rip), %xmm1
        mulsd   LCPI1_0(%rip), %xmm2
        addsd   %xmm0, %xmm1
        addsd   %xmm0, %xmm2
        movapd  %xmm1, %xmm0
        mulsd   %xmm2, %xmm0
        ret

Reassociate should be able to turn it into:

double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x - 0.1234 * y));
}

Which allows the multiply by constant to be CSE'd, producing:

_foo:
        mulsd   LCPI1_0(%rip), %xmm1
        movapd  %xmm1, %xmm2
        addsd   %xmm0, %xmm2
        subsd   %xmm1, %xmm0
        mulsd   %xmm2, %xmm0
        ret

This doesn't need -ffast-math support at all.  This is particularly bad because
the llvm-gcc frontend is canonicalizing the latter into the former, but clang
doesn't have this problem.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j,int *l)  {  return memcmp(j,l,4);  }
int h(int *j, int *l) {  return *j - *l; }

this could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works but can be overly conservative, as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2      #cond_next
LBB16_1:        #cond_true
        incl _foo
LBB16_2:        #cond_next

emit:
        movl    _foo, %eax
        cmpl    $1, %edi
        sbbl    $-1, %eax
        movl    %eax, _foo
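
In C terms, the cmp/sbb sequence can be read as the sketch below.  The function
names are hypothetical; the "carry" variable models the x86 carry flag after
"cmpl $1, cond", which is set exactly when cond is zero, and "sbbl $-1"
then computes counter + 1 - carry.

```c
#include <assert.h>

/* Branchy form: increment only when the condition value is nonzero. */
int incr_branch(int counter, unsigned cond) {
  if (cond != 0)
    counter++;
  return counter;
}

/* Branchless form mirroring cmp/sbb. */
int incr_branchless(int counter, unsigned cond) {
  unsigned carry = cond < 1u;        /* borrow out of cond - 1 */
  return counter + 1 - (int)carry;   /* sbb $-1: counter - (-1) - carry */
}

/* Small usage example. */
void demo_cond_incr(void) {
  assert(incr_branchless(5, 0u) == 5);
  assert(incr_branchless(5, 7u) == 6);
}
```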

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs.  We could even make an intrinsic for this
if anyone cared enough about sincos.

//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
        {
          /* Flip the target bit of each basis state */
          reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }

Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be just
so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.
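
A sketch of the same split with the two 32-bit halves made explicit, which is
roughly how a 32-bit target would see the data.  The separate lo/hi arrays and
the function names are assumptions for illustration only (libquantum stores a
single 64-bit field), and target is assumed to be below 64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: full 64-bit XOR on every element. */
void flip_full(uint64_t *state, size_t n, unsigned target) {
  for (size_t i = 0; i < n; i++)
    state[i] ^= (uint64_t)1 << target;
}

/* Split form: choose the affected 32-bit half once, outside the loop, so
   each iteration performs a single 32-bit XOR. */
void flip_split(uint32_t *lo, uint32_t *hi, size_t n, unsigned target) {
  if (target < 32) {
    uint32_t mask = (uint32_t)1 << target;
    for (size_t i = 0; i < n; i++)
      lo[i] ^= mask;
  } else {
    uint32_t mask = (uint32_t)1 << (target - 32);
    for (size_t i = 0; i < n; i++)
      hi[i] ^= mask;
  }
}

/* Small usage example: both forms agree for a low and a high target bit. */
void demo_flip(void) {
  uint64_t s[2] = {0, 0x123456789ABCDEF0ULL};
  uint32_t lo[2], hi[2];
  for (int i = 0; i < 2; i++) {
    lo[i] = (uint32_t)s[i];
    hi[i] = (uint32_t)(s[i] >> 32);
  }
  flip_full(s, 2, 40); flip_split(lo, hi, 2, 40);
  flip_full(s, 2, 3);  flip_split(lo, hi, 2, 3);
  for (int i = 0; i < 2; i++)
    assert(s[i] == (((uint64_t)hi[i] << 32) | lo[i]));
}
```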

It would also be nice to recognize the reg->size doesn't alias reg->node[i], but
this requires TBAA.

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}
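
The claim is easy to check against a plain byte swap.  The sketch below
restates the idiom with a fixed 32-bit type; the names reverse_idiom and
bswap_plain are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* The idiom from the text, restated with a fixed-width type. */
uint32_t reverse_idiom(uint32_t v) {
  uint32_t t = v ^ ((v << 16) | (v >> 16));
  t &= ~(uint32_t)0xff0000;
  v = (v << 24) | (v >> 8);
  return v ^ (t >> 8);
}

/* Straightforward byte swap for comparison. */
uint32_t bswap_plain(uint32_t v) {
  return (v << 24) | ((v & 0xff00u) << 8) | ((v >> 8) & 0xff00u) | (v >> 24);
}

/* Small usage example. */
void demo_bswap(void) {
  assert(reverse_idiom(0x11223344u) == 0x44332211u);
  assert(bswap_plain(0x11223344u) == 0x44332211u);
}
```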

//===---------------------------------------------------------------------===//

[LOOP DELETION]

We don't delete this output-free loop, because trip count analysis doesn't
realize that it is finite (if it were infinite, it would be undefined).  Not
having this blocks Loop Idiom from matching strlen and friends.

void foo(char *C) {
  int x = 0;
  while (*C)
    ++x,++C;
}

//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

typedef unsigned long long BITBOARD;
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i =  0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

This should be recognized as CLZ:  rdar://8459039

unsigned clz_a(unsigned a) {
  int i;
  for (i=0;i<32;i++)
    if (a & (1<<(31-i)))
      return i;
  return 32;
}

This sort of thing should be added to the loop idiom pass.
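
As a rough picture of what the recognized forms compute, the sketch below
compares two of the loops with the GCC/Clang builtins __builtin_popcount and
__builtin_clz.  It assumes a GCC-compatible compiler and a 32-bit unsigned
int; the *_builtin wrappers are hypothetical names, not part of any pass.

```c
#include <assert.h>

/* Loop idioms from the text. */
unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}

unsigned clz_a(unsigned a) {
  int i;
  for (i = 0; i < 32; i++)
    if (a & (1u << (31 - i)))
      return i;
  return 32;
}

/* Builtin-based equivalents (compiler extension, see assumptions above). */
unsigned countbits_builtin(unsigned v) {
  return (unsigned)__builtin_popcount(v);
}

unsigned clz_builtin(unsigned a) {
  /* __builtin_clz is undefined for 0, so keep the loop's answer of 32. */
  return a ? (unsigned)__builtin_clz(a) : 32u;
}

/* Small usage example. */
void demo_idioms(void) {
  assert(countbits_slow(0xf0u) == 4);
  assert(clz_a(1u) == 31);
  assert(clz_builtin(0u) == 32);
}
```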

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X / C1 ), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match. See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.
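
One instance of this kind of fold, sketched in C for the all-unsigned case:
comparing a quotient against a constant becomes a range check on the dividend.
The constants 10 and 5 and the function names are arbitrary illustrations; the
rewrite relies on C1 * C2 not overflowing.

```c
#include <assert.h>

/* Division form: (x / 10) < 5. */
int div_cmp(unsigned x) {
  return x / 10u < 5u;
}

/* Range form: no division is needed, because x / 10 < 5 holds exactly
   when x < 10 * 5. */
int range_cmp(unsigned x) {
  return x < 50u;
}

/* Small usage example: the boundary sits between 49 and 50. */
void demo_div_cmp(void) {
  assert(div_cmp(49u) && range_cmp(49u));
  assert(!div_cmp(50u) && !range_cmp(50u));
}
```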

//===---------------------------------------------------------------------===//

[LOOP OPTIMIZATION]

SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization
opportunities in its double_array_divs_variable function: it needs loop
interchange, memory promotion (which LICM already does), vectorization and
variable trip count loop unrolling (since it has a constant trip count). ICC
apparently produces this very nice code with -ffast-math:

..B1.70:                        # Preds ..B1.70 ..B1.69
        mulpd     %xmm0, %xmm1                                  #108.2
        mulpd     %xmm0, %xmm1                                  #108.2
        mulpd     %xmm0, %xmm1                                  #108.2
        mulpd     %xmm0, %xmm1                                  #108.2
        addl      $8, %edx                                      #
        cmpl      $131072, %edx                                 #108.2
        jb        ..B1.70       # Prob 99%                      #108.2

It would be better to count down to zero, but this is a lot better than what we
do.

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

PHI Slicing could be extended to do this.

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has from TargetData.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two near identical IV's (after promotion) on PPC/ARM:

LBB1_2:
        ldr r3, LCPI1_0
        ldr r3, [r3]
        strh r2, [r3]
        ldr r3, LCPI1_1
        ldr r3, [r3]
        strh r1, [r3]
        add r1, r1, #4
        add r2, r2, #1   <- [0,+,1]
        sub r0, r0, #1   <- [0,-,1]
        cmp r0, #0
        bne LBB1_2

LSR should reuse the "+" IV for the exit test.

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
        %tmp.1 = and i32 %a, 1          ; <i32> [#uses=1]
        %tmp.2 = icmp ne i32 %tmp.1, 0          ; <i1> [#uses=1]
        br i1 %tmp.2, label %then.0, label %else.0

then.0:         ; preds = %entry
        %tmp.5 = add i32 %a, -1         ; <i32> [#uses=1]
        %tmp.3 = call i32 @t4( i32 %tmp.5 )             ; <i32> [#uses=1]
        br label %return

else.0:         ; preds = %entry
        %tmp.7 = icmp ne i32 %a, 0              ; <i1> [#uses=1]
        br i1 %tmp.7, label %then.1, label %return

then.1:         ; preds = %else.0
        %tmp.11 = add i32 %a, -2                ; <i32> [#uses=1]
        %tmp.9 = call i32 @t4( i32 %tmp.11 )            ; <i32> [#uses=1]
        br label %return

return:         ; preds = %then.1, %else.0, %then.0
        %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                            [ %tmp.9, %then.1 ]
        ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
 if (n == 0)
   return 0;
 return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative.  "return foo() << 1" can be tail recursion eliminated.
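
One way to picture the result is an accumulator-based loop.  This is only a
sketch of the idea, not what the pass emits today; pow2m1_iter and its
variables are hypothetical, and n is assumed to be non-negative and small
enough that the result fits in an int.

```c
#include <assert.h>

/* Recursive form from the text. */
int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1(n - 1) + 1;
}

/* Iterative form with a multiplicative and an additive accumulator.
   Invariant: the final answer equals mul * pow2m1(n) + add. */
int pow2m1_iter(int n) {
  int mul = 1, add = 0;
  while (n != 0) {
    add += mul;   /* fold in the "+ 1", scaled by the pending multiplies */
    mul *= 2;     /* fold in the "2 *" */
    n--;
  }
  return add;     /* pow2m1(0) is 0, so mul * 0 + add */
}

/* Small usage example. */
void demo_pow2m1(void) {
  assert(pow2m1(5) == 31);
  assert(pow2m1_iter(5) == 31);
}
```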

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
        %tmp = load i32* %x             ; <i32> [#uses=0]
        %tmp.foo = call i32 @foo( i32* %x )             ; <i32> [#uses=1]
        ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
        %tmp3 = call i32 @foo( i32* %x )                ; <i32> [#uses=1]
        ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
        subl    $28, %esp
        call    "L1$pb"
"L1$pb":
        popl    %eax
        cmpl    $0, 32(%esp)
        je      LBB1_2  # cond_true
LBB1_1: # return
        # ...
        addl    $28, %esp
        ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations.  On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
  for (i = 0; i < 256; i++)
    a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
  for (i = 0; i < 256; i++)
    a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar.  On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.
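
That the two loops do the same thing to the stored bits can be checked with a
sketch like the following.  It assumes IEEE-754 binary64 doubles; memcpy is
used to reinterpret the bits without aliasing problems, and the function names
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Floating-point form: load, fneg, store. */
void negate_fp(double *a, int n) {
  for (int i = 0; i < n; i++)
    a[i] = -a[i];
}

/* Integer form: flip the sign bit of each element in place. */
void negate_bits(double *a, int n) {
  for (int i = 0; i < n; i++) {
    uint64_t bits;
    memcpy(&bits, &a[i], sizeof bits);
    bits ^= 1ULL << 63;
    memcpy(&a[i], &bits, sizeof bits);
  }
}

/* Small usage example: both forms leave identical bytes behind. */
void demo_negate(void) {
  double x[3] = {1.5, -2.0, 0.0};
  double y[3] = {1.5, -2.0, 0.0};
  negate_fp(x, 3);
  negate_bits(y, 3);
  assert(memcmp(x, y, sizeof x) == 0);
}
```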

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable.  For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-m64 -O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:                         ## @_Z9GetHotKeyv
        movq    _m_HotKey@GOTPCREL(%rip), %rax
        movzwl  (%rax), %ecx
        movzbl  2(%rax), %edx
        shlq    $16, %rdx
        orq     %rcx, %rdx
        movzbl  3(%rax), %ecx
        shlq    $24, %rcx
        orq     %rdx, %rcx
        movzbl  4(%rax), %eax
        shlq    $32, %rax
        orq     %rcx, %rax
        ret

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,0,1,0,1,0,1,0};
  foo(input);
}

Clang compiles this into:

  call void @llvm.memset.p0i8.i64(i8* %tmp, i8 0, i64 64, i32 16, i1 false)
  %0 = getelementptr [8 x i64]* %input, i64 0, i64 0
  store i64 1, i64* %0, align 16
  %1 = getelementptr [8 x i64]* %input, i64 0, i64 2
  store i64 1, i64* %1, align 16
  %2 = getelementptr [8 x i64]* %input, i64 0, i64 4
  store i64 1, i64* %2, align 16
  %3 = getelementptr [8 x i64]* %input, i64 0, i64 6
  store i64 1, i64* %3, align 16

Which gets codegen'd into:

        pxor    %xmm0, %xmm0
        movaps  %xmm0, -16(%rbp)
        movaps  %xmm0, -32(%rbp)
        movaps  %xmm0, -48(%rbp)
        movaps  %xmm0, -64(%rbp)
        movq    $1, -64(%rbp)
        movq    $1, -48(%rbp)
        movq    $1, -32(%rbp)
        movq    $1, -16(%rbp)

It would be better to have 4 movq's of 0 instead of the movaps's.

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef". Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop.  One trivial example is:

#include <stdio.h>
int main() {
    int nRet = 17;
    int nLoop;
    for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
        if ( nLoop & 1 )
            nRet += 2;
        else
            nRet -= 1;
    }
    return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size.  The resultant code would then also be suitable for
exit value computation.
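
Written out by hand, the unrolled loop could look like the sketch below.  The
function names are illustrative.  Each pair of iterations nets +1, so after
500 pairs the value is 17 + 500 = 517, which is the kind of closed form that
exit value computation could then derive.

```c
#include <assert.h>

/* Original loop from the text, wrapped in a function. */
int loop_original(void) {
  int nRet = 17;
  for (int nLoop = 0; nLoop < 1000; nLoop++) {
    if (nLoop & 1)
      nRet += 2;
    else
      nRet -= 1;
  }
  return nRet;
}

/* Unrolled by 2: the first copy always sees an even index and the second
   an odd one, so both '& 1' tests fold away. */
int loop_unrolled(void) {
  int nRet = 17;
  for (int nLoop = 0; nLoop < 1000; nLoop += 2) {
    nRet -= 1;   /* even iteration */
    nRet += 2;   /* odd iteration */
  }
  return nRet;
}

/* Small usage example. */
void demo_unroll(void) {
  assert(loop_original() == loop_unrolled());
}
```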
				661
				662	//===---------------------------------------------------------------------===//
Chris Lattner	349155b	2008-03-17 01:47:51 +0000	[diff] [blame]	663
				664	We miss a bunch of rotate opportunities on various targets, including ppc, x86,
				665	etc. On X86, we miss a bunch of 'rotate by variable' cases because the rotate
				666	matching code in dag combine doesn't look through truncates aggressively
				667	enough. Here are some testcases reduces from GCC PR17886:
				668
Chris Lattner	349155b	2008-03-17 01:47:51 +0000	[diff] [blame]	669	unsigned long long f5(unsigned long long x, unsigned long long y) {
				670	return (x << 8) | ((y >> 48) & 0xffull);
				671	}
				672	unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
				673	switch(z) {
				674	case 1:
				675	return (x << 8) | ((y >> 48) & 0xffull);
				676	case 2:
				677	return (x << 16) | ((y >> 40) & 0xffffull);
				678	case 3:
				679	return (x << 24) | ((y >> 32) & 0xffffffull);
				680	case 4:
				681	return (x << 32) | ((y >> 24) & 0xffffffffull);
				682	default:
				683	return (x << 40) | ((y >> 16) & 0xffffffffffull);
				684	}
				685	}
				686
Chris Lattner	349155b	2008-03-17 01:47:51 +0000	[diff] [blame]	687	//===---------------------------------------------------------------------===//
Chris Lattner	f70107f	2008-03-20 04:46:13 +0000	[diff] [blame]	688
Chris Lattner	ef17f08	2010-12-15 07:10:43 +0000	[diff] [blame]	689	This (and similar related idioms):
				690
				691	unsigned int foo(unsigned char i) {
				692	return i | (i<<8) | (i<<16) | (i<<24);
				693	}
				694
				695	compiles into:
				696
				697	define i32 @foo(i8 zeroext %i) nounwind readnone ssp noredzone {
				698	entry:
				699	%conv = zext i8 %i to i32
				700	%shl = shl i32 %conv, 8
				701	%shl5 = shl i32 %conv, 16
				702	%shl9 = shl i32 %conv, 24
				703	%or = or i32 %shl9, %conv
				704	%or6 = or i32 %or, %shl5
				705	%or10 = or i32 %or6, %shl
				706	ret i32 %or10
				707	}
				708
				709	it would be better as:
				710
				711	unsigned int bar(unsigned char i) {
				712	unsigned int j=i | (i << 8);
				713	return j | (j<<16);
				714	}
				715
				716	aka:
				717
				718	define i32 @bar(i8 zeroext %i) nounwind readnone ssp noredzone {
				719	entry:
				720	%conv = zext i8 %i to i32
				721	%shl = shl i32 %conv, 8
				722	%or = or i32 %shl, %conv
				723	%shl5 = shl i32 %or, 16
				724	%or6 = or i32 %shl5, %or
				725	ret i32 %or6
				726	}
				727
				728	or even i*0x01010101, depending on the speed of the multiplier. The best way to
				729	handle this is to canonicalize it to a multiply in IR and have codegen handle
				730	lowering multiplies to shifts on cpus where shifts are faster.
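A small check, assuming 32-bit words, that the three forms discussed above
agree for every input byte. The function names are made up for illustration,
and the casts to uint32_t are added so the shifts are done in unsigned
arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Four independent shifts, as in foo. */
uint32_t splat_shifts(unsigned char i) {
  uint32_t v = i;
  return v | (v << 8) | (v << 16) | (v << 24);
}

/* Two dependent steps, as in bar. */
uint32_t splat_two_steps(unsigned char i) {
  uint32_t j = (uint32_t)i | ((uint32_t)i << 8);
  return j | (j << 16);
}

/* The multiply form. */
uint32_t splat_multiply(unsigned char i) {
  return (uint32_t)i * UINT32_C(0x01010101);
}

/* Exhaustive over all 256 byte values. */
void check_splat(void) {
  for (unsigned v = 0; v < 256; v++) {
    unsigned char c = (unsigned char)v;
    assert(splat_shifts(c) == splat_two_steps(c));
    assert(splat_shifts(c) == splat_multiply(c));
  }
}
```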
				731
				732	//===---------------------------------------------------------------------===//
				733
Chris Lattner	f70107f	2008-03-20 04:46:13 +0000	[diff] [blame]	734	We do a number of simplifications in simplify libcalls to strength reduce
				735	standard library functions, but we don't currently merge them together. For
				736	example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy. This can only
				737	be done safely if "b" isn't modified between the strlen and memcpy of course.
				738
				739	//===---------------------------------------------------------------------===//
				740
Chris Lattner	26e150f	2008-08-10 01:14:08 +0000	[diff] [blame]	741	We compile this program: (from GCC PR11680)
				742	http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487
				743
				744	Into code that runs the same speed in fast/slow modes, but both modes run 2x
				745	slower than when compiled with GCC (either 4.0 or 4.2):
				746
				747	$ llvm-g++ perf.cpp -O3 -fno-exceptions
				748	$ time ./a.out fast
				749	1.821u 0.003s 0:01.82 100.0% 0+0k 0+0io 0pf+0w
				750
				751	$ g++ perf.cpp -O3 -fno-exceptions
				752	$ time ./a.out fast
				753	0.821u 0.001s 0:00.82 100.0% 0+0k 0+0io 0pf+0w
				754
				755	It looks like we are making the same inlining decisions, so this may be raw
				756	codegen badness or something else (haven't investigated).
				757
				758	//===---------------------------------------------------------------------===//
				759
Chris Lattner	26e150f	2008-08-10 01:14:08 +0000	[diff] [blame]	760	Divisibility by constant can be simplified (according to GCC PR12849) from
				761	being a mulhi to being a mul lo (cheaper). Testcase:
				762
				763	void bar(unsigned n) {
				764	if (n % 3 == 0)
				765	true();
				766	}
				767
Eli Friedman	bcae205	2009-12-12 23:23:43 +0000	[diff] [blame]	768	This is equivalent to the following, where 2863311531 is the multiplicative
				769	inverse of 3, and 1431655766 is ((2^32)-1)/3+1:
				770	void bar(unsigned n) {
				771	if (n * 2863311531U < 1431655766U)
				772	true();
				773	}
				774
				775	The same transformation can work with an even modulo with the addition of a
				776	rotate: rotate the result of the multiply to the right by the number of bits
				777	which need to be zero for the condition to be true, and shrink the compare RHS
				778	by the same amount. Unless the target supports rotates, though, that
				779	transformation probably isn't worthwhile.
				780
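				781	The transformation can also be made to work with non-zero equality
				782	comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0".
					For unsigned n this needs a guard for n == 0, where n-1 wraps around to
					2^32-1, which is itself a multiple of 3.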
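A sketch of both the odd and the even case, assuming 32-bit unsigned
arithmetic. The helper rotr32 and the function names are hypothetical; the
constants for 3 are the ones quoted above, and the bound for 6 is that bound
shifted right by one bit.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right by k bits (0 <= k < 32). */
static uint32_t rotr32(uint32_t x, unsigned k) {
  return (x >> k) | (x << ((32 - k) & 31));
}

/* n % 3 == 0, using the inverse of 3 modulo 2^32. */
int divisible_by_3(uint32_t n) {
  return n * UINT32_C(2863311531) < UINT32_C(1431655766);
}

/* n % 6 == 0: 6 = 2 * 3, so one low bit of the product must be zero.
   Rotating right by 1 moves that bit to the top, and the compare bound
   shrinks by the same amount (1431655766 >> 1). */
int divisible_by_6(uint32_t n) {
  return rotr32(n * UINT32_C(2863311531), 1) < UINT32_C(715827883);
}

/* Compare against the plain remainder at both ends of the range. */
void check_divisibility(void) {
  for (uint32_t n = 0; n < 100000; n++) {
    uint32_t hi = UINT32_MAX - n;
    assert(divisible_by_3(n) == (n % 3 == 0));
    assert(divisible_by_6(n) == (n % 6 == 0));
    assert(divisible_by_3(hi) == (hi % 3 == 0));
    assert(divisible_by_6(hi) == (hi % 6 == 0));
  }
}
```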
Chris Lattner	26e150f	2008-08-10 01:14:08 +0000	[diff] [blame]	783
				784	//===---------------------------------------------------------------------===//
Chris Lattner	23f35bc	2008-08-19 06:22:16 +0000	[diff] [blame]	785
Chris Lattner	db03983	2008-10-15 16:06:03 +0000	[diff] [blame]	786	Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
				787	bunch of other stuff from this example (see PR1604):
				788
				789	#include <cstdio>
				790	struct test {
				791	int val;
				792	virtual ~test() {}
				793	};
				794
				795	int main() {
				796	test t;
				797	std::scanf("%d", &t.val);
				798	std::printf("%d\n", t.val);
				799	}
				800
				801	//===---------------------------------------------------------------------===//
				802
Nick Lewycky	d2f0db1	2008-11-27 22:41:45 +0000	[diff] [blame]	803	These functions perform the same computation, but produce different assembly.
Nick Lewycky	df563ca	2008-11-27 22:12:22 +0000	[diff] [blame]	804
				805	define i8 @select(i8 %x) readnone nounwind {
				806	%A = icmp ult i8 %x, 250
				807	%B = select i1 %A, i8 0, i8 1
				808	ret i8 %B
				809	}
				810
				811	define i8 @addshr(i8 %x) readnone nounwind {
				812	%A = zext i8 %x to i9
				813	%B = add i9 %A, 6 ;; 256 - 250 == 6
				814	%C = lshr i9 %B, 8
				815	%D = trunc i9 %C to i8
				816	ret i8 %D
				817	}
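
For reference, a C rendering of the two functions, with the i9 arithmetic
modeled by a wider unsigned type (an assumption of this sketch; the names are
hypothetical). The check confirms they agree on all 256 inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors @select: compare against 250, then pick 0 or 1. */
uint8_t select_form(uint8_t x) { return x < 250 ? 0 : 1; }

/* Mirrors @addshr: widen, add 256 - 250 == 6, and keep bit 8 (the carry). */
uint8_t addshr_form(uint8_t x) {
  return (uint8_t)(((unsigned)x + 6u) >> 8);
}

/* Exhaustive over the whole i8 domain. */
void check_select_addshr(void) {
  for (unsigned v = 0; v < 256; v++)
    assert(select_form((uint8_t)v) == addshr_form((uint8_t)v));
}
```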
				818
				819	//===---------------------------------------------------------------------===//
Eli Friedman	4e16b29	2008-11-30 07:36:04 +0000	[diff] [blame]	820
				821	From gcc bug 24696:
				822	int
				823	f (unsigned long a, unsigned long b, unsigned long c)
				824	{
				825	return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
				826	}
				827	int
				828	f (unsigned long a, unsigned long b, unsigned long c)
				829	{
				830	return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
				831	}
				832	Both should combine to ((a|b) & (c-1)) != 0. Currently not optimized with
				833	"clang -emit-llvm-bc | opt -std-compile-opts".
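
A quick check of the combined form against the logical-or version, over a
small range of inputs. The function names are hypothetical.

```c
#include <assert.h>

/* The form from the bug report (logical-or variant). */
int f_separate(unsigned long a, unsigned long b, unsigned long c) {
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}

/* The combined form: one mask, one compare. */
int f_combined(unsigned long a, unsigned long b, unsigned long c) {
  return ((a | b) & (c - 1)) != 0;
}

/* Small exhaustive sweep, including c == 0 where c - 1 is all ones. */
void check_combined_mask(void) {
  for (unsigned long a = 0; a < 16; a++)
    for (unsigned long b = 0; b < 16; b++)
      for (unsigned long c = 0; c < 17; c++)
        assert(f_separate(a, b, c) == f_combined(a, b, c));
}
```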
				834
				835	//===---------------------------------------------------------------------===//
				836
				837	From GCC Bug 20192:
				838	#define PMD_MASK (~((1UL << 23) - 1))
				839	void clear_pmd_range(unsigned long start, unsigned long end)
				840	{
				841	if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
				842	f();
				843	}
				844	The expression should optimize to something like
				845	"!((start|end)&~PMD_MASK)". Currently not optimized with "clang
				846	-emit-llvm-bc | opt -std-compile-opts".
				847
				848	//===---------------------------------------------------------------------===//
				849
Eli Friedman	4e16b29	2008-11-30 07:36:04 +0000	[diff] [blame]	850	unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return
				851	i;}
				852	unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
				853	These should combine to the same thing. Currently, the first function
				854	produces better code on X86.
				855
				856	//===---------------------------------------------------------------------===//
				857
Eli Friedman	4e16b29	2008-11-30 07:36:04 +0000	[diff] [blame]	858	From GCC Bug 15784:
				859	#define abs(x) x>0?x:-x
				860	int f(int x, int y)
				861	{
				862	return (abs(x)) >= 0;
				863	}
				864	This should optimize to x != INT_MIN. (With -fwrapv.) Currently not
				865	optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
				866
				867	//===---------------------------------------------------------------------===//
				868
				869	From GCC Bug 14753:
				870	void
				871	rotate_cst (unsigned int a)
				872	{
				873	a = (a << 10) | (a >> 22);
				874	if (a == 123)
				875	bar ();
				876	}
				877	void
				878	minus_cst (unsigned int a)
				879	{
				880	unsigned int tem;
				881
				882	tem = 20 - a;
				883	if (tem == 5)
				884	bar ();
				885	}
				886	void
				887	mask_gt (unsigned int a)
				888	{
				889	/* This is equivalent to a > 15. */
				890	if ((a & ~7) > 8)
				891	bar ();
				892	}
				893	void
				894	rshift_gt (unsigned int a)
				895	{
				896	/* This is equivalent to a > 23. */
				897	if ((a >> 2) > 5)
				898	bar ();
				899	}
Chris Lattner	cb40195	2011-02-17 01:43:46 +0000	[diff] [blame]	900
				901	void neg_eq_cst(unsigned int a) {
				902	if (-a == 123)
				903	bar();
				904	}
				905
Eli Friedman	4e16b29	2008-11-30 07:36:04 +0000	[diff] [blame]	906	All should simplify to a single comparison. All of these are
				907	currently not optimized with "clang -emit-llvm-bc | opt
				908	-std-compile-opts".
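
A sketch of the single comparison each test could reduce to, assuming 32-bit
unsigned. The tests are written as predicates returning the branch condition,
and the *_simple names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

int rotate_cst(uint32_t a) { a = (a << 10) | (a >> 22); return a == 123; }
int rotate_cst_simple(uint32_t a) { return a == (UINT32_C(123) << 22); }

int minus_cst(uint32_t a) { return 20 - a == 5; }
int minus_cst_simple(uint32_t a) { return a == 15; }

int mask_gt(uint32_t a) { return (a & ~UINT32_C(7)) > 8; }
int mask_gt_simple(uint32_t a) { return a > 15; }

int rshift_gt(uint32_t a) { return (a >> 2) > 5; }
int rshift_gt_simple(uint32_t a) { return a > 23; }

int neg_eq_cst(uint32_t a) { return 0u - a == 123; }
int neg_eq_cst_simple(uint32_t a) { return a == 0u - 123u; }

/* Compare each pair on one input. */
static void check_one(uint32_t a) {
  assert(rotate_cst(a) == rotate_cst_simple(a));
  assert(minus_cst(a) == minus_cst_simple(a));
  assert(mask_gt(a) == mask_gt_simple(a));
  assert(rshift_gt(a) == rshift_gt_simple(a));
  assert(neg_eq_cst(a) == neg_eq_cst_simple(a));
}

/* Small values plus the interesting constants and extremes. */
void check_single_comparisons(void) {
  const uint32_t special[] = {UINT32_C(123) << 22, 0u - 123u,
                              UINT32_C(0x80000000), UINT32_MAX};
  for (uint32_t a = 0; a < 4096; a++)
    check_one(a);
  for (unsigned k = 0; k < sizeof special / sizeof special[0]; k++)
    check_one(special[k]);
}
```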
				909
				910	//===---------------------------------------------------------------------===//
				911
				912	From GCC Bug 32605:
				913	int c(int* x) {return (char)x+2 == (char)x;}
				914	Should combine to 0. Currently not optimized with "clang
				915	-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).
				916
				917	//===---------------------------------------------------------------------===//
				918
Eli Friedman	4e16b29	2008-11-30 07:36:04 +0000	[diff] [blame]	919	int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
				920	Should be combined to "((b >> 1) | b) & 1". Currently not optimized
				921	with "clang -emit-llvm-bc | opt -std-compile-opts".
				922
				923	//===---------------------------------------------------------------------===//
				924
				925	unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
				926	Should combine to "x | (y & 3)". Currently not optimized with "clang
				927	-emit-llvm-bc | opt -std-compile-opts".
				928
				929	//===---------------------------------------------------------------------===//
				930
Eli Friedman	4e16b29	2008-11-30 07:36:04 +0000	[diff] [blame]	931	int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
				932	Should fold to "(~a & c) | (a & b)". Currently not optimized with
				933	"clang -emit-llvm-bc | opt -std-compile-opts".
				934
				935	//===---------------------------------------------------------------------===//
				936
				937	int a(int a,int b) {return (~(a|b))|a;}
				938	Should fold to "a|~b". Currently not optimized with "clang
				939	-emit-llvm-bc | opt -std-compile-opts".
				940
				941	//===---------------------------------------------------------------------===//
				942
				943	int a(int a, int b) {return (a&&b) || (a&&!b);}
				944	Should fold to "a". Currently not optimized with "clang -emit-llvm-bc
				945	| opt -std-compile-opts".
				946
				947	//===---------------------------------------------------------------------===//
				948
				949	int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
				950	Should fold to "a ? b : c", or at least something sane. Currently not
				951	optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
				952
				953	//===---------------------------------------------------------------------===//
				954
				955	int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
				956	Should fold to a && (b || c). Currently not optimized with "clang
				957	-emit-llvm-bc | opt -std-compile-opts".
				958
				959	//===---------------------------------------------------------------------===//
				960
				961	int a(int x) {return x | ((x & 8) ^ 8);}
				962	Should combine to x | 8. Currently not optimized with "clang
				963	-emit-llvm-bc | opt -std-compile-opts".
				964
				965	//===---------------------------------------------------------------------===//
				966
				967	int a(int x) {return x ^ ((x & 8) ^ 8);}
				968	Should also combine to x | 8. Currently not optimized with "clang
				969	-emit-llvm-bc | opt -std-compile-opts".
				970
				971	//===---------------------------------------------------------------------===//
				972
Eli Friedman	4e16b29	2008-11-30 07:36:04 +0000	[diff] [blame]	973	int a(int x) {return ((x | -9) ^ 8) & x;}
				974	Should combine to x & -9. Currently not optimized with "clang
				975	-emit-llvm-bc | opt -std-compile-opts".
				976
				977	//===---------------------------------------------------------------------===//
				978
				979	unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
				980	Should combine to "a * 0x88888888 >> 31". Currently not optimized
				981	with "clang -emit-llvm-bc | opt -std-compile-opts".
				982
				983	//===---------------------------------------------------------------------===//
				984
				985	unsigned a(char* x) {if ((*x & 32) == 0) return b();}
				986	There's an unnecessary zext in the generated code with "clang
				987	-emit-llvm-bc | opt -std-compile-opts".
				988
				989	//===---------------------------------------------------------------------===//
				990
				991	unsigned a(unsigned long long x) {return 40 * (x >> 1);}
				992	Should combine to "20 * (((unsigned)x) & -2)". Currently not
				993	optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
				994
				995	//===---------------------------------------------------------------------===//
Bill Wendling	3bdcda8	2008-12-02 05:12:47 +0000	[diff] [blame]	996
Chris Lattner	88d84b2	2008-12-02 06:32:34 +0000	[diff] [blame]	997	This was noticed in the entryblock for grokdeclarator in 403.gcc:
				998
				999	%tmp = icmp eq i32 %decl_context, 4
				1000	%decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
				1001	%tmp1 = icmp eq i32 %decl_context_addr.0, 1
				1002	%decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0
				1003
				1004	tmp1 should be simplified to something like:
				1005	(!tmp && decl_context == 1)
				1006
				1007	This allows recursive simplifications, tmp1 is used all over the place in
				1008	the function, e.g. by:
				1009
				1010	%tmp23 = icmp eq i32 %decl_context_addr.1, 0 ; <i1> [#uses=1]
				1011	%tmp24 = xor i1 %tmp1, true ; <i1> [#uses=1]
				1012	%or.cond8 = and i1 %tmp23, %tmp24 ; <i1> [#uses=1]
				1013
				1014	later.
				1015
Chris Lattner	78a7e7c	2008-12-06 19:28:22 +0000	[diff] [blame]	1016	//===---------------------------------------------------------------------===//
				1017
Chris Lattner	d4137f4	2009-11-29 02:19:52 +0000	[diff] [blame]	1018	[STORE SINKING]
				1019
Chris Lattner	78a7e7c	2008-12-06 19:28:22 +0000	[diff] [blame]	1020	Store sinking: This code:
				1021
				1022	void f (int n, int *cond, int *res) {
				1023	int i;
				1024	*res = 0;
				1025	for (i = 0; i < n; i++)
				1026	if (*cond)
				1027	*res ^= 234; /* (*) */
				1028	}
				1029
				1030	On this function GVN hoists the fully redundant value of *res, but nothing
				1031	moves the store out. This gives us this code:
				1032
				1033	bb: ; preds = %bb2, %entry
				1034	%.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
				1035	%i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
				1036	%1 = load i32* %cond, align 4
				1037	%2 = icmp eq i32 %1, 0
				1038	br i1 %2, label %bb2, label %bb1
				1039
				1040	bb1: ; preds = %bb
				1041	%3 = xor i32 %.rle, 234
				1042	store i32 %3, i32* %res, align 4
				1043	br label %bb2
				1044
				1045	bb2: ; preds = %bb, %bb1
				1046	%.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
				1047	%indvar.next = add i32 %i.05, 1
				1048	%exitcond = icmp eq i32 %indvar.next, %n
				1049	br i1 %exitcond, label %return, label %bb
				1050
				1051	DSE should sink partially dead stores to get the store out of the loop.
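
In source terms, the result we want looks roughly like the second function
below. This is only a sketch: it assumes cond and res do not alias (spelled
here with restrict), and the function names are hypothetical.

```c
#include <assert.h>

/* As in the text: the store to *res stays inside the loop. */
void f_store_in_loop(int n, const int *cond, int *res) {
  *res = 0;
  for (int i = 0; i < n; i++)
    if (*cond)
      *res ^= 234;
}

/* Store sunk out of the loop: the running value lives in a local and is
   written back once. Only valid if cond and res do not alias. */
void f_store_sunk(int n, const int *restrict cond, int *restrict res) {
  int tmp = 0;
  for (int i = 0; i < n; i++)
    if (*cond)
      tmp ^= 234;
  *res = tmp;
}

/* Both versions must leave the same value in *res. */
void check_store_sinking(void) {
  for (int n = 0; n < 5; n++)
    for (int c = 0; c < 2; c++) {
      int r1 = -1, r2 = -1;
      f_store_in_loop(n, &c, &r1);
      f_store_sunk(n, &c, &r2);
      assert(r1 == r2);
    }
}
```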
				1052
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1053	Here's another partially dead store case:
				1054	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395
				1055
Chris Lattner	78a7e7c	2008-12-06 19:28:22 +0000	[diff] [blame]	1056	//===---------------------------------------------------------------------===//
				1057
				1058	Scalar PRE hoists the mul in the common block up to the else:
				1059
				1060	int test (int a, int b, int c, int g) {
				1061	int d, e;
				1062	if (a)
				1063	d = b * c;
				1064	else
				1065	d = b - c;
				1066	e = b * c + g;
				1067	return d + e;
				1068	}
				1069
				1070	It would be better to do the mul once to reduce codesize above the if.
				1071	This is GCC PR38204.
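
A sketch of the preferred shape, with the multiply done once above the if
(the function names are hypothetical):

```c
#include <assert.h>

/* The testcase as written. */
int test_original(int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

/* One multiply, computed before the branch and reused on both paths. */
int test_mul_once(int a, int b, int c, int g) {
  int prod = b * c;
  int d = a ? prod : b - c;
  int e = prod + g;
  return d + e;
}

/* Small sweep over inputs that cannot overflow. */
void check_mul_once(void) {
  for (int a = 0; a < 2; a++)
    for (int b = -4; b <= 4; b++)
      for (int c = -4; c <= 4; c++)
        for (int g = -2; g <= 2; g++)
          assert(test_original(a, b, c, g) == test_mul_once(a, b, c, g));
}
```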
				1072
Chris Lattner	cce240d	2011-01-06 07:41:22 +0000	[diff] [blame]	1073
				1074	//===---------------------------------------------------------------------===//
				1075	This simple function from 179.art:
				1076
				1077	int winner, numf2s;
				1078	struct { double y; int reset; } *Y;
				1079
				1080	void find_match() {
				1081	int i;
				1082	winner = 0;
				1083	for (i=0;i<numf2s;i++)
				1084	if (Y[i].y > Y[winner].y)
				1085	winner =i;
				1086	}
				1087
				1088	Compiles into (with clang TBAA):
				1089
				1090	for.body: ; preds = %for.inc, %bb.nph
				1091	%indvar = phi i64 [ 0, %bb.nph ], [ %indvar.next, %for.inc ]
				1092	%i.01718 = phi i32 [ 0, %bb.nph ], [ %i.01719, %for.inc ]
				1093	%tmp4 = getelementptr inbounds %struct.anon* %tmp3, i64 %indvar, i32 0
				1094	%tmp5 = load double* %tmp4, align 8, !tbaa !4
				1095	%idxprom7 = sext i32 %i.01718 to i64
				1096	%tmp10 = getelementptr inbounds %struct.anon* %tmp3, i64 %idxprom7, i32 0
				1097	%tmp11 = load double* %tmp10, align 8, !tbaa !4
				1098	%cmp12 = fcmp ogt double %tmp5, %tmp11
				1099	br i1 %cmp12, label %if.then, label %for.inc
				1100
				1101	if.then: ; preds = %for.body
				1102	%i.017 = trunc i64 %indvar to i32
				1103	br label %for.inc
				1104
				1105	for.inc: ; preds = %for.body, %if.then
				1106	%i.01719 = phi i32 [ %i.01718, %for.body ], [ %i.017, %if.then ]
				1107	%indvar.next = add i64 %indvar, 1
				1108	%exitcond = icmp eq i64 %indvar.next, %tmp22
				1109	br i1 %exitcond, label %for.cond.for.end_crit_edge, label %for.body
				1110
				1111
				1112	It is good that we hoisted the reloads of numf2s and Y out of the loop and
				1113	sunk the store to winner out.
				1114
				1115	However, this is awful on several levels: the conditional truncate in the loop
				1116	(-indvars at fault? why can't we completely promote the IV to i64?).
				1117
				1118	Beyond that, we have a partially redundant load in the loop: if "winner" (aka
				1119	%i.01718) isn't updated, we reload Y[winner].y the next time through the loop.
				1120	Similarly, the addressing that feeds it (including the sext) is redundant. In
				1121	the end we get this generated assembly:
				1122
				1123	LBB0_2: ## %for.body
				1124	## =>This Inner Loop Header: Depth=1
				1125	movsd (%rdi), %xmm0
				1126	movslq %edx, %r8
				1127	shlq $4, %r8
				1128	ucomisd (%rcx,%r8), %xmm0
				1129	jbe LBB0_4
				1130	movl %esi, %edx
				1131	LBB0_4: ## %for.inc
				1132	addq $16, %rdi
				1133	incq %rsi
				1134	cmpq %rsi, %rax
				1135	jne LBB0_2
				1136
				1137	All things considered this isn't too bad, but we shouldn't need the movslq or
				1138	the shlq instruction, or the load folded into ucomisd every time through the
				1139	loop.
				1140
				1141	On an x86-specific topic, if the loop can't be restructured, the movl should be a
				1142	cmov.
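
One possible restructuring, as a sketch only: keep Y[winner].y in a local so
the partially redundant load and its addressing leave the loop. The functions
below take Y and numf2s as parameters rather than globals, and their names
are hypothetical.

```c
#include <assert.h>

struct elem { double y; int reset; };

/* The loop as written: Y[winner].y is reloaded on every iteration. */
int find_match_original(const struct elem *Y, int numf2s) {
  int winner = 0;
  for (int i = 0; i < numf2s; i++)
    if (Y[i].y > Y[winner].y)
      winner = i;
  return winner;
}

/* Best value cached in a local; it is updated only when winner changes. */
int find_match_cached(const struct elem *Y, int numf2s) {
  int winner = 0;
  if (numf2s <= 0)
    return winner;
  double best = Y[0].y;
  for (int i = 0; i < numf2s; i++) {
    double cur = Y[i].y;
    if (cur > best) {
      best = cur;
      winner = i;
    }
  }
  return winner;
}

/* Both versions must pick the same index, including on ties. */
void check_find_match(void) {
  const struct elem data[] = {{1.0, 0}, {3.5, 0}, {2.0, 0},
                              {3.5, 0}, {7.25, 0}, {-1.0, 0}};
  for (int n = 0; n <= 6; n++)
    assert(find_match_original(data, n) == find_match_cached(data, n));
}
```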
				1143
Chris Lattner	78a7e7c	2008-12-06 19:28:22 +0000	[diff] [blame]	1144	//===---------------------------------------------------------------------===//
				1145
Chris Lattner	d4137f4	2009-11-29 02:19:52 +0000	[diff] [blame]	1146	[STORE SINKING]
				1147
Chris Lattner	78a7e7c	2008-12-06 19:28:22 +0000	[diff] [blame]	1148	GCC PR37810 is an interesting case where we should sink load/store reload
				1149	into the if block and outside the loop, so we don't reload/store it on the
				1150	non-call path.
				1151
				1152	for () {
				1153	*P += 1;
				1154	if ()
				1155	call();
				1156	else
				1157	...
				1158	->
				1159	tmp = *P
				1160	for () {
				1161	tmp += 1;
				1162	if () {
				1163	*P = tmp;
				1164	call();
				1165	tmp = *P;
				1166	} else ...
				1167	}
				1168	*P = tmp;
				1169
Chris Lattner	8f416f3	2008-12-15 07:49:24 +0000	[diff] [blame]	1170	We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
				1171	we don't sink the store. We need partially dead store sinking.
				1172
Chris Lattner	78a7e7c	2008-12-06 19:28:22 +0000	[diff] [blame]	1173	//===---------------------------------------------------------------------===//
				1174
Chris Lattner	d4137f4	2009-11-29 02:19:52 +0000	[diff] [blame]	1175	[LOAD PRE CRIT EDGE SPLITTING]
Chris Lattner	8f416f3	2008-12-15 07:49:24 +0000	[diff] [blame]	1176
Chris Lattner	78a7e7c	2008-12-06 19:28:22 +0000	[diff] [blame]	1177	GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
				1178	leading to excess stack traffic. This could be handled by GVN with some crazy
				1179	symbolic phi translation. The code we get looks like (g is on the stack):
				1180
				1181	bb2: ; preds = %bb1
				1182	..
				1183	%9 = getelementptr %struct.f* %g, i32 0, i32 0
				1184	store i32 %8, i32* %9, align 4
					br label %bb3
				1185
				1186	bb3: ; preds = %bb1, %bb2, %bb
				1187	%c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
				1188	%b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
				1189	%10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
				1190	%11 = load i32* %10, align 4
				1191
Chris Lattner	6d94926	2009-11-27 16:53:57 +0000	[diff] [blame]	1192	%11 is partially redundant, and in bb2 it should have the value %8.
Chris Lattner	78a7e7c	2008-12-06 19:28:22 +0000	[diff] [blame]	1193
Chris Lattner	d4137f4	2009-11-29 02:19:52 +0000	[diff] [blame]	1194	GCC PR33344 and PR35287 are similar cases.
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1195
Chris Lattner	6c9fab7	2009-11-05 18:19:19 +0000	[diff] [blame]	1196
				1197	//===---------------------------------------------------------------------===//
				1198
Chris Lattner	d4137f4	2009-11-29 02:19:52 +0000	[diff] [blame]	1199	[LOAD PRE]
				1200
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1201	There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
Chris Lattner	d4137f4	2009-11-29 02:19:52 +0000	[diff] [blame]	1202	GCC testsuite, ones we don't get yet are (checked through loadpre25):
				1203
				1204	[CRIT EDGE BREAKING]
				1205	loadpre3.c predcom-4.c
				1206
				1207	[PRE OF READONLY CALL]
				1208	loadpre5.c
				1209
				1210	[TURN SELECT INTO BRANCH]
				1211	loadpre14.c loadpre15.c
				1212
				1213	actually a conditional increment: loadpre18.c loadpre19.c
				1214
Chris Lattner	2fc36e1	2010-12-15 06:38:24 +0000	[diff] [blame]	1215	//===---------------------------------------------------------------------===//
				1216
				1217	[LOAD PRE / STORE SINKING / SPEC HACK]
				1218
				1219	This is a chunk of code from 456.hmmer:
				1220
				1221	int f(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int *dpp,
				1222	int *tpdm, int xmb, int *bp, int *ms) {
				1223	int k, sc;
				1224	for (k = 1; k <= M; k++) {
				1225	mc[k] = mpp[k-1] + tpmm[k-1];
				1226	if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
				1227	if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
				1228	if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
				1229	mc[k] += ms[k];
				1230	}
				1231	}
				1232
				1233	It is very profitable for this benchmark to turn the conditional stores to mc[k]
				1234	into a conditional move (select instr in IR) and allow the final store to do the
				1235	store. See GCC PR27313 for more details. Note that this is valid to xform even
				1236	with the new C++ memory model, since mc[k] is previously loaded and later
				1237	stored.
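
A sketch of the transformed loop: the running maximum lives in a local, each
conditional store becomes a select, and mc[k] is stored once per iteration.
It assumes mc does not overlap any of the input arrays; the function names
and the const qualifiers are additions for illustration.

```c
#include <assert.h>

/* The loop as written, with conditional stores to mc[k]. */
void hmmer_cond_stores(int M, int *mc, const int *mpp, const int *tpmm,
                       const int *ip, const int *tpim, const int *dpp,
                       const int *tpdm, int xmb, const int *bp,
                       const int *ms) {
  int k, sc;
  for (k = 1; k <= M; k++) {
    mc[k] = mpp[k-1] + tpmm[k-1];
    if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
    mc[k] += ms[k];
  }
}

/* Selects instead of conditional stores; one store to mc[k] per trip. */
void hmmer_select(int M, int *mc, const int *mpp, const int *tpmm,
                  const int *ip, const int *tpim, const int *dpp,
                  const int *tpdm, int xmb, const int *bp, const int *ms) {
  for (int k = 1; k <= M; k++) {
    int v = mpp[k-1] + tpmm[k-1];
    int sc;
    sc = ip[k-1] + tpim[k-1];  v = sc > v ? sc : v;
    sc = dpp[k-1] + tpdm[k-1]; v = sc > v ? sc : v;
    sc = xmb + bp[k];          v = sc > v ? sc : v;
    mc[k] = v + ms[k];
  }
}

/* Run both on the same small inputs and compare the outputs. */
void check_hmmer(void) {
  const int mpp[5] = {3, -1, 4, 1, -5}, tpmm[5] = {2, 7, -1, 8, 2};
  const int ip[5] = {9, 2, -6, 5, 3}, tpim[5] = {-3, 1, 4, -1, 5};
  const int dpp[5] = {1, 6, 1, -8, 0}, tpdm[5] = {0, -3, 9, 9, 3};
  const int bp[5] = {7, -2, 11, 0, 4}, ms[5] = {1, 2, 3, 4, 5};
  int m1[5] = {0}, m2[5] = {0};
  hmmer_cond_stores(4, m1, mpp, tpmm, ip, tpim, dpp, tpdm, 2, bp, ms);
  hmmer_select(4, m2, mpp, tpmm, ip, tpim, dpp, tpdm, 2, bp, ms);
  for (int k = 0; k < 5; k++)
    assert(m1[k] == m2[k]);
}
```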
Chris Lattner	d4137f4	2009-11-29 02:19:52 +0000	[diff] [blame]	1238
				1239	//===---------------------------------------------------------------------===//
				1240
				1241	[SCALAR PRE]
				1242	There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the
				1243	GCC testsuite.
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1244
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1245	//===---------------------------------------------------------------------===//
				1246
				1247	There are some interesting cases in testsuite/gcc.dg/tree-ssa/predcom-* in the
Chris Lattner	d4137f4	2009-11-29 02:19:52 +0000	[diff] [blame]	1248	GCC testsuite. For example, we get the first example in predcom-1.c, but
				1249	miss the second one:
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1250
Chris Lattner	d4137f4	2009-11-29 02:19:52 +0000	[diff] [blame]	1251	unsigned fib[1000];
				1252	unsigned avg[1000];
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1253
Chris Lattner	d4137f4	2009-11-29 02:19:52 +0000	[diff] [blame]	1254	__attribute__ ((noinline))
				1255	void count_averages(int n) {
				1256	int i;
				1257	for (i = 1; i < n; i++)
				1258	avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;
				1259	}
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1260
Chris Lattner	d4137f4	2009-11-29 02:19:52 +0000	[diff] [blame]	1261	which compiles into two loads instead of one in the loop.
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1262
Chris Lattner	d4137f4	2009-11-29 02:19:52 +0000	[diff] [blame]	1263	predcom-2.c is the same as predcom-1.c
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1264
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1265	predcom-3.c is very similar but needs loads feeding each other instead of
				1266	store->load.
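
A sketch of what predictive commoning would make of count_averages: the two
values already loaded on the previous trip are carried in locals, leaving one
load of fib per iteration. The arrays are passed as parameters here, the
function names are hypothetical, and fib and avg are assumed not to overlap.

```c
#include <assert.h>

/* Three loads of fib per iteration, as in the source. */
void averages_plain(const unsigned *fib, unsigned *avg, int n) {
  for (int i = 1; i < n; i++)
    avg[i] = (unsigned)((((unsigned long)fib[i - 1] + fib[i] + fib[i + 1])
                         / 3) & 0xffff);
}

/* fib[i-1] and fib[i] are carried over from the previous iteration. */
void averages_commoned(const unsigned *fib, unsigned *avg, int n) {
  if (n <= 1)
    return;
  unsigned prev = fib[0], cur = fib[1];
  for (int i = 1; i < n; i++) {
    unsigned next = fib[i + 1]; /* the only load of fib in the loop */
    avg[i] = (unsigned)((((unsigned long)prev + cur + next) / 3) & 0xffff);
    prev = cur;
    cur = next;
  }
}

/* Compare both versions on a short Fibonacci prefix. */
void check_predcom(void) {
  unsigned fib[12] = {0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89};
  unsigned a1[12] = {0}, a2[12] = {0};
  averages_plain(fib, a1, 11);
  averages_commoned(fib, a2, 11);
  for (int i = 0; i < 12; i++)
    assert(a1[i] == a2[i]);
}
```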
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1267
				1268
				1269	//===---------------------------------------------------------------------===//
				1270
Chris Lattner	aa306c2	2010-01-23 17:59:23 +0000	[diff] [blame]	1271	[ALIAS ANALYSIS]
				1272
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1273	Type based alias analysis:
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1274	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705
				1275
Chris Lattner	aa306c2	2010-01-23 17:59:23 +0000	[diff] [blame]	1276	We should do better analysis of posix_memalign. At the least it should
				1277	no-capture its pointer argument, at best, we should know that the out-value
				1278	result doesn't point to anything (like malloc). One example of this is in
				1279	SingleSource/Benchmarks/Misc/dt.c
				1280
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1281	//===---------------------------------------------------------------------===//
				1282
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1283	Interesting missed case because of control flow flattening (should be 2 loads):
				1284	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1285	With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
				1286	opt -mem2reg -gvn -instcombine | llvm-dis
Chris Lattner	d4137f4	2009-11-29 02:19:52 +0000	[diff] [blame]	1287	we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1288	VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1289
				1290	//===---------------------------------------------------------------------===//
				1291
				1292	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
				1293	We could eliminate the branch condition here, since loading from null is undefined:
				1294
				1295	struct S { int w, x, y, z; };
				1296	struct T { int r; struct S s; };
				1297	void bar (struct S, int);
				1298	void foo (int a, struct T b)
				1299	{
				1300	struct S *c = 0;
				1301	if (a)
				1302	c = &b.s;
				1303	bar (*c, a);
				1304	}
				1305
				1306	//===---------------------------------------------------------------------===//
Chris Lattner	88d84b2	2008-12-02 06:32:34 +0000	[diff] [blame]	1307
Chris Lattner	9cf8ef6	2008-12-23 20:52:52 +0000	[diff] [blame]	1308	simplifylibcalls should do several optimizations for strspn/strcspn:
				1309
Chris Lattner	9cf8ef6	2008-12-23 20:52:52 +0000	[diff] [blame]	1310	strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):
				1311
				1312	size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
				1313	int __reject3) {
				1314	register size_t __result = 0;
				1315	while (__s[__result] != '\0' && __s[__result] != __reject1 &&
				1316	__s[__result] != __reject2 && __s[__result] != __reject3)
				1317	++__result;
				1318	return __result;
				1319	}
				1320
				1321	This should turn into a switch on the character. See PR3253 for some notes on
				1322	codegen.
				1323
				1324	456.hmmer apparently uses strcspn and strspn a lot. 471.omnetpp uses strspn.
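
A sketch of the switch form for one constant reject set, here
strcspn(s, "abc"). The function name is hypothetical, and a real
transformation would generate the cases from the constant string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* strcspn(s, "abc") with the reject set folded into a switch. */
size_t strcspn_abc(const char *s) {
  size_t n = 0;
  for (;;) {
    switch (s[n]) {
    case '\0': /* end of string */
    case 'a':
    case 'b':
    case 'c':
      return n;
    default:
      ++n;
    }
  }
}

/* Compare against the library routine on a few strings. */
void check_strcspn_switch(void) {
  const char *cases[] = {"", "xyz", "hello abc", "a", "zzzb", "0123c456"};
  for (unsigned k = 0; k < sizeof cases / sizeof cases[0]; k++)
    assert(strcspn_abc(cases[k]) == strcspn(cases[k], "abc"));
}
```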
				1325
				1326	//===---------------------------------------------------------------------===//
Chris Lattner	d23b799	2008-12-31 00:54:13 +0000	[diff] [blame]	1327
Chris Lattner	d3e768e	2011-03-01 00:24:51 +0000	[diff] [blame^]	1328	simplifylibcalls should turn these snprintf idioms into memcpy (GCC PR47917):
				1329
				1330	char buf1[6], buf2[6], buf3[4], buf4[4];
				1331	int i;
				1332
				1333	int foo (void) {
				1334	int ret = snprintf (buf1, sizeof buf1, "abcde");
				1335	ret += snprintf (buf2, sizeof buf2, "abcdef") * 16;
				1336	ret += snprintf (buf3, sizeof buf3, "%s", i++ < 6 ? "abc" : "def") * 256;
				1337	ret += snprintf (buf4, sizeof buf4, "%s", i++ > 10 ? "abcde" : "defgh")*4096;
				1338	return ret;
				1339	}
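
A sketch of what each call can reduce to when the source string and its
length are known: copy at most size-1 bytes, terminate, and return the
untruncated length. The helper copy_known_string is hypothetical and assumes
size > 0.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for snprintf(dst, size, "%s", src) when len == strlen(src)
   is known at compile time. Requires size > 0. */
int copy_known_string(char *dst, size_t size, const char *src, size_t len) {
  size_t n = len < size ? len : size - 1;
  memcpy(dst, src, n);
  dst[n] = '\0';
  return (int)len;
}

/* Compare with snprintf for the string/size pairs used in foo above. */
void check_snprintf_idioms(void) {
  const char *srcs[] = {"abcde", "abcdef", "abc", "def", "abcde", "defgh"};
  const size_t sizes[] = {6, 6, 4, 4, 4, 4};
  for (int k = 0; k < 6; k++) {
    char a[8], b[8];
    int ra = snprintf(a, sizes[k], "%s", srcs[k]);
    int rb = copy_known_string(b, sizes[k], srcs[k], strlen(srcs[k]));
    assert(ra == rb);
    assert(strcmp(a, b) == 0);
  }
}
```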
				1340
				1341	//===---------------------------------------------------------------------===//
				1342
Chris Lattner	d23b799	2008-12-31 00:54:13 +0000	[diff] [blame]	1343	"gas" uses this idiom:
				1344	else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
				1345	..
				1346	else if (strchr ("<>", *intel_parser.op_string))
				1347
				1348	Those should be turned into a switch.
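
A sketch of the switch form for the second call. Note that strchr also
matches the terminating NUL of the set, so the switch needs a case for '\0'
to stay equivalent. The function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* The idiom as written. */
int is_angle_strchr(int c) { return strchr("<>", c) != NULL; }

/* The same test as a switch; '\0' is included because strchr(s, 0)
   returns a pointer to the terminator rather than NULL. */
int is_angle_switch(int c) {
  switch (c) {
  case '<':
  case '>':
  case '\0':
    return 1;
  default:
    return 0;
  }
}

/* Compare over the ASCII range. */
void check_strchr_switch(void) {
  for (int c = 0; c < 128; c++)
    assert(is_angle_strchr(c) == is_angle_switch(c));
}
```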
				1349
				1350	//===---------------------------------------------------------------------===//
Chris Lattner	ffb08f5	2009-01-08 06:52:57 +0000	[diff] [blame]	1351
				1352	252.eon contains this interesting code:
				1353
				1354	%3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
				1355	%3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
				1356	%strlen = call i32 @strlen(i8* %3072) ; uses = 1
				1357	%endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
				1358	call void @llvm.memcpy.i32(i8* %endptr,
				1359	i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
				1360	%3074 = call i32 @strlen(i8* %endptr) nounwind readonly
				1361
				1362	This is interesting for a couple of reasons. First, the strlen that follows
Benjamin Kramer	9d071cb	2010-12-23 15:32:07 +0000	[diff] [blame]	1364	the memcpy can be replaced with:
Chris Lattner	ffb08f5	2009-01-08 06:52:57 +0000	[diff] [blame]	1365
				1366	%3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly
				1367
				1368	This is valid because the string was just copied into the buffer at %endptr.
				1369	The call, in turn, can be constant folded to "4".
				1370
				1371	In other code, it contains:
				1372
				1373	%endptr6978 = bitcast i8* %endptr69 to i32*
				1374	store i32 7107374, i32* %endptr6978, align 1
				1375	%3167 = call i32 @strlen(i8* %endptr69) nounwind readonly
				1376
				1377	Which could also be constant folded. Whatever is producing this should probably
				1378	be fixed to leave this as a memcpy from a string.
				1379
				1380	Further, eon also has an interesting partially redundant strlen call:
				1381
				1382	bb8: ; preds = %_ZN18eonImageCalculatorC1Ev.exit
%682 = getelementptr i8** %argv, i32 6 ; <i8**> [#uses=2]
				1384	%683 = load i8** %682, align 4 ; <i8*> [#uses=4]
				1385	%684 = load i8* %683, align 1 ; <i8> [#uses=1]
				1386	%685 = icmp eq i8 %684, 0 ; <i1> [#uses=1]
				1387	br i1 %685, label %bb10, label %bb9
				1388
				1389	bb9: ; preds = %bb8
				1390	%686 = call i32 @strlen(i8* %683) nounwind readonly
				1391	%687 = icmp ugt i32 %686, 254 ; <i1> [#uses=1]
				1392	br i1 %687, label %bb10, label %bb11
				1393
				1394	bb10: ; preds = %bb9, %bb8
				1395	%688 = call i32 @strlen(i8* %683) nounwind readonly
				1396
				1397	This could be eliminated by doing the strlen once in bb8, saving code size and
				1398	improving perf on the bb8->9->10 path.
				1399
				1400	//===---------------------------------------------------------------------===//
Chris Lattner	9fee08f	2009-01-08 07:34:55 +0000	[diff] [blame]	1401
				1402	I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
				1403	which looks like:
				1404	%movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0
				1405
				1406
				1407	bb62: ; preds = %bb55, %bb53
				1408	%promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
				1409	%171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
				1410	%172 = add i32 %171, -1 ; <i32> [#uses=1]
				1411	%173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172
				1412
				1413	... no stores ...
				1414	br i1 %or.cond, label %bb65, label %bb72
				1415
				1416	bb65: ; preds = %bb62
				1417	store i8 0, i8* %173, align 1
				1418	br label %bb72
				1419
				1420	bb72: ; preds = %bb65, %bb62
				1421	%trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
				1422	%177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
				1423
Note that on the bb62->bb72 path, the %177 strlen call is redundant with the
%171 call, so %177 is partially redundant overall. At worst, we could shove the
%177 strlen call up into the bb65 block, moving it out of the bb62->bb72 path.
However, note that bb65 stores to the string, zeroing out the last byte. This
means that on that path the value of %177 is actually just %171-1. A sub is
cheaper than a strlen!
				1430
				1431	This pattern repeats several times, basically doing:
				1432
				1433	A = strlen(P);
				1434	P[A-1] = 0;
				1435	B = strlen(P);
				1436	where it is "obvious" that B = A-1.
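A small C sketch of the rewritten pattern, with hypothetical function names.
It relies on the string being non-empty (A >= 1), which the original code
already needs for P[A-1] to be a valid store.

```c
#include <string.h>

/* Chops the last character of a non-empty string and returns the new
   length without calling strlen a second time. */
size_t chop_last(char *p) {
    size_t a = strlen(p);   /* A = strlen(P) */
    p[a - 1] = 0;           /* P[A-1] = 0; requires A >= 1 */
    return a - 1;           /* B = A-1 */
}

/* Returns 1 if the shortcut agrees with a real strlen for input s.
   s must be non-empty and shorter than 64 bytes. */
int chop_agrees_with_strlen(const char *s) {
    char buf[64];
    strcpy(buf, s);
    size_t b = chop_last(buf);
    return b == strlen(buf);
}
```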
				1437
				1438	//===---------------------------------------------------------------------===//
				1439
Chris Lattner	9fee08f	2009-01-08 07:34:55 +0000	[diff] [blame]	1440	186.crafty has this interesting pattern with the "out.4543" variable:
				1441
				1442	call void @llvm.memcpy.i32(
				1443	i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
				1444	i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
				1445	%101 = call@printf(i8* ... @out.4543, i32 0, i32 0)) nounwind
				1446
				1447	It is basically doing:
				1448
				1449	memcpy(globalarray, "string");
				1450	printf(..., globalarray);
				1451
Knowing that printf only reads the memory, we can forward substitute the string
directly into the printf, which eliminates the reads from globalarray.
Since this pattern occurs frequently in crafty (due to the "DisplayTime" and
other similar functions) there are many stores to "out". Once all the printfs
stop using "out", all that is left is the memcpys into it. This should allow
globalopt to remove the "stored only" global.
				1458
				1459	//===---------------------------------------------------------------------===//
				1460
Dan Gohman	8289b05	2009-01-20 01:07:33 +0000	[diff] [blame]	1461	This code:
				1462
				1463	define inreg i32 @foo(i8* inreg %p) nounwind {
				1464	%tmp0 = load i8* %p
				1465	%tmp1 = ashr i8 %tmp0, 5
				1466	%tmp2 = sext i8 %tmp1 to i32
				1467	ret i32 %tmp2
				1468	}
				1469
				1470	could be dagcombine'd to a sign-extending load with a shift.
				1471	For example, on x86 this currently gets this:
				1472
				1473	movb (%eax), %al
				1474	sarb $5, %al
				1475	movsbl %al, %eax
				1476
				1477	while it could get this:
				1478
				1479	movsbl (%eax), %eax
				1480	sarl $5, %eax
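The two instruction orders can be modeled in C to check that they agree. This
is only a sketch: the function names are invented, and it assumes that right
shifts of negative values are arithmetic, which C leaves implementation-defined
but which matches sar on x86.

```c
#include <stdint.h>

/* Shift the byte, then sign-extend (the sarb + movsbl order). */
int32_t shift_then_extend(int8_t v) {
    int8_t narrow = (int8_t)(v >> 5);
    return (int32_t)narrow;
}

/* Sign-extend first, then shift in 32 bits (the movsbl + sarl order). */
int32_t extend_then_shift(int8_t v) {
    return (int32_t)v >> 5;
}

/* Returns 1 if both orders agree for every possible byte value. */
int orders_agree(void) {
    for (int i = -128; i <= 127; ++i)
        if (shift_then_extend((int8_t)i) != extend_then_shift((int8_t)i))
            return 0;
    return 1;
}
```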
				1481
				1482	//===---------------------------------------------------------------------===//
Chris Lattner	256baa4	2009-01-22 07:16:03 +0000	[diff] [blame]	1483
				1484	GCC PR31029:
				1485
				1486	int test(int x) { return 1-x == x; } // --> return false
				1487	int test2(int x) { return 2-x == x; } // --> return x == 1 ?
				1488
Always foldable for odd constants; what is the rule for even?
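One way to explore the even case is to enumerate solutions in a narrower
wrapping type. The sketch below is an illustration and does not come from the
PR; count_solutions16 is a hypothetical helper. With wrapping N-bit arithmetic,
C - x == x means 2*x == C (mod 2^N), which has no solution for odd C and two
solutions for even C: C/2 and C/2 + 2^(N-1). The second one only works because
the subtraction wraps, so for signed int (where overflow is undefined) folding
to x == C/2 looks valid.

```c
#include <stdint.h>

/* Counts the 16-bit values x for which c - x == x with wrapping arithmetic. */
int count_solutions16(uint16_t c) {
    int count = 0;
    for (uint32_t x = 0; x <= 0xFFFFu; ++x)
        if ((uint16_t)(c - x) == (uint16_t)x)
            ++count;
    return count;
}
```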
				1490
				1491	//===---------------------------------------------------------------------===//
				1492
Torok Edwin	e46a686	2009-01-24 19:30:25 +0000	[diff] [blame]	1493	PR 3381: GEP to field of size 0 inside a struct could be turned into GEP
				1494	for next field in struct (which is at same address).
				1495
				1496	For example: store of float into { {{}}, float } could be turned into a store to
				1497	the float directly.
				1498
Torok Edwin	474479f	2009-02-20 18:42:06 +0000	[diff] [blame]	1499	//===---------------------------------------------------------------------===//
Nick Lewycky	20babb1	2009-02-25 06:52:48 +0000	[diff] [blame]	1500
Chris Lattner	32c5f17	2009-05-11 17:41:40 +0000	[diff] [blame]	1501	The arg promotion pass should make use of nocapture to make its alias analysis
				1502	stuff much more precise.
				1503
				1504	//===---------------------------------------------------------------------===//
				1505
				1506	The following functions should be optimized to use a select instead of a
				1507	branch (from gcc PR40072):
				1508
				1509	char char_int(int m) {if(m>7) return 0; return m;}
				1510	int int_char(char m) {if(m>7) return 0; return m;}
				1511
				1512	//===---------------------------------------------------------------------===//
				1513
int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }
				1515
				1516	Generates this:
				1517
				1518	define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
				1519	entry:
				1520	%0 = and i32 %a, 128 ; <i32> [#uses=1]
				1521	%1 = icmp eq i32 %0, 0 ; <i1> [#uses=1]
				1522	%2 = or i32 %b, 128 ; <i32> [#uses=1]
				1523	%3 = and i32 %b, -129 ; <i32> [#uses=1]
				1524	%b_addr.0 = select i1 %1, i32 %3, i32 %2 ; <i32> [#uses=1]
				1525	ret i32 %b_addr.0
				1526	}
				1527
				1528	However, it's functionally equivalent to:
				1529
b = (b & ~0x80) | (a & 0x80);
				1531
				1532	Which generates this:
				1533
				1534	define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
				1535	entry:
				1536	%0 = and i32 %b, -129 ; <i32> [#uses=1]
				1537	%1 = and i32 %a, 128 ; <i32> [#uses=1]
				1538	%2 = or i32 %0, %1 ; <i32> [#uses=1]
				1539	ret i32 %2
				1540	}
				1541
				1542	This can be generalized for other forms:
				1543
b = (b & ~0x80) | (a & 0x40) << 1;
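The branchy and branch-free forms can be compared directly in C. The function
names below are invented for this sketch; the last one models the generalized
form, which copies bit 6 of a into bit 7 of b.

```c
/* Branchy form from the text. */
int func_branch(int a, int b) {
    if (a & 0x80) b |= 0x80; else b &= ~0x80;
    return b;
}

/* Branch-free form: clear bit 7 of b, then copy bit 7 of a into it. */
int func_merge(int a, int b) {
    return (b & ~0x80) | (a & 0x80);
}

/* Generalized form: copy bit 6 of a into bit 7 of b. */
int func_merge_shifted(int a, int b) {
    return (b & ~0x80) | (a & 0x40) << 1;
}

/* Returns 1 if the first two forms agree on all byte-sized inputs. */
int forms_agree(void) {
    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b)
            if (func_branch(a, b) != func_merge(a, b))
                return 0;
    return 1;
}
```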
				1545
				1546	//===---------------------------------------------------------------------===//
Bill Wendling	c872e9c	2009-10-27 23:30:07 +0000	[diff] [blame]	1547
				1548	These two functions produce different code. They shouldn't:
				1549
				1550	#include <stdint.h>
				1551
				1552	uint8_t p1(uint8_t b, uint8_t a) {
b = (b & ~0xc0) | (a & 0xc0);
				1554	return (b);
				1555	}
				1556
				1557	uint8_t p2(uint8_t b, uint8_t a) {
b = (b & ~0x40) | (a & 0x40);
b = (b & ~0x80) | (a & 0x80);
				1560	return (b);
				1561	}
				1562
				1563	define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
				1564	entry:
				1565	%0 = and i8 %b, 63 ; <i8> [#uses=1]
				1566	%1 = and i8 %a, -64 ; <i8> [#uses=1]
				1567	%2 = or i8 %1, %0 ; <i8> [#uses=1]
				1568	ret i8 %2
				1569	}
				1570
				1571	define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
				1572	entry:
				1573	%0 = and i8 %b, 63 ; <i8> [#uses=1]
				1574	%.masked = and i8 %a, 64 ; <i8> [#uses=1]
				1575	%1 = and i8 %a, -128 ; <i8> [#uses=1]
				1576	%2 = or i8 %1, %0 ; <i8> [#uses=1]
				1577	%3 = or i8 %2, %.masked ; <i8> [#uses=1]
				1578	ret i8 %3
				1579	}
				1580
				1581	//===---------------------------------------------------------------------===//
Chris Lattner	6fdfc9c	2009-11-11 17:51:27 +0000	[diff] [blame]	1582
				1583	IPSCCP does not currently propagate argument dependent constants through
functions where it does not know all of the callers. This includes functions
				1585	with normal external linkage as well as templates, C99 inline functions etc.
				1586	Specifically, it does nothing to:
				1587
				1588	define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
				1589	entry:
				1590	%0 = add nsw i32 %y, %z
				1591	%1 = mul i32 %0, %x
				1592	%2 = mul i32 %y, %z
				1593	%3 = add nsw i32 %1, %2
				1594	ret i32 %3
				1595	}
				1596
				1597	define i32 @test2() nounwind {
				1598	entry:
				1599	%0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
				1600	ret i32 %0
				1601	}
				1602
It would be interesting to extend IPSCCP to be able to handle simple cases like
				1604	this, where all of the arguments to a call are constant. Because IPSCCP runs
				1605	before inlining, trivial templates and inline functions are not yet inlined.
				1606	The results for a function + set of constant arguments should be memoized in a
				1607	map.
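For reference, a C rendering of the IR above. Evaluating the call with its
constant arguments gives the value that could be propagated into the caller:
(2 + 4) * 1 + 2 * 4 = 14. This is only an illustration of the expected result,
not of how IPSCCP would compute it.

```c
/* Same computation as @test in the IR above. */
int test(int x, int y, int z) {
    return (y + z) * x + y * z;
}

/* All arguments are constants, so this could become "return 14". */
int test2(void) {
    return test(1, 2, 4);
}
```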
				1608
				1609	//===---------------------------------------------------------------------===//
Chris Lattner	fc926c2	2009-11-11 17:54:02 +0000	[diff] [blame]	1610
				1611	The libcall constant folding stuff should be moved out of SimplifyLibcalls into
				1612	libanalysis' constantfolding logic. This would allow IPSCCP to be able to
				1613	handle simple things like this:
				1614
				1615	static int foo(const char *X) { return strlen(X); }
				1616	int bar() { return foo("abcd"); }
				1617
				1618	//===---------------------------------------------------------------------===//
Nick Lewycky	93f9f7a	2009-11-15 17:51:23 +0000	[diff] [blame]	1619
Duncan Sands	e10920d	2010-01-06 15:37:47 +0000	[diff] [blame]	1620	functionattrs doesn't know much about memcpy/memset. This function should be
Duncan Sands	7c422ac	2010-01-06 08:45:52 +0000	[diff] [blame]	1621	marked readnone rather than readonly, since it only twiddles local memory, but
				1622	functionattrs doesn't handle memset/memcpy/memmove aggressively:
Chris Lattner	89742c2	2009-12-03 07:43:46 +0000	[diff] [blame]	1623
struct X { int *p; int *q; };
				1625	int foo() {
				1626	int i = 0, j = 1;
				1627	struct X x, y;
				1628	int **p;
				1629	y.p = &i;
				1630	x.q = &j;
				1631	p = __builtin_memcpy (&x, &y, sizeof (int *));
				1632	return **p;
				1633	}
				1634
Chris Lattner	9c8fb9e	2011-01-01 22:52:11 +0000	[diff] [blame]	1635	This can be seen at:
$ clang t.c -S -o - -mkernel -O0 -emit-llvm | opt -functionattrs -S
				1637
				1638
Chris Lattner	0533217	2009-12-03 07:41:54 +0000	[diff] [blame]	1639	//===---------------------------------------------------------------------===//
				1640
Eli Friedman	9cfb3ad	2010-01-18 22:36:59 +0000	[diff] [blame]	1641	Missed instcombine transformation:
				1642	define i1 @a(i32 %x) nounwind readnone {
				1643	entry:
				1644	%cmp = icmp eq i32 %x, 30
				1645	%sub = add i32 %x, -30
				1646	%cmp2 = icmp ugt i32 %sub, 9
				1647	%or = or i1 %cmp, %cmp2
				1648	ret i1 %or
				1649	}
				1650	This should be optimized to a single compare. Testcase derived from gcc.
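A possible single-compare form, sketched in C with invented function names.
With y = x - 30, the condition "y == 0 || y > 9" says that y lies outside
[1, 9], which is the unsigned comparison (y - 1) > 8.

```c
/* The two-compare form from the IR. */
int a_two_compares(unsigned x) {
    return (x == 30u) | (x - 30u > 9u);
}

/* Single unsigned compare: x - 31 > 8. */
int a_one_compare(unsigned x) {
    return x - 31u > 8u;
}

/* Returns 1 if both forms agree on small values and on values that wrap. */
int compares_agree(void) {
    for (unsigned i = 0; i < 1000u; ++i) {
        unsigned low = i;        /* values around 30 */
        unsigned high = 0u - i;  /* values near UINT_MAX */
        if (a_two_compares(low) != a_one_compare(low)) return 0;
        if (a_two_compares(high) != a_one_compare(high)) return 0;
    }
    return 1;
}
```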
				1651
				1652	//===---------------------------------------------------------------------===//
				1653
Eli Friedman	9cfb3ad	2010-01-18 22:36:59 +0000	[diff] [blame]	1654	Missed instcombine or reassociate transformation:
				1655	int a(int a, int b) { return (a==12)&(b>47)&(b<58); }
				1656
				1657	The sgt and slt should be combined into a single comparison. Testcase derived
				1658	from gcc.
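As a sketch of the combined form (function names are invented), the pair of
signed compares on b can be written as one unsigned range check after
subtracting the lower bound:

```c
/* Original: two signed compares on b. */
int a_original(int a, int b) {
    return (a == 12) & (b > 47) & (b < 58);
}

/* Combined: b is in [48, 57] exactly when (unsigned)b - 48 < 10. */
int a_combined(int a, int b) {
    return (a == 12) & ((unsigned)b - 48u < 10u);
}

/* Returns 1 if both forms agree over a range of inputs. */
int range_forms_agree(void) {
    for (int a = 11; a <= 13; ++a)
        for (int b = -1000; b <= 1000; ++b)
            if (a_original(a, b) != a_combined(a, b))
                return 0;
    return 1;
}
```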
				1659
				1660	//===---------------------------------------------------------------------===//
				1661
				1662	Missed instcombine transformation:
Chris Lattner	3e41106	2010-11-21 07:05:31 +0000	[diff] [blame]	1663
				1664	%382 = srem i32 %tmp14.i, 64 ; [#uses=1]
				1665	%383 = zext i32 %382 to i64 ; [#uses=1]
				1666	%384 = shl i64 %381, %383 ; [#uses=1]
				1667	%385 = icmp slt i32 %tmp14.i, 64 ; [#uses=1]
				1668
Benjamin Kramer	c21a821	2010-11-23 20:33:57 +0000	[diff] [blame]	1669	The srem can be transformed to an and because if %tmp14.i is negative, the
				1670	shift is undefined. Testcase derived from 403.gcc.
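A C model of the two shift-amount computations, with invented names. It only
exercises non-negative amounts, since a negative amount makes the shift
undefined in both C and the IR, which is the reason the transformation is
allowed.

```c
/* Shift amount computed with a signed remainder, as in the IR. */
unsigned long long shl_srem(unsigned long long v, int n) {
    return v << (n % 64);
}

/* For non-negative n the remainder is just the low six bits. */
unsigned long long shl_and(unsigned long long v, int n) {
    return v << (n & 63);
}

/* Returns 1 if both forms agree for non-negative shift amounts. */
int shl_forms_agree(void) {
    for (int n = 0; n <= 1000; ++n)
        if (shl_srem(0x123456789ULL, n) != shl_and(0x123456789ULL, n))
            return 0;
    return 1;
}
```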
Chris Lattner	3e41106	2010-11-21 07:05:31 +0000	[diff] [blame]	1671
				1672	//===---------------------------------------------------------------------===//
				1673
				1674	This is a range comparison on a divided result (from 403.gcc):
				1675
				1676	%1337 = sdiv i32 %1336, 8 ; [#uses=1]
				1677	%.off.i208 = add i32 %1336, 7 ; [#uses=1]
				1678	%1338 = icmp ult i32 %.off.i208, 15 ; [#uses=1]
				1679
We already catch this (removing the sdiv) if there isn't an add; we should
handle the 'add' as well. This is a common idiom with its builtin_alloca code.
				1682	C testcase:
				1683
				1684	int a(int x) { return (unsigned)(x/16+7) < 15; }
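For the C testcase, a division-free form can be worked out by hand. Since x/16
truncates toward zero, the quotient lies in [-7, 7] exactly when x lies in
[-127, 127], so the test becomes one unsigned range check. The sketch below
(invented names) compares the two:

```c
#include <limits.h>

/* The testcase as written. */
int a_div(int x) {
    return (unsigned)(x / 16 + 7) < 15u;
}

/* Division-free form: x is in [-127, 127]. */
int a_nodiv(int x) {
    return (unsigned)x + 127u < 255u;
}

/* Returns 1 if both forms agree on a wide range plus the extremes. */
int range_checks_agree(void) {
    for (int x = -100000; x <= 100000; ++x)
        if (a_div(x) != a_nodiv(x))
            return 0;
    return a_div(INT_MIN) == a_nodiv(INT_MIN) &&
           a_div(INT_MAX) == a_nodiv(INT_MAX);
}
```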
				1685
				1686	Another similar case involves truncations on 64-bit targets:
				1687
				1688	%361 = sdiv i64 %.046, 8 ; [#uses=1]
				1689	%362 = trunc i64 %361 to i32 ; [#uses=2]
				1690	...
				1691	%367 = icmp eq i32 %362, 0 ; [#uses=1]
				1692
Eli Friedman	1144d7e	2010-01-31 04:55:32 +0000	[diff] [blame]	1693	//===---------------------------------------------------------------------===//
				1694
				1695	Missed instcombine/dagcombine transformation:
				1696	define void @lshift_lt(i8 zeroext %a) nounwind {
				1697	entry:
				1698	%conv = zext i8 %a to i32
				1699	%shl = shl i32 %conv, 3
				1700	%cmp = icmp ult i32 %shl, 33
				1701	br i1 %cmp, label %if.then, label %if.end
				1702
				1703	if.then:
				1704	tail call void @bar() nounwind
				1705	ret void
				1706
				1707	if.end:
				1708	ret void
				1709	}
				1710	declare void @bar() nounwind
				1711
				1712	The shift should be eliminated. Testcase derived from gcc.
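In C terms (function names invented for this sketch), the zero-extended value
shifted left by 3 is below 33 exactly when the original byte is below 5, so the
compare can be done on the byte directly:

```c
/* Compare after the shift, as in the IR. */
int lt_shifted(unsigned char a) {
    return ((unsigned)a << 3) < 33u;
}

/* Compare without the shift: a * 8 < 33 means a <= 4. */
int lt_plain(unsigned char a) {
    return a < 5;
}

/* Returns 1 if both forms agree for every byte value. */
int shift_forms_agree(void) {
    for (int a = 0; a < 256; ++a)
        if (lt_shifted((unsigned char)a) != lt_plain((unsigned char)a))
            return 0;
    return 1;
}
```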
Eli Friedman	9cfb3ad	2010-01-18 22:36:59 +0000	[diff] [blame]	1713
				1714	//===---------------------------------------------------------------------===//
Chris Lattner	cf031f6	2010-02-09 00:11:10 +0000	[diff] [blame]	1715
These compile into different code: one gets recognized as a switch and the
other doesn't due to phase ordering issues (PR6212):
				1718
				1719	int test1(int mainType, int subType) {
				1720	if (mainType == 7)
				1721	subType = 4;
				1722	else if (mainType == 9)
				1723	subType = 6;
				1724	else if (mainType == 11)
				1725	subType = 9;
				1726	return subType;
				1727	}
				1728
				1729	int test2(int mainType, int subType) {
				1730	if (mainType == 7)
				1731	subType = 4;
				1732	if (mainType == 9)
				1733	subType = 6;
				1734	if (mainType == 11)
				1735	subType = 9;
				1736	return subType;
				1737	}
				1738
				1739	//===---------------------------------------------------------------------===//
Chris Lattner	6663670	2010-03-10 21:42:42 +0000	[diff] [blame]	1740
				1741	The following test case (from PR6576):
				1742
				1743	define i32 @mul(i32 %a, i32 %b) nounwind readnone {
				1744	entry:
				1745	%cond1 = icmp eq i32 %b, 0 ; <i1> [#uses=1]
				1746	br i1 %cond1, label %exit, label %bb.nph
				1747	bb.nph: ; preds = %entry
				1748	%tmp = mul i32 %b, %a ; <i32> [#uses=1]
				1749	ret i32 %tmp
				1750	exit: ; preds = %entry
				1751	ret i32 0
				1752	}
				1753
				1754	could be reduced to:
				1755
				1756	define i32 @mul(i32 %a, i32 %b) nounwind readnone {
				1757	entry:
				1758	%tmp = mul i32 %b, %a
				1759	ret i32 %tmp
				1760	}
				1761
				1762	//===---------------------------------------------------------------------===//
				1763
Chris Lattner	9484689	2010-04-16 23:52:30 +0000	[diff] [blame]	1764	We should use DSE + llvm.lifetime.end to delete dead vtable pointer updates.
				1765	See GCC PR34949
				1766
Chris Lattner	c2685a9	2010-05-21 23:16:21 +0000	[diff] [blame]	1767	Another interesting case is that something related could be used for variables
				1768	that go const after their ctor has finished. In these cases, globalopt (which
				1769	can statically run the constructor) could mark the global const (so it gets put
				1770	in the readonly section). A testcase would be:
				1771
				1772	#include <complex>
				1773	using namespace std;
				1774	const complex<char> should_be_in_rodata (42,-42);
				1775	complex<char> should_be_in_data (42,-42);
				1776	complex<char> should_be_in_bss;
				1777
				1778	Where we currently evaluate the ctors but the globals don't become const because
				1779	the optimizer doesn't know they "become const" after the ctor is done. See
				1780	GCC PR4131 for more examples.
				1781
Chris Lattner	9484689	2010-04-16 23:52:30 +0000	[diff] [blame]	1782	//===---------------------------------------------------------------------===//
				1783
Dan Gohman	3a2a484	2010-05-03 14:31:00 +0000	[diff] [blame]	1784	In this code:
				1785
				1786	long foo(long x) {
				1787	return x > 1 ? x : 1;
				1788	}
				1789
				1790	LLVM emits a comparison with 1 instead of 0. 0 would be equivalent
				1791	and cheaper on most targets.
				1792
				1793	LLVM prefers comparisons with zero over non-zero in general, but in this
case it chooses instead to keep the max operation obvious.
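The equivalence is easy to check in C. The names below are invented; the two
functions differ only in the constant used for the compare, and they agree
because x == 1 yields 1 on either side of the select.

```c
/* As written: compare against 1. */
long foo_cmp1(long x) {
    return x > 1 ? x : 1;
}

/* Equivalent: compare against 0. */
long foo_cmp0(long x) {
    return x > 0 ? x : 1;
}

/* Returns 1 if both forms agree over a range of inputs. */
int max_forms_agree(void) {
    for (long x = -1000; x <= 1000; ++x)
        if (foo_cmp1(x) != foo_cmp0(x))
            return 0;
    return 1;
}
```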
				1795
				1796	//===---------------------------------------------------------------------===//
Eli Friedman	8c47d3b	2010-06-12 05:54:27 +0000	[diff] [blame]	1797
Eli Friedman	b4a74c1	2010-07-03 07:38:12 +0000	[diff] [blame]	1798	Switch lowering generates less than ideal code for the following switch:
				1799	define void @a(i32 %x) nounwind {
				1800	entry:
				1801	switch i32 %x, label %if.end [
				1802	i32 0, label %if.then
				1803	i32 1, label %if.then
				1804	i32 2, label %if.then
				1805	i32 3, label %if.then
				1806	i32 5, label %if.then
				1807	]
				1808	if.then:
				1809	tail call void @foo() nounwind
				1810	ret void
				1811	if.end:
				1812	ret void
				1813	}
				1814	declare void @foo()
				1815
				1816	Generated code on x86-64 (other platforms give similar results):
				1817	a:
				1818	cmpl $5, %edi
				1819	ja .LBB0_2
				1820	movl %edi, %eax
				1821	movl $47, %ecx
				1822	btq %rax, %rcx
				1823	jb .LBB0_3
				1824	.LBB0_2:
				1825	ret
				1826	.LBB0_3:
Eli Friedman	b482829	2010-07-03 08:43:32 +0000	[diff] [blame]	1827	jmp foo # TAILCALL
Eli Friedman	b4a74c1	2010-07-03 07:38:12 +0000	[diff] [blame]	1828
				1829	The movl+movl+btq+jb could be simplified to a cmpl+jne.
				1830
Eli Friedman	b482829	2010-07-03 08:43:32 +0000	[diff] [blame]	1831	Or, if we wanted to be really clever, we could simplify the whole thing to
				1832	something like the following, which eliminates a branch:
				1833	xorl $1, %edi
				1834	cmpl $4, %edi
jbe .LBB0_2
				1836	ret
				1837	.LBB0_2:
				1838	jmp foo # TAILCALL
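The conditions behind both simplifications can be modeled in C (function names
are invented for this sketch). Once x <= 5 is known, the only excluded value is
4, which gives the cmpl+jne form. Flipping the low bit makes 4 and 5 trade
places, so the values that reach foo are exactly those with (x ^ 1) <= 4 as an
unsigned compare.

```c
/* Reference: the case list of the switch (0, 1, 2, 3 and 5 reach foo). */
int calls_foo_switch(unsigned x) {
    switch (x) {
    case 0: case 1: case 2: case 3: case 5:
        return 1;
    default:
        return 0;
    }
}

/* Range check followed by "not 4". */
int calls_foo_cmp(unsigned x) {
    return x <= 5u && x != 4u;
}

/* Flip the low bit, then a single range check. */
int calls_foo_xor(unsigned x) {
    return (x ^ 1u) <= 4u;
}

/* Returns 1 if all three forms agree. */
int switch_forms_agree(void) {
    for (unsigned x = 0; x < 1000u; ++x)
        if (calls_foo_switch(x) != calls_foo_cmp(x) ||
            calls_foo_switch(x) != calls_foo_xor(x))
            return 0;
    return 1;
}
```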
				1839
Eli Friedman	b4a74c1	2010-07-03 07:38:12 +0000	[diff] [blame]	1840	//===---------------------------------------------------------------------===//
Chris Lattner	274191f	2010-11-09 19:37:28 +0000	[diff] [blame]	1841
Chris Lattner	af510f1	2010-11-11 18:23:57 +0000	[diff] [blame]	1842	We compile this:
				1843
				1844	int foo(int a) { return (a & (~15)) / 16; }
				1845
				1846	Into:
				1847
				1848	define i32 @foo(i32 %a) nounwind readnone ssp {
				1849	entry:
				1850	%and = and i32 %a, -16
				1851	%div = sdiv i32 %and, 16
				1852	ret i32 %div
				1853	}
				1854
				1855	but this code (X & -A)/A is X >> log2(A) when A is a power of 2, so this case
				1856	should be instcombined into just "a >> 4".
				1857
				1858	We do get this at the codegen level, so something knows about it, but
				1859	instcombine should catch it earlier:
				1860
				1861	_foo: ## @foo
				1862	## BB#0: ## %entry
				1863	movl %edi, %eax
				1864	sarl $4, %eax
				1865	ret
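A quick C check of the identity for A = 16, with invented function names. It
assumes that right shifts of negative values are arithmetic, which C leaves
implementation-defined but which matches the sarl above.

```c
#include <limits.h>

/* As written: mask off the low bits, then divide. */
int foo_div(int a) {
    return (a & ~15) / 16;
}

/* Simplified: the masked value is a multiple of 16, so the division is exact
   and equals an arithmetic shift of the original value. */
int foo_shift(int a) {
    return a >> 4;
}

/* Returns 1 if both forms agree on a wide range plus the extremes. */
int div_forms_agree(void) {
    for (int a = -100000; a <= 100000; ++a)
        if (foo_div(a) != foo_shift(a))
            return 0;
    return foo_div(INT_MIN) == foo_shift(INT_MIN) &&
           foo_div(INT_MAX) == foo_shift(INT_MAX);
}
```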
				1866
				1867	//===---------------------------------------------------------------------===//
				1868
Chris Lattner	a97c91f	2010-12-13 00:15:25 +0000	[diff] [blame]	1869	This code (from GCC PR28685):
				1870
				1871	int test(int a, int b) {
				1872	int lt = a < b;
				1873	int eq = a == b;
				1874	if (lt)
				1875	return 1;
				1876	return eq;
				1877	}
				1878
				1879	Is compiled to:
				1880
				1881	define i32 @test(i32 %a, i32 %b) nounwind readnone ssp {
				1882	entry:
				1883	%cmp = icmp slt i32 %a, %b
				1884	br i1 %cmp, label %return, label %if.end
				1885
				1886	if.end: ; preds = %entry
				1887	%cmp5 = icmp eq i32 %a, %b
				1888	%conv6 = zext i1 %cmp5 to i32
				1889	ret i32 %conv6
				1890
				1891	return: ; preds = %entry
				1892	ret i32 1
				1893	}
				1894
				1895	it could be:
				1896
				1897	define i32 @test__(i32 %a, i32 %b) nounwind readnone ssp {
				1898	entry:
				1899	%0 = icmp sle i32 %a, %b
				1900	%retval = zext i1 %0 to i32
				1901	ret i32 %retval
				1902	}
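The same simplification at the C level, as a sketch with invented names:
"a < b, or else a == b" is just a <= b.

```c
/* The source from the PR. */
int test_branch(int a, int b) {
    int lt = a < b;
    int eq = a == b;
    if (lt)
        return 1;
    return eq;
}

/* Single signed less-or-equal compare. */
int test_sle(int a, int b) {
    return a <= b;
}

/* Returns 1 if both forms agree over a small grid of inputs. */
int sle_forms_agree(void) {
    for (int a = -50; a <= 50; ++a)
        for (int b = -50; b <= 50; ++b)
            if (test_branch(a, b) != test_sle(a, b))
                return 0;
    return 1;
}
```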
				1903
				1904	//===---------------------------------------------------------------------===//
Duncan Sands	124708d	2011-01-01 20:08:02 +0000	[diff] [blame]	1905
Benjamin Kramer	eaff66a	2011-01-07 20:42:20 +0000	[diff] [blame]	1906	This code can be seen in viterbi:
				1907
				1908	%64 = call noalias i8* @malloc(i64 %62) nounwind
				1909	...
				1910	%67 = call i64 @llvm.objectsize.i64(i8* %64, i1 false) nounwind
				1911	%68 = call i8* @__memset_chk(i8* %64, i32 0, i64 %62, i64 %67) nounwind
				1912
				1913	llvm.objectsize.i64 should be taught about malloc/calloc, allowing it to
				1914	fold to %62. This is a security win (overflows of malloc will get caught)
				1915	and also a performance win by exposing more memsets to the optimizer.
				1916
				1917	This occurs several times in viterbi.
				1918
				1919	Note that this would change the semantics of @llvm.objectsize which by its
				1920	current definition always folds to a constant. We also should make sure that
				1921	we remove checking in code like
				1922
				1923	char *p = malloc(strlen(s)+1);
				1924	__strcpy_chk(p, s, __builtin_objectsize(p, 0));
				1925
				1926	//===---------------------------------------------------------------------===//
				1927
Chris Lattner	c1853e4	2011-01-06 07:09:23 +0000	[diff] [blame]	1928	This code (from Benchmarks/Dhrystone/dry.c):
				1929
				1930	define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
				1931	entry:
				1932	%sext = shl i32 %0, 24
				1933	%conv = ashr i32 %sext, 24
				1934	%sext6 = shl i32 %1, 24
				1935	%conv4 = ashr i32 %sext6, 24
				1936	%cmp = icmp eq i32 %conv, %conv4
				1937	%. = select i1 %cmp, i32 10000, i32 0
				1938	ret i32 %.
				1939	}
				1940
				1941	Should be simplified into something like:
				1942
				1943	define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
				1944	entry:
				1945	%sext = shl i32 %0, 24
				1946	%conv = and i32 %sext, 0xFF000000
				1947	%sext6 = shl i32 %1, 24
				1948	%conv4 = and i32 %sext6, 0xFF000000
				1949	%cmp = icmp eq i32 %conv, %conv4
				1950	%. = select i1 %cmp, i32 10000, i32 0
				1951	ret i32 %.
				1952	}
				1953
				1954	and then to:
				1955
				1956	define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
				1957	entry:
				1958	%conv = and i32 %0, 0xFF
				1959	%conv4 = and i32 %1, 0xFF
				1960	%cmp = icmp eq i32 %conv, %conv4
				1961	%. = select i1 %cmp, i32 10000, i32 0
				1962	ret i32 %.
				1963	}
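A C model of the first and last forms, with invented names. Comparing two
sign-extended bytes for equality gives the same answer as comparing the low
bytes directly. The conversion to int8_t of values above 127 is
implementation-defined in C; the sketch assumes the usual wrapping behavior.

```c
#include <stdint.h>

/* shl 24 + ashr 24 on each argument: sign-extend the low byte, then compare. */
int func1_sext(uint32_t a, uint32_t b) {
    int32_t ea = (int8_t)(uint8_t)a;
    int32_t eb = (int8_t)(uint8_t)b;
    return ea == eb ? 10000 : 0;
}

/* Final form: mask the low byte of each argument, then compare. */
int func1_mask(uint32_t a, uint32_t b) {
    return (a & 0xFFu) == (b & 0xFFu) ? 10000 : 0;
}

/* Returns 1 if both forms agree for every pair of low bytes. */
int func1_forms_agree(void) {
    for (uint32_t a = 0; a < 256; ++a)
        for (uint32_t b = 0; b < 256; ++b)
            if (func1_sext(a | 0xABCD00u, b) != func1_mask(a | 0xABCD00u, b))
                return 0;
    return 1;
}
```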
				1964	//===---------------------------------------------------------------------===//
Chris Lattner	15df044	2011-01-01 22:57:31 +0000	[diff] [blame]	1965
Benjamin Kramer	fa36680	2011-01-06 17:35:50 +0000	[diff] [blame]	1966	clang -O3 currently compiles this code
				1967
				1968	int g(unsigned int a) {
				1969	unsigned int c[100];
				1970	c[10] = a;
				1971	c[11] = a;
				1972	unsigned int b = c[10] + c[11];
				1973	if(b > a*2) a = 4;
				1974	else a = 8;
				1975	return a + 7;
				1976	}
				1977
				1978	into
				1979
define i32 @g(i32 %a) nounwind readnone {
				1981	%add = shl i32 %a, 1
				1982	%mul = shl i32 %a, 1
				1983	%cmp = icmp ugt i32 %add, %mul
				1984	%a.addr.0 = select i1 %cmp, i32 11, i32 15
				1985	ret i32 %a.addr.0
				1986	}
				1987
				1988	The icmp should fold to false. This CSE opportunity is only available
				1989	after GVN and InstCombine have run.
				1990
				1991	//===---------------------------------------------------------------------===//
Chris Lattner	01cdc20	2011-01-06 22:25:00 +0000	[diff] [blame]	1992
				1993	memcpyopt should turn this:
				1994
				1995	define i8* @test10(i32 %x) {
				1996	%alloc = call noalias i8* @malloc(i32 %x) nounwind
				1997	call void @llvm.memset.p0i8.i32(i8* %alloc, i8 0, i32 %x, i32 1, i1 false)
				1998	ret i8* %alloc
				1999	}
				2000
				2001	into a call to calloc. We should make sure that we analyze calloc as
				2002	aggressively as malloc though.
				2003
				2004	//===---------------------------------------------------------------------===//
Chandler Carruth	75fbd37	2011-01-09 01:32:55 +0000	[diff] [blame]	2005
Chris Lattner	4a6fb94	2011-01-10 21:01:17 +0000	[diff] [blame]	2006	clang -O3 doesn't optimize this:
Chandler Carruth	75fbd37	2011-01-09 01:32:55 +0000	[diff] [blame]	2007
				2008	void f1(int* begin, int* end) {
				2009	std::fill(begin, end, 0);
				2010	}
				2011
Chris Lattner	66d7a57	2011-01-09 23:48:41 +0000	[diff] [blame]	2012	into a memset. This is PR8942.
Chandler Carruth	75fbd37	2011-01-09 01:32:55 +0000	[diff] [blame]	2013
				2014	//===---------------------------------------------------------------------===//
Chandler Carruth	d8723a9	2011-01-09 09:58:33 +0000	[diff] [blame]	2015
				2016	clang -O3 -fno-exceptions currently compiles this code:
				2017
				2018	void f(int N) {
				2019	std::vector<int> v(N);
Chandler Carruth	27a2a13	2011-01-09 10:10:59 +0000	[diff] [blame]	2020
				2021	extern void sink(void*); sink(&v);
Chandler Carruth	d8723a9	2011-01-09 09:58:33 +0000	[diff] [blame]	2022	}
				2023
				2024	into
				2025
				2026	define void @_Z1fi(i32 %N) nounwind {
				2027	entry:
				2028	%v2 = alloca [3 x i32*], align 8
%v2.sub = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 0
%tmpcast = bitcast [3 x i32*]* %v2 to %"class.std::vector"*
				2031	%conv = sext i32 %N to i64
				2032	store i32* null, i32** %v2.sub, align 8, !tbaa !0
%tmp3.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 1
				2034	store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
%tmp4.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 2
				2036	store i32* null, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
				2037	%cmp.i.i.i.i = icmp eq i32 %N, 0
				2038	br i1 %cmp.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i, label %cond.true.i.i.i.i
				2039
				2040	_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i: ; preds = %entry
				2041	store i32* null, i32** %v2.sub, align 8, !tbaa !0
				2042	store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
				2043	%add.ptr.i5.i.i = getelementptr inbounds i32* null, i64 %conv
				2044	store i32* %add.ptr.i5.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
				2045	br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit
				2046
				2047	cond.true.i.i.i.i: ; preds = %entry
				2048	%cmp.i.i.i.i.i = icmp slt i32 %N, 0
				2049	br i1 %cmp.i.i.i.i.i, label %if.then.i.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i
				2050
				2051	if.then.i.i.i.i.i: ; preds = %cond.true.i.i.i.i
				2052	call void @_ZSt17__throw_bad_allocv() noreturn nounwind
				2053	unreachable
				2054
				2055	_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i: ; preds = %cond.true.i.i.i.i
				2056	%mul.i.i.i.i.i = shl i64 %conv, 2
				2057	%call3.i.i.i.i.i = call noalias i8* @_Znwm(i64 %mul.i.i.i.i.i) nounwind
				2058	%0 = bitcast i8* %call3.i.i.i.i.i to i32*
				2059	store i32* %0, i32** %v2.sub, align 8, !tbaa !0
				2060	store i32* %0, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
				2061	%add.ptr.i.i.i = getelementptr inbounds i32* %0, i64 %conv
				2062	store i32* %add.ptr.i.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
				2063	call void @llvm.memset.p0i8.i64(i8* %call3.i.i.i.i.i, i8 0, i64 %mul.i.i.i.i.i, i32 4, i1 false)
				2064	br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit
				2065
This is just the handling of the construction of the vector. Most surprising here
is the fact that all three null stores in %entry are dead (because we do no
cross-block DSE).
				2069
Chandler Carruth	d8723a9	2011-01-09 09:58:33 +0000	[diff] [blame]	2070	Also surprising is that %conv isn't simplified to 0 in %....exit.thread.i.i.
This is because the client of LazyValueInfo doesn't simplify all instruction
				2072	operands, just selected ones.
Chandler Carruth	d8723a9	2011-01-09 09:58:33 +0000	[diff] [blame]	2073
				2074	//===---------------------------------------------------------------------===//
Chandler Carruth	e5ca494	2011-01-09 09:58:36 +0000	[diff] [blame]	2075
				2076	clang -O3 -fno-exceptions currently compiles this code:
				2077
Chandler Carruth	cad33c6	2011-01-16 01:40:23 +0000	[diff] [blame]	2078	void f(char* a, int n) {
				2079	__builtin_memset(a, 0, n);
				2080	for (int i = 0; i < n; ++i)
				2081	a[i] = 0;
Chandler Carruth	e5ca494	2011-01-09 09:58:36 +0000	[diff] [blame]	2082	}
				2083
Chandler Carruth	cad33c6	2011-01-16 01:40:23 +0000	[diff] [blame]	2084	into:
Chandler Carruth	e5ca494	2011-01-09 09:58:36 +0000	[diff] [blame]	2085
Chandler Carruth	cad33c6	2011-01-16 01:40:23 +0000	[diff] [blame]	2086	define void @_Z1fPci(i8* nocapture %a, i32 %n) nounwind {
				2087	entry:
				2088	%conv = sext i32 %n to i64
				2089	tail call void @llvm.memset.p0i8.i64(i8* %a, i8 0, i64 %conv, i32 1, i1 false)
				2090	%cmp8 = icmp sgt i32 %n, 0
				2091	br i1 %cmp8, label %for.body.lr.ph, label %for.end
Chandler Carruth	e5ca494	2011-01-09 09:58:36 +0000	[diff] [blame]	2092
Chandler Carruth	cad33c6	2011-01-16 01:40:23 +0000	[diff] [blame]	2093	for.body.lr.ph: ; preds = %entry
				2094	%tmp10 = add i32 %n, -1
				2095	%tmp11 = zext i32 %tmp10 to i64
				2096	%tmp12 = add i64 %tmp11, 1
				2097	call void @llvm.memset.p0i8.i64(i8* %a, i8 0, i64 %tmp12, i32 1, i1 false)
				2098	ret void
Chandler Carruth	e5ca494	2011-01-09 09:58:36 +0000	[diff] [blame]	2099
Chandler Carruth	cad33c6	2011-01-16 01:40:23 +0000	[diff] [blame]	2100	for.end: ; preds = %entry
				2101	ret void
				2102	}
				2103
This shouldn't need the ((zext (%n - 1)) + 1) game, and it should ideally fold
the two memsets together. The issue with %n seems to stem from poor handling
of the original loop.
Chandler Carruth	e5ca494	2011-01-09 09:58:36 +0000	[diff] [blame]	2107
Chris Lattner	b0daffc	2011-01-16 06:39:44 +0000	[diff] [blame]	2108	To simplify this, we need SCEV to know that "n != 0" because of the dominating
				2109	conditional. That would turn the second memset into a simple memset of 'n'.
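To see why the dominating check on n matters, here is a C model of the two
length computations (function names are invented for this sketch). For n > 0
the emitted expression equals the sign-extended n, but for n == 0 it would
evaluate to 2^32, so the simplification is only valid once n is known to be
non-zero.

```c
#include <stdint.h>

/* Length as emitted: (zext (n - 1)) + 1. */
uint64_t len_as_emitted(int32_t n) {
    return (uint64_t)(uint32_t)(n - 1) + 1u;
}

/* Length after simplification: sext n. */
uint64_t len_simplified(int32_t n) {
    return (uint64_t)(int64_t)n;
}

/* Returns 1 if both lengths agree for positive n. */
int lengths_agree_for_positive_n(void) {
    for (int32_t n = 1; n <= 100000; ++n)
        if (len_as_emitted(n) != len_simplified(n))
            return 0;
    return len_as_emitted(INT32_MAX) == len_simplified(INT32_MAX);
}
```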
				2110
Chandler Carruth	e5ca494	2011-01-09 09:58:36 +0000	[diff] [blame]	2111	//===---------------------------------------------------------------------===//
Chandler Carruth	694d753	2011-01-09 11:29:57 +0000	[diff] [blame]	2112
				2113	clang -O3 -fno-exceptions currently compiles this code:
				2114
				2115	struct S {
				2116	unsigned short m1, m2;
				2117	unsigned char m3, m4;
				2118	};
				2119
				2120	void f(int N) {
				2121	std::vector<S> v(N);
				2122	extern void sink(void*); sink(&v);
				2123	}
				2124
				2125	into poor code for zero-initializing 'v' when N is >0. The problem is that
				2126	S is only 6 bytes, but each element is 8-byte aligned. We generate a loop and
				2127	4 stores on each iteration. If the struct were 8 bytes, this would get turned
				2128	into a memset.
				2129
Chris Lattner	b0daffc	2011-01-16 06:39:44 +0000	[diff] [blame]	2130	In order to handle this we have to:
				2131	A) Teach clang to generate metadata for memsets of structs that have holes in
				2132	them.
				2133	B) Teach clang to use such a memset for zero init of this struct (since it has
				2134	a hole), instead of doing elementwise zeroing.
				2135
Chandler Carruth	694d753	2011-01-09 11:29:57 +0000	[diff] [blame]	2136	//===---------------------------------------------------------------------===//
Chandler Carruth	96b1b6c	2011-01-09 21:00:19 +0000	[diff] [blame]	2137
				2138	clang -O3 currently compiles this code:
				2139
				2140	extern const int magic;
				2141	double f() { return 0.0 * magic; }
				2142
				2143	into
				2144
				2145	@magic = external constant i32
				2146
				2147	define double @_Z1fv() nounwind readnone {
				2148	entry:
				2149	%tmp = load i32* @magic, align 4, !tbaa !0
				2150	%conv = sitofp i32 %tmp to double
				2151	%mul = fmul double %conv, 0.000000e+00
				2152	ret double %mul
				2153	}
				2154
Chris Lattner	00a35d0	2011-01-10 00:33:01 +0000	[diff] [blame]	2155	We should be able to fold away this fmul to 0.0. More generally, fmul(x,0.0)
				2156	can be folded to 0.0 if we can prove that the LHS is not negative (including
				2157	-0.0), not a NaN, and not an INF. The CannotBeNegativeZero predicate in value
				2158	tracking should be extended to support general "fpclassify" operations that
				2159	can return yes/no/unknown for each of these predicates.
				2160
Chris Lattner	4a6fb94	2011-01-10 21:01:17 +0000	[diff] [blame]	2161	In this predicate, we know that uitofp is trivially never NaN or -0.0, and
Chris Lattner	00a35d0	2011-01-10 00:33:01 +0000	[diff] [blame]	2162	we know that it isn't +/-Inf if the floating point type has enough exponent bits
				2163	to represent the largest integer value as < inf.
Chandler Carruth	96b1b6c	2011-01-09 21:00:19 +0000	[diff] [blame]	2164
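The conditions on the LHS can be observed directly with IEEE doubles. This is a small C++ sketch (the helper name mul_zero_is_pos_zero is hypothetical, and it assumes the compiler is not run in a fast-math mode that folds the multiply itself):

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// True when x * 0.0 evaluates to exactly +0.0 (hypothetical helper name).
bool mul_zero_is_pos_zero(double x) {
  double r = x * 0.0;
  // r == 0.0 rejects NaN results; signbit rejects -0.0.
  return r == 0.0 && !std::signbit(r);
}
```

A value converted from an unsigned 32-bit integer always passes this check for double, since it is never negative, NaN, or infinite. A negative value, a NaN, or an infinity as the LHS fails it.
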
				2165	//===---------------------------------------------------------------------===//
Chandler Carruth	fb00e27	2011-01-09 22:36:18 +0000	[diff] [blame]	2166
Chris Lattner	4a6fb94	2011-01-10 21:01:17 +0000	[diff] [blame]	2167	When performing a transformation that can change the sign of 0.0 (such as the
				2168	0.0*val -> 0.0 transformation above), it might be provable that the sign of the
				2169	expression doesn't matter. For example, by the above rules, we can't transform
				2170	fmul(sitofp(x), 0.0) into 0.0, because x might be -1 and the result of the
				2171	expression is defined to be -0.0.
				2172
				2173	If we look at the uses of the fmul for example, we might be able to prove that
				2174	none of the uses care about the sign of zero. For example, if we have:
				2175
				2176	fadd(fmul(sitofp(x), 0.0), 2.0)
				2177
				2178	we know that X+2.0 doesn't care about the sign of a zero in X, so we can
				2179	transform the fmul to 0.0, and then the fadd to 2.0.
				2180
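The claim about the fadd can be checked with a short C++ sketch. The helper names add_two and same_value_and_sign are hypothetical, chosen only for this illustration:

```cpp
#include <cassert>
#include <cmath>

// Models fadd(z, 2.0) for a zero z of either sign (hypothetical helper name).
double add_two(double z) { return z + 2.0; }

// True when a and b compare equal and also carry the same sign bit.
bool same_value_and_sign(double a, double b) {
  return a == b && std::signbit(a) == std::signbit(b);
}
```

Here +0.0 and -0.0 are distinguishable on their own, but add_two produces 2.0 for both, so a use of this shape cannot observe which zero the fmul produced.
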
				2181	//===---------------------------------------------------------------------===//
Chris Lattner	4cd18f9	2011-01-13 22:08:15 +0000	[diff] [blame]	2182
				2183	We should enhance memcpy/memmove/memset to allow a metadata node on them
				2184	indicating that some bytes of the transfer are undefined. This is useful for
Chris Lattner	4c5456a	2011-01-13 22:11:56 +0000	[diff] [blame]	2185	frontends like clang when lowering struct copies, when some elements of the
Chris Lattner	4cd18f9	2011-01-13 22:08:15 +0000	[diff] [blame]	2186	struct are undefined. Consider something like this:
				2187
				2188	struct x {
				2189	char a;
				2190	int b[4];
				2191	};
				2192	void foo(struct x*P);
				2193	struct x testfunc() {
				2194	struct x V1, V2;
				2195	foo(&V1);
				2196	V2 = V1;
				2197
				2198	return V2;
				2199	}
				2200
				2201	We currently compile this to:
				2202	$ clang t.c -S -o - -O0 -emit-llvm | opt -scalarrepl -S
				2203
				2204
				2205	%struct.x = type { i8, [4 x i32] }
				2206
				2207	define void @testfunc(%struct.x* sret %agg.result) nounwind ssp {
				2208	entry:
				2209	%V1 = alloca %struct.x, align 4
				2210	call void @foo(%struct.x* %V1)
				2211	%tmp1 = bitcast %struct.x* %V1 to i8*
				2212	%0 = bitcast %struct.x* %V1 to i160*
				2213	%srcval1 = load i160* %0, align 4
				2214	%tmp2 = bitcast %struct.x* %agg.result to i8*
				2215	%1 = bitcast %struct.x* %agg.result to i160*
				2216	store i160 %srcval1, i160* %1, align 4
				2217	ret void
				2218	}
				2219
				2220	This happens because SRoA sees that the temp alloca is being memcpy'd into and
				2221	out of, and because it has holes SRoA has to be conservative. If we knew about
				2222	the holes, then this could be much better.
				2223
				2224	Having information about these holes would also improve memcpy (etc) lowering at
				2225	llc time when it gets inlined, because we can use smaller transfers. This also
				2226	avoids partial register stalls in some important cases.
				2227
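To make the "holes" concrete, here is a minimal C++ sketch that measures the padding in the struct from the example. The helper name hole_bytes is hypothetical, and the exact numbers depend on the target ABI. On a typical target with 4-byte, 4-byte-aligned int, the struct is 20 bytes (the i160 above) with 3 undefined padding bytes after 'a'.

```cpp
#include <cassert>
#include <cstddef>

struct x {
  char a;
  int b[4];
};

// Number of padding bytes between 'a' and 'b' (hypothetical helper name).
constexpr std::size_t hole_bytes() {
  return offsetof(x, b) - (offsetof(x, a) + sizeof(char));
}
```

Those padding bytes are what the proposed metadata would describe: a copy that knows about them only has to transfer 'a' and the four elements of 'b'.
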
				2228	//===---------------------------------------------------------------------===//
Chris Lattner	5653f1f	2011-02-16 19:16:34 +0000	[diff] [blame]	2229
Chris Lattner	cb40195	2011-02-17 01:43:46 +0000	[diff] [blame]	2230	We don't fold (icmp (add) (add)) unless the two adds only have a single use.
				2231	There are a lot of cases that we're refusing to fold in benchmarks such as
				2232	256.bzip2, for example:
				2233
				2234	%indvar.next90 = add i64 %indvar89, 1 ;; Has 2 uses
				2235	%tmp96 = add i64 %tmp95, 1 ;; Has 1 use
				2236	%exitcond97 = icmp eq i64 %indvar.next90, %tmp96
				2237
				2238	We don't fold this because we don't want to introduce an overlapped live range
				2239	of the ivar. However, we can make this more aggressive without causing
				2240	performance issues in two ways:
				2241
				2242	1. If either the LHS or RHS has a single use, we can definitely do the
				2243	transformation. In the overlapping liverange case we're trading one register
				2244	use for one fewer operation, which is a reasonable trade. Before doing this
				2245	we should verify that the llc output actually shrinks for some benchmarks.
				2246	2. If both ops have multiple uses, we can still fold it if the operations are
				2247	both sinkable to after the icmp (e.g. in a subsequent block) which doesn't
				2248	increase register pressure.
				2249
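The fold itself is value-preserving even when the adds wrap. A minimal C++ sketch of the two forms follows, using unsigned 64-bit arithmetic to model i64 wraparound. The helper names cmp_unfolded and cmp_folded are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Unfolded form: icmp eq (add a, 1), (add b, 1), with i64 wraparound.
bool cmp_unfolded(std::uint64_t a, std::uint64_t b) { return a + 1 == b + 1; }

// Folded form: icmp eq a, b.
bool cmp_folded(std::uint64_t a, std::uint64_t b) { return a == b; }
```

Both functions return the same result for all inputs, including the maximum value where the add wraps to 0. The open question in the notes above is therefore only about live ranges and register pressure, not correctness.
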
				2250	There are a ton of icmp's we aren't simplifying because of the reg pressure
				2251	concern. Care is warranted here though because many of these are induction
				2252	variables and other cases that matter a lot to performance, like the above.
				2253	Here's a blob of code that you can drop into the bottom of visitICmp to see some
				2254	missed cases:
				2255
				2256	{ Value *A, *B, *C, *D;
				2257	if (match(Op0, m_Add(m_Value(A), m_Value(B))) &&
				2258	match(Op1, m_Add(m_Value(C), m_Value(D))) &&
				2259	(A == C || A == D || B == C || B == D)) {
				2260	errs() << "OP0 = " << *Op0 << " U=" << Op0->getNumUses() << "\n";
				2261	errs() << "OP1 = " << *Op1 << " U=" << Op1->getNumUses() << "\n";
				2262	errs() << "CMP = " << I << "\n\n";
				2263	}
				2264	}
				2265
				2266	//===---------------------------------------------------------------------===//
				2267
				2268