Blame - llvm/lib/Target/README.txt - toolchain/llvm-project

Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

We should recognize various "overflow detection" idioms and translate them into
llvm.uadd.with.overflow and similar intrinsics.  Here is a multiply idiom:

unsigned int mul(unsigned int a,unsigned int b) {
  if ((unsigned long long)a*b>0xffffffff)
    exit(0);
  return a*b;
}

The legalization code for mul-with-overflow needs to be made more robust before
this can be implemented though.
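
As a rough illustration of the rewrite described above, the sketch below puts
the widening-multiply idiom next to __builtin_umul_overflow, the GCC/Clang
builtin that is expected to map onto an overflow-checking multiply.  The names
mul_idiom and mul_builtin are made up for this example, a 32-bit unsigned int
is assumed, and both functions report overflow through their return value
instead of calling exit() so that they can be compared directly.

```c
#include <stdbool.h>

/* The idiom from the note: widen to 64 bits and compare against the
   largest 32-bit value.  Returns false on overflow. */
bool mul_idiom(unsigned a, unsigned b, unsigned *out) {
  if ((unsigned long long)a * b > 0xffffffffULL)
    return false;
  *out = a * b;
  return true;
}

/* The same check expressed with the overflow builtin, which returns
   true when the product does not fit in *out. */
bool mul_builtin(unsigned a, unsigned b, unsigned *out) {
  return !__builtin_umul_overflow(a, b, out);
}
```

Both functions should agree for every pair of inputs; recognizing the first
form would let the optimizer treat it like the second.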

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)  This
isn't safe in general, even on darwin.  See the libm implementation of hypot
for examples (which special case when x/y are exactly zero to get signed zeros
etc right).

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
   x = 1ULL << i;
}

into:
 long long tmp = 1;
 for (i = ...; ++i, tmp+=tmp)
   x = tmp;

This would be a win on ppc32, but not x86 or ppc64.
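
A minimal runnable sketch of the two loop shapes, assuming the loop writes its
values into an array (the names fill_shift and fill_add are hypothetical and
only serve to make the comparison testable):

```c
#include <stdint.h>

/* Original form: one 64-bit variable shift per iteration. */
void fill_shift(uint64_t *x, unsigned n) {
  for (unsigned i = 0; i < n; ++i)
    x[i] = 1ULL << i;
}

/* Strength-reduced form: keep a running value and double it with a
   64-bit add, which is cheap even on 32-bit targets. */
void fill_add(uint64_t *x, unsigned n) {
  uint64_t tmp = 1;
  for (unsigned i = 0; i < n; ++i, tmp += tmp)
    x[i] = tmp;
}
```

For n up to 64 both functions store the same values; the final doubling wraps
to zero, which is well defined for unsigned arithmetic.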

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

First, the intrinsic needs to be extended to support integers, and second the
code generator needs to be enhanced to lower these to multiplication trees.
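
To make "balanced multiplication tree" concrete, here is a small sketch for the
eighth power.  The function names are invented for the example; the point is
only that the chain needs seven dependent multiplies while the tree needs
three.

```c
/* As written in the source: a chain of seven multiplies. */
int pow8_chain(int x) {
  return x * x * x * x * x * x * x * x;
}

/* Balanced form: square, square again, square once more. */
int pow8_tree(int x) {
  int x2 = x * x;   /* x^2 */
  int x4 = x2 * x2; /* x^4 */
  return x4 * x4;   /* x^8 */
}
```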

//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

This is blocked on not handling X*X*X -> powi(X, 3) (see note above).  The issue
is that we end up getting t = 2*X and s = t*t, and don't turn this into 4*X*X,
which is the same number of multiplies and is canonical, because the 2*X has
multiple uses.  Here's a simple example:

define i32 @test15(i32 %X1) {
  %B = mul i32 %X1, 47   ; X1*47
  %C = mul i32 %B, %B
  ret i32 %C
}

//===---------------------------------------------------------------------===//

Reassociate should handle the example in GCC PR16157:

extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;
void f () {  /* this can be optimized to four additions... */
        b4 = a4 + a3 + a2 + a1 + a0;
        b3 = a3 + a2 + a1 + a0;
        b2 = a2 + a1 + a0;
        b1 = a1 + a0;
}

This requires reassociating to forms of expressions that are already available,
something that reassoc doesn't think about yet.
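
The four-addition form mentioned in the comment can be sketched as follows.
The globals are defined locally here instead of being extern, and the names
f_naive and f_reuse are hypothetical; evaluated independently the four sums
take ten additions, while the reordered version reuses each previous sum.

```c
int a0, a1, a2, a3, a4;
int b1, b2, b3, b4;

/* As written: 4 + 3 + 2 + 1 = 10 additions. */
void f_naive(void) {
  b4 = a4 + a3 + a2 + a1 + a0;
  b3 = a3 + a2 + a1 + a0;
  b2 = a2 + a1 + a0;
  b1 = a1 + a0;
}

/* Reassociated so that each sum builds on one already available. */
void f_reuse(void) {
  b1 = a1 + a0;
  b2 = a2 + b1;
  b3 = a3 + b2;
  b4 = a4 + b3;
}
```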

//===---------------------------------------------------------------------===//

This function: (derived from GCC PR19988)
double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x + -0.1234 * y));
}

compiles to:
_foo:
	movapd	%xmm1, %xmm2
	mulsd	LCPI1_1(%rip), %xmm1
	mulsd	LCPI1_0(%rip), %xmm2
	addsd	%xmm0, %xmm1
	addsd	%xmm0, %xmm2
	movapd	%xmm1, %xmm0
	mulsd	%xmm2, %xmm0
	ret

Reassociate should be able to turn it into:

double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x - 0.1234 * y));
}

Which allows the multiply by constant to be CSE'd, producing:

_foo:
	mulsd	LCPI1_0(%rip), %xmm1
	movapd	%xmm1, %xmm2
	addsd	%xmm0, %xmm2
	subsd	%xmm1, %xmm0
	mulsd	%xmm2, %xmm0
	ret

This doesn't need -ffast-math support at all.  This is particularly bad because
the llvm-gcc frontend is canonicalizing the latter into the former, but clang
doesn't have this problem.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j,int *l)  {  return memcmp(j,l,4);  }
int h(int *j, int *l) {  return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works but can be overly conservative as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

	movl 136(%esp), %eax
	cmpl $0, %eax
	je LBB16_2	#cond_next
LBB16_1:	#cond_true
	incl _foo
LBB16_2:	#cond_next

emit:
	movl	_foo, %eax
	cmpl	$1, %edi
	sbbl	$-1, %eax
	movl	%eax, _foo
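
At the source level the two sequences correspond roughly to the sketch below.
In the second sequence, cmpl $1 sets the carry flag only when the value is
zero, and sbbl $-1 then computes foo + 1 - carry, so the net effect is adding
the 0/1 result of "x != 0".  The names foo_counter, inc_branch and
inc_branchless are assumptions made for this example.

```c
int foo_counter;

/* Branchy form: increment only when x is nonzero. */
void inc_branch(int x) {
  if (x)
    foo_counter++;
}

/* Branch-free form: add the 0/1 result of the comparison. */
void inc_branchless(int x) {
  foo_counter += (x != 0);
}
```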

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs.  We could even make an intrinsic for this
if anyone cared enough about sincos.

//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
	{
	  /* Flip the target bit of each basis state */
	  reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
	}

Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be just
so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias
reg->node[i], but this requires TBAA.

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}
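
For readers who want to convince themselves, the sketch below restates the
function with fixed-width types and places it next to __builtin_bswap32, the
GCC/Clang byte-swap builtin.  The wrapper name reverse_builtin is made up, and
a 32-bit value is assumed throughout.

```c
#include <stdint.h>

/* The function from the note, using uint32_t to pin down the width. */
uint32_t reverse(uint32_t v) {
  uint32_t t;
  t = v ^ ((v << 16) | (v >> 16)); /* v xor (v rotated by 16) */
  t &= ~(uint32_t)0xff0000;        /* drop the second-highest byte */
  v = (v << 24) | (v >> 8);        /* rotate right by 8 */
  return v ^ (t >> 8);             /* patch the two middle bytes */
}

/* What instcombine should ideally reduce it to. */
uint32_t reverse_builtin(uint32_t v) {
  return __builtin_bswap32(v);
}
```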

//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

typedef unsigned long long BITBOARD;
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i =  0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

This sort of thing should be added to the loop idiom pass.
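
As a small check of what "recognized as popcount" would mean, the sketch below
compares one of the loops with __builtin_popcount, the GCC/Clang builtin.  The
wrapper name countbits_builtin is hypothetical.

```c
/* Loop idiom from the note: clear the lowest set bit until none remain. */
unsigned countbits_fast(unsigned v) {
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1;
  return c;
}

/* The single operation the loop is equivalent to. */
unsigned countbits_builtin(unsigned v) {
  return (unsigned)__builtin_popcount(v);
}
```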

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X / C1 ), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

//===---------------------------------------------------------------------===//

[LOOP OPTIMIZATION]

SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization
opportunities in its double_array_divs_variable function: it needs loop
interchange, memory promotion (which LICM already does), vectorization and
variable trip count loop unrolling (since it has a constant trip count).  ICC
apparently produces this very nice code with -ffast-math:

..B1.70:                        # Preds ..B1.70 ..B1.69
       mulpd     %xmm0, %xmm1                                  #108.2
       mulpd     %xmm0, %xmm1                                  #108.2
       mulpd     %xmm0, %xmm1                                  #108.2
       mulpd     %xmm0, %xmm1                                  #108.2
       addl      $8, %edx                                      #
       cmpl      $131072, %edx                                 #108.2
       jb        ..B1.70       # Prob 99%                      #108.2

It would be better to count down to zero, but this is a lot better than what we
do.

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

PHI Slicing could be extended to do this.

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has from TargetData.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two near identical IV's (after promotion) on PPC/ARM:

LBB1_2:
	ldr r3, LCPI1_0
	ldr r3, [r3]
	strh r2, [r3]
	ldr r3, LCPI1_1
	ldr r3, [r3]
	strh r1, [r3]
	add r1, r1, #4
	add r2, r2, #1   <- [0,+,1]
	sub r0, r0, #1   <- [0,-,1]
	cmp r0, #0
	bne LBB1_2

LSR should reuse the "+" IV for the exit test.

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
	%tmp.1 = and i32 %a, 1		; <i32> [#uses=1]
	%tmp.2 = icmp ne i32 %tmp.1, 0		; <i1> [#uses=1]
	br i1 %tmp.2, label %then.0, label %else.0

then.0:		; preds = %entry
	%tmp.5 = add i32 %a, -1		; <i32> [#uses=1]
	%tmp.3 = call i32 @t4( i32 %tmp.5 )		; <i32> [#uses=1]
	br label %return

else.0:		; preds = %entry
	%tmp.7 = icmp ne i32 %a, 0		; <i1> [#uses=1]
	br i1 %tmp.7, label %then.1, label %return

then.1:		; preds = %else.0
	%tmp.11 = add i32 %a, -2		; <i32> [#uses=1]
	%tmp.9 = call i32 @t4( i32 %tmp.11 )		; <i32> [#uses=1]
	br label %return

return:		; preds = %then.1, %else.0, %then.0
	%result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                            [ %tmp.9, %then.1 ]
	ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
 if (n == 0)
   return 0;
 return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative.  "return foo() << 1" can be tail recursion eliminated.
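
One way to picture the result is a loop with two accumulators, sketched below
under the assumption that n is non-negative and small enough not to overflow.
The loop keeps the invariant "final result = mul * pow2m1(n) + add"; the name
pow2m1_loop and both accumulator names are invented for this example and do not
describe how the pass is actually implemented.

```c
/* Recursive form from the note. */
int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1(n - 1) + 1;
}

/* Loop form with a pending multiplier and a pending addend. */
int pow2m1_loop(int n) {
  int mul = 1; /* multiplier still to be applied to the recursive result */
  int add = 0; /* sum of the "+ 1" terms, each scaled by its multiplier */
  while (n != 0) {
    add += mul; /* the "+ 1" of this level, scaled by the outer factor */
    mul *= 2;   /* the "2 *" of this level */
    --n;
  }
  return add;   /* base case contributes mul * 0 */
}
```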

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
	%tmp = load i32* %x		; <i32> [#uses=0]
	%tmp.foo = call i32 @foo( i32* %x )		; <i32> [#uses=1]
	ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
	%tmp3 = call i32 @foo( i32* %x )		; <i32> [#uses=1]
	ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
	subl	$28, %esp
	call	"L1$pb"
"L1$pb":
	popl	%eax
	cmpl	$0, 32(%esp)
	je	LBB1_2	# cond_true
LBB1_1:	# return
	# ...
	addl	$28, %esp
	ret
LBB1_2:	# cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations.  On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
  for (i = 0; i < 256; i++)
    a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
  for (i = 0; i < 256; i++)
    a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar.  On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.
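
The equivalence the note relies on can be sketched for a single value as
follows, assuming IEEE-754 binary64 doubles.  The function names negate_fp and
negate_int are made up for the example; memcpy is used for the bit copy to stay
within defined behavior.

```c
#include <stdint.h>
#include <string.h>

/* Floating-point negation. */
double negate_fp(double d) {
  return -d;
}

/* The same operation on the integer representation: flip the sign bit. */
double negate_int(double d) {
  uint64_t bits;
  memcpy(&bits, &d, sizeof bits);
  bits ^= 1ULL << 63;
  memcpy(&d, &bits, sizeof d);
  return d;
}
```

The fabs case is analogous, clearing the sign bit with an AND instead of
flipping it.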

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable.  For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-m64 -O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:                         ## @_Z9GetHotKeyv
	movq	_m_HotKey@GOTPCREL(%rip), %rax
	movzwl	(%rax), %ecx
	movzbl	2(%rax), %edx
	shlq	$16, %rdx
	orq	%rcx, %rdx
	movzbl	3(%rax), %ecx
	shlq	$24, %rcx
	orq	%rdx, %rcx
	movzbl	4(%rax), %eax
	shlq	$32, %rax
	orq	%rcx, %rax
	ret

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,0,1,0,1,0,1,0};
  foo(input);
}

Clang compiles this into:

  call void @llvm.memset.p0i8.i64(i8* %tmp, i8 0, i64 64, i32 16, i1 false)
  %0 = getelementptr [8 x i64]* %input, i64 0, i64 0
  store i64 1, i64* %0, align 16
  %1 = getelementptr [8 x i64]* %input, i64 0, i64 2
  store i64 1, i64* %1, align 16
  %2 = getelementptr [8 x i64]* %input, i64 0, i64 4
  store i64 1, i64* %2, align 16
  %3 = getelementptr [8 x i64]* %input, i64 0, i64 6
  store i64 1, i64* %3, align 16

Which gets codegen'd into:

	pxor	%xmm0, %xmm0
	movaps	%xmm0, -16(%rbp)
	movaps	%xmm0, -32(%rbp)
	movaps	%xmm0, -48(%rbp)
	movaps	%xmm0, -64(%rbp)
	movq	$1, -64(%rbp)
	movq	$1, -48(%rbp)
	movq	$1, -32(%rbp)
	movq	$1, -16(%rbp)

It would be better to have 4 movq's of 0 instead of the movaps's.

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef".  Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop.  One trivial example is:

#include <stdio.h>
int main() {
    int nRet = 17;
    int nLoop;
    for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
        if ( nLoop & 1 )
            nRet += 2;
        else
            nRet -= 1;
    }
    return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size.  The resultant code would then also be suitable for
exit value computation.
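
Written out by hand, the unrolled loop might look like the sketch below.  The
function names loop_rolled and loop_unrolled are hypothetical; the loop bodies
come from the example above.  Once the '&1' test is gone, each unrolled
iteration is a net "+1", which is what makes the exit value computable.

```c
/* The loop as written in the note. */
int loop_rolled(void) {
  int nRet = 17;
  for (int nLoop = 0; nLoop < 1000; nLoop++) {
    if (nLoop & 1)
      nRet += 2;
    else
      nRet -= 1;
  }
  return nRet;
}

/* Unrolled by 2: nLoop is always even at the top of the body, so the
   parity test folds away in both copies. */
int loop_unrolled(void) {
  int nRet = 17;
  for (int nLoop = 0; nLoop < 1000; nLoop += 2) {
    nRet -= 1; /* even iteration */
    nRet += 2; /* odd iteration */
  }
  return nRet;
}
```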

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc, x86,
etc.  On X86, we miss a bunch of 'rotate by variable' cases because the rotate
matching code in dag combine doesn't look through truncates aggressively
enough.  Here are some testcases reduced from GCC PR17886:

unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}
				662
Chris Lattner	af8d3c6	2008-03-17 01:47:51 +0000	[diff] [blame]	663	//===---------------------------------------------------------------------===//
Chris Lattner	fd5fe2a	2008-03-20 04:46:13 +0000	[diff] [blame]	664
Chris Lattner	27ecda1	2010-12-15 07:10:43 +0000	[diff] [blame]	665	This (and similar related idioms):
				666
				667	unsigned int foo(unsigned char i) {
				668	return i | (i<<8) | (i<<16) | (i<<24);
				669	}
				670
				671	compiles into:
				672
				673	define i32 @foo(i8 zeroext %i) nounwind readnone ssp noredzone {
				674	entry:
				675	%conv = zext i8 %i to i32
				676	%shl = shl i32 %conv, 8
				677	%shl5 = shl i32 %conv, 16
				678	%shl9 = shl i32 %conv, 24
				679	%or = or i32 %shl9, %conv
				680	%or6 = or i32 %or, %shl5
				681	%or10 = or i32 %or6, %shl
				682	ret i32 %or10
				683	}
				684
				685	it would be better as:
				686
				687	unsigned int bar(unsigned char i) {
				688	unsigned int j=i | (i << 8);
				689	return j | (j<<16);
				690	}
				691
				692	aka:
				693
				694	define i32 @bar(i8 zeroext %i) nounwind readnone ssp noredzone {
				695	entry:
				696	%conv = zext i8 %i to i32
				697	%shl = shl i32 %conv, 8
				698	%or = or i32 %shl, %conv
				699	%shl5 = shl i32 %or, 16
				700	%or6 = or i32 %shl5, %or
				701	ret i32 %or6
				702	}
				703
				704	or even i*0x01010101, depending on the speed of the multiplier. The best way to
				705	handle this is to canonicalize it to a multiply in IR and have codegen handle
				706	lowering multiplies to shifts on cpus where shifts are faster.
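
A small sanity check of the three forms, assuming nothing beyond standard C (the helper names are made up for this sketch). Since the input is a single byte, the equivalence can be checked exhaustively:

```c
#include <assert.h>
#include <stdint.h>

/* foo: three shifts and three ors. */
static uint32_t splat_shifts(uint8_t i) {
  uint32_t v = i;
  return v | (v << 8) | (v << 16) | (v << 24);
}

/* bar: two shifts and two ors. */
static uint32_t splat_two_steps(uint8_t i) {
  uint32_t j = (uint32_t)i | ((uint32_t)i << 8);
  return j | (j << 16);
}

/* Canonical multiply form; 255 * 0x01010101 still fits in 32 bits. */
static uint32_t splat_multiply(uint8_t i) {
  return (uint32_t)i * 0x01010101u;
}

/* Exhaustive check over all 256 byte values. */
static void check_splat(void) {
  for (unsigned i = 0; i < 256; i++) {
    assert(splat_shifts((uint8_t)i) == splat_two_steps((uint8_t)i));
    assert(splat_shifts((uint8_t)i) == splat_multiply((uint8_t)i));
  }
}
```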
				707
				708	//===---------------------------------------------------------------------===//
				709
Chris Lattner	fd5fe2a	2008-03-20 04:46:13 +0000	[diff] [blame]	710	We do a number of simplifications in simplify libcalls to strength reduce
				711	standard library functions, but we don't currently merge them together. For
				712	example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy. This can only
				713	be done safely if "b" isn't modified between the strlen and memcpy of course.
				714
				715	//===---------------------------------------------------------------------===//
				716
Chris Lattner	113b336	2008-08-10 01:14:08 +0000	[diff] [blame]	717	We compile this program: (from GCC PR11680)
				718	http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487
				719
				720	Into code that runs the same speed in fast/slow modes, but both modes run 2x
				721	slower than when compiled with GCC (either 4.0 or 4.2):
				722
				723	$ llvm-g++ perf.cpp -O3 -fno-exceptions
				724	$ time ./a.out fast
				725	1.821u 0.003s 0:01.82 100.0% 0+0k 0+0io 0pf+0w
				726
				727	$ g++ perf.cpp -O3 -fno-exceptions
				728	$ time ./a.out fast
				729	0.821u 0.001s 0:00.82 100.0% 0+0k 0+0io 0pf+0w
				730
				731	It looks like we are making the same inlining decisions, so this may be raw
				732	codegen badness or something else (haven't investigated).
				733
				734	//===---------------------------------------------------------------------===//
				735
Chris Lattner	113b336	2008-08-10 01:14:08 +0000	[diff] [blame]	736	Divisibility by constant can be simplified (according to GCC PR12849) from
				737	being a mulhi to being a mul lo (cheaper). Testcase:
				738
				739	void bar(unsigned n) {
				740	if (n % 3 == 0)
				741	true();
				742	}
				743
Eli Friedman	96cf7f4	2009-12-12 23:23:43 +0000	[diff] [blame]	744	This is equivalent to the following, where 2863311531 is the multiplicative
				745	inverse of 3, and 1431655766 is ((2^32)-1)/3+1:
				746	void bar(unsigned n) {
				747	if (n * 2863311531U < 1431655766U)
				748	true();
				749	}
				750
				751	The same transformation can work with an even modulo with the addition of a
				752	rotate: rotate the result of the multiply to the right by the number of bits
				753	which need to be zero for the condition to be true, and shrink the compare RHS
				754	by the same amount. Unless the target supports rotates, though, that
				755	transformation probably isn't worthwhile.
				756
				757	The transformation can also easily be made to work with non-zero equality
				758	comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0".
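
Here is a hedged sketch of both the odd case and the even case with a rotate, written for 32-bit unsigned values. The function names and the choice of 6 as the even divisor are ours; the constants for 3 are the ones quoted above, and 715827883 is 1431655766 shifted right by one bit.

```c
#include <assert.h>
#include <stdint.h>

/* 2863311531 (0xAAAAAAAB) is the multiplicative inverse of 3 modulo 2^32. */
static int divisible_by_3(uint32_t n) {
  return n * 2863311531u < 1431655766u;
}

/* Rotate right; only valid for 0 < k < 32. */
static uint32_t rotr32(uint32_t v, unsigned k) {
  return (v >> k) | (v << (32 - k));
}

/* 6 == 2 * 3: multiply by the inverse of 3, rotate right by the one bit
   that must be zero, and shrink the compare bound by the same amount. */
static int divisible_by_6(uint32_t n) {
  return rotr32(n * 2863311531u, 1) < 715827883u;
}

static void check_one(uint32_t n) {
  assert(divisible_by_3(n) == (n % 3 == 0));
  assert(divisible_by_6(n) == (n % 6 == 0));
}

/* Sampled check: small values, values near 2^32, and a spread in between. */
static void check_divisibility(void) {
  for (uint32_t n = 0; n < 100000; n++) {
    check_one(n);
    check_one(UINT32_MAX - n);
    check_one(n * 42949u);
  }
}
```

One caveat on the non-zero equality rewrite: in unsigned arithmetic "n - 1" wraps to 2^32 - 1 when n is 0, and 2^32 - 1 is itself a multiple of 3, so "(n-1) % 3 == 0" needs a guard (or an adjusted compare bound) for n == 0.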
Chris Lattner	113b336	2008-08-10 01:14:08 +0000	[diff] [blame]	759
				760	//===---------------------------------------------------------------------===//
Chris Lattner	d7dd8b8	2008-08-19 06:22:16 +0000	[diff] [blame]	761
Chris Lattner	6d275fd	2008-10-15 16:06:03 +0000	[diff] [blame]	762	Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
				763	bunch of other stuff from this example (see PR1604):
				764
				765	#include <cstdio>
				766	struct test {
				767	int val;
				768	virtual ~test() {}
				769	};
				770
				771	int main() {
				772	test t;
				773	std::scanf("%d", &t.val);
				774	std::printf("%d\n", t.val);
				775	}
				776
				777	//===---------------------------------------------------------------------===//
				778
Nick Lewycky	edd5d3e	2008-11-27 22:41:45 +0000	[diff] [blame]	779	These functions perform the same computation, but produce different assembly.
Nick Lewycky	b3dc4ad	2008-11-27 22:12:22 +0000	[diff] [blame]	780
				781	define i8 @select(i8 %x) readnone nounwind {
				782	%A = icmp ult i8 %x, 250
				783	%B = select i1 %A, i8 0, i8 1
				784	ret i8 %B
				785	}
				786
				787	define i8 @addshr(i8 %x) readnone nounwind {
				788	%A = zext i8 %x to i9
				789	%B = add i9 %A, 6 ;; 256 - 250 == 6
				790	%C = lshr i9 %B, 8
				791	%D = trunc i9 %C to i8
				792	ret i8 %D
				793	}
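
For reference, a C model of the two functions (names are ours; the i9 arithmetic is modelled with a wider unsigned value masked to 9 bits). Both compute "x >= 250", which an exhaustive check over all i8 values confirms:

```c
#include <assert.h>
#include <stdint.h>

/* Model of @select: icmp ult followed by select. */
static uint8_t via_select(uint8_t x) {
  return x < 250 ? 0 : 1;
}

/* Model of @addshr: zext to i9, add 6, shift the carry bit down, trunc. */
static uint8_t via_addshr(uint8_t x) {
  unsigned wide = ((unsigned)x + 6u) & 0x1FFu; /* 256 - 250 == 6 */
  return (uint8_t)(wide >> 8);
}

/* Exhaustive check over all 256 inputs. */
static void check_select_addshr(void) {
  for (unsigned x = 0; x < 256; x++)
    assert(via_select((uint8_t)x) == via_addshr((uint8_t)x));
}
```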
				794
				795	//===---------------------------------------------------------------------===//
Eli Friedman	e16c0ff	2008-11-30 07:36:04 +0000	[diff] [blame]	796
				797	From gcc bug 24696:
				798	int
				799	f (unsigned long a, unsigned long b, unsigned long c)
				800	{
				801	return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
				802	}
				803	int
				804	f (unsigned long a, unsigned long b, unsigned long c)
				805	{
				806	return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
				807	}
				808	Both should combine to ((a|b) & (c-1)) != 0. Currently not optimized with
				809	"clang -emit-llvm-bc | opt -std-compile-opts".
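
A quick check that the combined form really is equivalent to both spellings. The two functions above share the name f, so this sketch renames them; the names are ours.

```c
#include <assert.h>

/* First version: logical or of the two mask tests. */
static int f_logical(unsigned long a, unsigned long b, unsigned long c) {
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}

/* Second version: bitwise or of the two mask tests. */
static int f_bitwise(unsigned long a, unsigned long b, unsigned long c) {
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}

/* Desired result: or the inputs first, then mask and test once. */
static int f_combined(unsigned long a, unsigned long b, unsigned long c) {
  return ((a | b) & (c - 1)) != 0;
}

/* Checks all small inputs, including c == 0 where c - 1 wraps to all ones. */
static void check_combined_mask(void) {
  for (unsigned long a = 0; a < 32; a++)
    for (unsigned long b = 0; b < 32; b++)
      for (unsigned long c = 0; c < 32; c++) {
        assert(f_logical(a, b, c) == f_combined(a, b, c));
        assert(f_bitwise(a, b, c) == f_combined(a, b, c));
      }
}
```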
				810
				811	//===---------------------------------------------------------------------===//
				812
				813	From GCC Bug 20192:
				814	#define PMD_MASK (~((1UL << 23) - 1))
				815	void clear_pmd_range(unsigned long start, unsigned long end)
				816	{
				817	if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
				818	f();
				819	}
				820	The expression should optimize to something like
				821	"!((start|end)&~PMD_MASK)". Currently not optimized with "clang
				822	-emit-llvm-bc | opt -std-compile-opts".
				823
				824	//===---------------------------------------------------------------------===//
				825
Eli Friedman	e16c0ff	2008-11-30 07:36:04 +0000	[diff] [blame]	826	unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return
				827	i;}
				828	unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
				829	These should combine to the same thing. Currently, the first function
				830	produces better code on X86.
				831
				832	//===---------------------------------------------------------------------===//
				833
Eli Friedman	e16c0ff	2008-11-30 07:36:04 +0000	[diff] [blame]	834	From GCC Bug 15784:
				835	#define abs(x) x>0?x:-x
				836	int f(int x, int y)
				837	{
				838	return (abs(x)) >= 0;
				839	}
				840	This should optimize to x != INT_MIN. (With -fwrapv.) Currently not
				841	optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
				842
				843	//===---------------------------------------------------------------------===//
				844
				845	From GCC Bug 14753:
				846	void
				847	rotate_cst (unsigned int a)
				848	{
				849	a = (a << 10) | (a >> 22);
				850	if (a == 123)
				851	bar ();
				852	}
				853	void
				854	minus_cst (unsigned int a)
				855	{
				856	unsigned int tem;
				857
				858	tem = 20 - a;
				859	if (tem == 5)
				860	bar ();
				861	}
				862	void
				863	mask_gt (unsigned int a)
				864	{
				865	/* This is equivalent to a > 15. */
				866	if ((a & ~7) > 8)
				867	bar ();
				868	}
				869	void
				870	rshift_gt (unsigned int a)
				871	{
				872	/* This is equivalent to a > 23. */
				873	if ((a >> 2) > 5)
				874	bar ();
				875	}
				876	All should simplify to a single comparison. All of these are
				877	currently not optimized with "clang -emit-llvm-bc | opt
				878	-std-compile-opts".
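
To spell out the single comparison each one should become, here is a sketch that returns the branch condition instead of calling bar(). The *_folded names and the ROTR10_OF_123 macro are ours, and a is assumed to be 32 bits wide as in the testcase.

```c
#include <assert.h>
#include <stdint.h>

/* 123 rotated right by 10, i.e. the only a whose rotate-left-by-10 is 123. */
#define ROTR10_OF_123 ((123u >> 10) | (123u << 22))

static int rotate_cst(uint32_t a) { a = (a << 10) | (a >> 22); return a == 123; }
static int rotate_cst_folded(uint32_t a) { return a == ROTR10_OF_123; }

static int minus_cst(uint32_t a) { uint32_t tem = 20 - a; return tem == 5; }
static int minus_cst_folded(uint32_t a) { return a == 15; }

static int mask_gt(uint32_t a) { return (a & ~7u) > 8; }
static int mask_gt_folded(uint32_t a) { return a > 15; }

static int rshift_gt(uint32_t a) { return (a >> 2) > 5; }
static int rshift_gt_folded(uint32_t a) { return a > 23; }

static void check_one(uint32_t a) {
  assert(rotate_cst(a) == rotate_cst_folded(a));
  assert(minus_cst(a) == minus_cst_folded(a));
  assert(mask_gt(a) == mask_gt_folded(a));
  assert(rshift_gt(a) == rshift_gt_folded(a));
}

/* Sampled check: small values, values near 2^32, and around the rotate constant. */
static void check_single_compare(void) {
  for (uint32_t i = 0; i < 4096; i++) {
    check_one(i);
    check_one(UINT32_MAX - i);
    check_one(ROTR10_OF_123 - 2048 + i);
  }
}
```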
				879
				880	//===---------------------------------------------------------------------===//
				881
				882	From GCC Bug 32605:
				883	int c(int* x) {return (char)x+2 == (char)x;}
				884	Should combine to 0. Currently not optimized with "clang
				885	-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).
				886
				887	//===---------------------------------------------------------------------===//
				888
Eli Friedman	e16c0ff	2008-11-30 07:36:04 +0000	[diff] [blame]	889	int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
				890	Should be combined to "((b >> 1) | b) & 1". Currently not optimized
				891	with "clang -emit-llvm-bc | opt -std-compile-opts".
				892
				893	//===---------------------------------------------------------------------===//
				894
				895	unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
				896	Should combine to "x | (y & 3)". Currently not optimized with "clang
				897	-emit-llvm-bc | opt -std-compile-opts".
				898
				899	//===---------------------------------------------------------------------===//
				900
Eli Friedman	e16c0ff	2008-11-30 07:36:04 +0000	[diff] [blame]	901	int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
				902	Should fold to "(~a & c) | (a & b)". Currently not optimized with
				903	"clang -emit-llvm-bc | opt -std-compile-opts".
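
The fold works bit by bit: where a bit of a is set, both sides reduce to the matching bit of b; where it is clear, both reduce to the bit of c. A small check, with function names of our own choosing:

```c
#include <assert.h>

static int a_original(int a, int b, int c) {
  return (~a & c) | ((c | a) & b);
}

static int a_folded(int a, int b, int c) {
  return (~a & c) | (a & b);
}

/* The identity is bitwise, so a small range already covers every
   combination of bits; negative values are included for good measure. */
static void check_fold(void) {
  for (int a = -8; a < 8; a++)
    for (int b = -8; b < 8; b++)
      for (int c = -8; c < 8; c++)
        assert(a_original(a, b, c) == a_folded(a, b, c));
}
```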
				904
				905	//===---------------------------------------------------------------------===//
				906
				907	int a(int a,int b) {return (~(a|b))|a;}
				908	Should fold to "a|~b". Currently not optimized with "clang
				909	-emit-llvm-bc | opt -std-compile-opts".
				910
				911	//===---------------------------------------------------------------------===//
				912
				913	int a(int a, int b) {return (a&&b) || (a&&!b);}
				914	Should fold to "a". Currently not optimized with "clang -emit-llvm-bc
				915	| opt -std-compile-opts".
				916
				917	//===---------------------------------------------------------------------===//
				918
				919	int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
				920	Should fold to "a ? b : c", or at least something sane. Currently not
				921	optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
				922
				923	//===---------------------------------------------------------------------===//
				924
				925	int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
				926	Should fold to a && (b || c). Currently not optimized with "clang
				927	-emit-llvm-bc | opt -std-compile-opts".
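
Since only the truth value of each argument matters, a handful of inputs is enough to confirm the fold. The names here are ours:

```c
#include <assert.h>

static int a_original(int a, int b, int c) {
  return (a && b) || (a && c) || (a && b && c);
}

static int a_folded(int a, int b, int c) {
  return a && (b || c);
}

/* Tries zero and several non-zero values for each argument. */
static void check_bool_fold(void) {
  static const int vals[] = {-1, 0, 1, 2};
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      for (int k = 0; k < 4; k++)
        assert(a_original(vals[i], vals[j], vals[k]) ==
               a_folded(vals[i], vals[j], vals[k]));
}
```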
				928
				929	//===---------------------------------------------------------------------===//
				930
				931	int a(int x) {return x | ((x & 8) ^ 8);}
				932	Should combine to x | 8. Currently not optimized with "clang
				933	-emit-llvm-bc | opt -std-compile-opts".
				934
				935	//===---------------------------------------------------------------------===//
				936
				937	int a(int x) {return x ^ ((x & 8) ^ 8);}
				938	Should also combine to x | 8. Currently not optimized with "clang
				939	-emit-llvm-bc | opt -std-compile-opts".
				940
				941	//===---------------------------------------------------------------------===//
				942
Eli Friedman	e16c0ff	2008-11-30 07:36:04 +0000	[diff] [blame]	943	int a(int x) {return ((x | -9) ^ 8) & x;}
				944	Should combine to x & -9. Currently not optimized with "clang
				945	-emit-llvm-bc | opt -std-compile-opts".
				946
				947	//===---------------------------------------------------------------------===//
				948
				949	unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
				950	Should combine to "a * 0x88888888 >> 31". Currently not optimized
				951	with "clang -emit-llvm-bc | opt -std-compile-opts".
				952
				953	//===---------------------------------------------------------------------===//
				954
				955	unsigned a(char* x) {if ((*x & 32) == 0) return b();}
				956	There's an unnecessary zext in the generated code with "clang
				957	-emit-llvm-bc | opt -std-compile-opts".
				958
				959	//===---------------------------------------------------------------------===//
				960
				961	unsigned a(unsigned long long x) {return 40 * (x >> 1);}
				962	Should combine to "20 * (((unsigned)x) & -2)". Currently not
				963	optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
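
Why this is legal: 40 * (x >> 1) is 20 * (x with its low bit cleared), and truncating to unsigned commutes with the multiply and the mask. A sampled check (function names are ours; the multiplier used to generate test inputs is an arbitrary odd constant):

```c
#include <assert.h>

/* Original: 64-bit shift and multiply, truncated on return. */
static unsigned a_original(unsigned long long x) {
  return (unsigned)(40 * (x >> 1));
}

/* Desired: truncate first, then mask and multiply at the narrow width. */
static unsigned a_combined(unsigned long long x) {
  return 20u * ((unsigned)x & -2u);
}

/* Checks small values and values with bits set across all 64 positions. */
static void check_narrowing(void) {
  for (unsigned long long i = 0; i < 100000; i++) {
    unsigned long long x = i * 0x9E3779B97F4A7C15ull;
    assert(a_original(i) == a_combined(i));
    assert(a_original(x) == a_combined(x));
  }
}
```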
				964
				965	//===---------------------------------------------------------------------===//
Bill Wendling	85de4b3	2008-12-02 05:12:47 +0000	[diff] [blame]	966
Chris Lattner	0cdc0bb	2008-12-02 06:32:34 +0000	[diff] [blame]	967	This was noticed in the entryblock for grokdeclarator in 403.gcc:
				968
				969	%tmp = icmp eq i32 %decl_context, 4
				970	%decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
				971	%tmp1 = icmp eq i32 %decl_context_addr.0, 1
				972	%decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0
				973
				974	tmp1 should be simplified to something like:
				975	(!tmp && decl_context == 1)
				976
				977	This allows recursive simplifications, tmp1 is used all over the place in
				978	the function, e.g. by:
				979
				980	%tmp23 = icmp eq i32 %decl_context_addr.1, 0 ; <i1> [#uses=1]
				981	%tmp24 = xor i1 %tmp1, true ; <i1> [#uses=1]
				982	%or.cond8 = and i1 %tmp23, %tmp24 ; <i1> [#uses=1]
				983
				984	later.
				985
Chris Lattner	543d6c6	2008-12-06 19:28:22 +0000	[diff] [blame]	986	//===---------------------------------------------------------------------===//
				987
Chris Lattner	58ccf88	2009-11-29 02:19:52 +0000	[diff] [blame]	988	[STORE SINKING]
				989
Chris Lattner	543d6c6	2008-12-06 19:28:22 +0000	[diff] [blame]	990	Store sinking: This code:
				991
				992	void f (int n, int *cond, int *res) {
				993	int i;
				994	*res = 0;
				995	for (i = 0; i < n; i++)
				996	if (*cond)
				997	*res ^= 234; /* (*) */
				998	}
				999
				1000	On this function GVN hoists the fully redundant value of *res, but nothing
				1001	moves the store out. This gives us this code:
				1002
				1003	bb: ; preds = %bb2, %entry
				1004	%.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
				1005	%i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
				1006	%1 = load i32* %cond, align 4
				1007	%2 = icmp eq i32 %1, 0
				1008	br i1 %2, label %bb2, label %bb1
				1009
				1010	bb1: ; preds = %bb
				1011	%3 = xor i32 %.rle, 234
				1012	store i32 %3, i32* %res, align 4
				1013	br label %bb2
				1014
				1015	bb2: ; preds = %bb, %bb1
				1016	%.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
				1017	%indvar.next = add i32 %i.05, 1
				1018	%exitcond = icmp eq i32 %indvar.next, %n
				1019	br i1 %exitcond, label %return, label %bb
				1020
				1021	DSE should sink partially dead stores to get the store out of the loop.
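
A source-level sketch of the result we would like, under the assumption that cond and res do not alias (the loop rereads *cond every iteration, so a store to *res that changed *cond would be observable). The f_sunk name is ours; this is an illustration of the goal, not of how DSE would get there.

```c
#include <assert.h>

/* As written: the store to *res sits inside the loop. */
static void f_original(int n, int *cond, int *res) {
  *res = 0;
  for (int i = 0; i < n; i++)
    if (*cond)
      *res ^= 234;
}

/* With the store sunk: accumulate in a local, store once after the loop.
   ASSUMPTION: cond and res point to distinct objects. */
static void f_sunk(int n, int *cond, int *res) {
  int tmp = 0;
  for (int i = 0; i < n; i++)
    if (*cond)
      tmp ^= 234;
  *res = tmp;
}

/* Compares the two versions for a few trip counts and both branch outcomes. */
static void check_store_sinking(void) {
  for (int n = 0; n < 6; n++)
    for (int c = 0; c < 2; c++) {
      int cond = c, r1 = -1, r2 = -1;
      f_original(n, &cond, &r1);
      f_sunk(n, &cond, &r2);
      assert(r1 == r2);
    }
}
```

The original already stores to *res unconditionally before the loop, so the single store after the loop does not introduce a write on a path that had none.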
				1022
Chris Lattner	da93063	2008-12-06 22:52:12 +0000	[diff] [blame]	1023	Here's another partial dead case:
				1024	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395
				1025
Chris Lattner	543d6c6	2008-12-06 19:28:22 +0000	[diff] [blame]	1026	//===---------------------------------------------------------------------===//
				1027
				1028	Scalar PRE hoists the mul in the common block up to the else:
				1029
				1030	int test (int a, int b, int c, int g) {
				1031	int d, e;
				1032	if (a)
				1033	d = b * c;
				1034	else
				1035	d = b - c;
				1036	e = b * c + g;
				1037	return d + e;
				1038	}
				1039
				1040	It would be better to do the mul once to reduce codesize above the if.
				1041	This is GCC PR38204.
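
In source terms, the version we would prefer looks roughly like this (test_single_mul is a name made up for the sketch):

```c
#include <assert.h>

/* As written: b * c appears on the 'if' path and again in the common block. */
static int test_original(int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

/* Preferred shape: one multiply above the if, reused on both paths. */
static int test_single_mul(int a, int b, int c, int g) {
  int prod = b * c;
  int d = a ? prod : b - c;
  int e = prod + g;
  return d + e;
}

/* Small-range check; values are kept small to stay clear of signed overflow. */
static void check_single_mul(void) {
  for (int a = 0; a < 2; a++)
    for (int b = -5; b <= 5; b++)
      for (int c = -5; c <= 5; c++)
        for (int g = -5; g <= 5; g++)
          assert(test_original(a, b, c, g) == test_single_mul(a, b, c, g));
}
```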
				1042
Chris Lattner	245de78	2011-01-06 07:41:22 +0000	[diff] [blame]	1043
				1044	//===---------------------------------------------------------------------===//
				1045	This simple function from 179.art:
				1046
				1047	int winner, numf2s;
				1048	struct { double y; int reset; } *Y;
				1049
				1050	void find_match() {
				1051	int i;
				1052	winner = 0;
				1053	for (i=0;i<numf2s;i++)
				1054	if (Y[i].y > Y[winner].y)
				1055	winner =i;
				1056	}
				1057
				1058	Compiles into (with clang TBAA):
				1059
				1060	for.body: ; preds = %for.inc, %bb.nph
				1061	%indvar = phi i64 [ 0, %bb.nph ], [ %indvar.next, %for.inc ]
				1062	%i.01718 = phi i32 [ 0, %bb.nph ], [ %i.01719, %for.inc ]
				1063	%tmp4 = getelementptr inbounds %struct.anon* %tmp3, i64 %indvar, i32 0
				1064	%tmp5 = load double* %tmp4, align 8, !tbaa !4
				1065	%idxprom7 = sext i32 %i.01718 to i64
				1066	%tmp10 = getelementptr inbounds %struct.anon* %tmp3, i64 %idxprom7, i32 0
				1067	%tmp11 = load double* %tmp10, align 8, !tbaa !4
				1068	%cmp12 = fcmp ogt double %tmp5, %tmp11
				1069	br i1 %cmp12, label %if.then, label %for.inc
				1070
				1071	if.then: ; preds = %for.body
				1072	%i.017 = trunc i64 %indvar to i32
				1073	br label %for.inc
				1074
				1075	for.inc: ; preds = %for.body, %if.then
				1076	%i.01719 = phi i32 [ %i.01718, %for.body ], [ %i.017, %if.then ]
				1077	%indvar.next = add i64 %indvar, 1
				1078	%exitcond = icmp eq i64 %indvar.next, %tmp22
				1079	br i1 %exitcond, label %for.cond.for.end_crit_edge, label %for.body
				1080
				1081
				1082	It is good that we hoisted the reloads of numf2s and Y out of the loop and
				1083	sunk the store to winner out.
				1084
				1085	However, this is awful on several levels: the conditional truncate in the loop
				1086	(-indvars at fault? why can't we completely promote the IV to i64?).
				1087
				1088	Beyond that, we have a partially redundant load in the loop: if "winner" (aka
				1089	%i.01718) isn't updated, we reload Y[winner].y the next time through the loop.
				1090	Similarly, the addressing that feeds it (including the sext) is redundant. In
				1091	the end we get this generated assembly:
				1092
				1093	LBB0_2: ## %for.body
				1094	## =>This Inner Loop Header: Depth=1
				1095	movsd (%rdi), %xmm0
				1096	movslq %edx, %r8
				1097	shlq $4, %r8
				1098	ucomisd (%rcx,%r8), %xmm0
				1099	jbe LBB0_4
				1100	movl %esi, %edx
				1101	LBB0_4: ## %for.inc
				1102	addq $16, %rdi
				1103	incq %rsi
				1104	cmpq %rsi, %rax
				1105	jne LBB0_2
				1106
				1107	All things considered this isn't too bad, but we shouldn't need the movslq or
				1108	the shlq instruction, or the load folded into ucomisd every time through the
				1109	loop.
				1110
				1111	On an x86-specific topic, if the loop can't be restructured, the movl should be a
				1112	cmov.
				1113
Chris Lattner	543d6c6	2008-12-06 19:28:22 +0000	[diff] [blame]	1114	//===---------------------------------------------------------------------===//
				1115
Chris Lattner	58ccf88	2009-11-29 02:19:52 +0000	[diff] [blame]	1116	[STORE SINKING]
				1117
Chris Lattner	543d6c6	2008-12-06 19:28:22 +0000	[diff] [blame]	1118	GCC PR37810 is an interesting case where we should sink load/store reload
				1119	into the if block and outside the loop, so we don't reload/store it on the
				1120	non-call path.
				1121
				1122	for () {
				1123	*P += 1;
				1124	if ()
				1125	call();
				1126	else
				1127	...
				1128	->
				1129	tmp = *P
				1130	for () {
				1131	tmp += 1;
				1132	if () {
				1133	*P = tmp;
				1134	call();
				1135	tmp = *P;
				1136	} else ...
				1137	}
				1138	*P = tmp;
				1139
Chris Lattner	81ee731	2008-12-15 07:49:24 +0000	[diff] [blame]	1140	We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
				1141	we don't sink the store. We need partially dead store sinking.
				1142
Chris Lattner	543d6c6	2008-12-06 19:28:22 +0000	[diff] [blame]	1143	//===---------------------------------------------------------------------===//
				1144
Chris Lattner	58ccf88	2009-11-29 02:19:52 +0000	[diff] [blame]	1145	[LOAD PRE CRIT EDGE SPLITTING]
Chris Lattner	81ee731	2008-12-15 07:49:24 +0000	[diff] [blame]	1146
Chris Lattner	543d6c6	2008-12-06 19:28:22 +0000	[diff] [blame]	1147	GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
				1148	leading to excess stack traffic. This could be handled by GVN with some crazy
				1149	symbolic phi translation. The code we get looks like (g is on the stack):
				1150
				1151	bb2: ; preds = %bb1
				1152	..
				1153	%9 = getelementptr %struct.f* %g, i32 0, i32 0
				1154	store i32 %8, i32* %9, align bel %bb3
				1155
				1156	bb3: ; preds = %bb1, %bb2, %bb
				1157	%c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
				1158	%b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
				1159	%10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
				1160	%11 = load i32* %10, align 4
				1161
Chris Lattner	ca9e0e8	2009-11-27 16:53:57 +0000	[diff] [blame]	1162	%11 is partially redundant, and in BB2 it should have the value %8.
Chris Lattner	543d6c6	2008-12-06 19:28:22 +0000	[diff] [blame]	1163
Chris Lattner	58ccf88	2009-11-29 02:19:52 +0000	[diff] [blame]	1164	GCC PR33344 and PR35287 are similar cases.
Chris Lattner	da93063	2008-12-06 22:52:12 +0000	[diff] [blame]	1165
Chris Lattner	06c26d9	2009-11-05 18:19:19 +0000	[diff] [blame]	1166
				1167	//===---------------------------------------------------------------------===//
				1168
Chris Lattner	58ccf88	2009-11-29 02:19:52 +0000	[diff] [blame]	1169	[LOAD PRE]
				1170
Chris Lattner	da93063	2008-12-06 22:52:12 +0000	[diff] [blame]	1171	There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
Chris Lattner	58ccf88	2009-11-29 02:19:52 +0000	[diff] [blame]	1172	GCC testsuite, ones we don't get yet are (checked through loadpre25):
				1173
				1174	[CRIT EDGE BREAKING]
				1175	loadpre3.c predcom-4.c
				1176
				1177	[PRE OF READONLY CALL]
				1178	loadpre5.c
				1179
				1180	[TURN SELECT INTO BRANCH]
				1181	loadpre14.c loadpre15.c
				1182
				1183	actually a conditional increment: loadpre18.c loadpre19.c
				1184
Chris Lattner	aded09f	2010-12-15 06:38:24 +0000	[diff] [blame]	1185	//===---------------------------------------------------------------------===//
				1186
				1187	[LOAD PRE / STORE SINKING / SPEC HACK]
				1188
				1189	This is a chunk of code from 456.hmmer:
				1190
				1191	int f(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int *dpp,
				1192	int *tpdm, int xmb, int *bp, int *ms) {
				1193	int k, sc;
				1194	for (k = 1; k <= M; k++) {
				1195	mc[k] = mpp[k-1] + tpmm[k-1];
				1196	if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
				1197	if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
				1198	if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
				1199	mc[k] += ms[k];
				1200	}
				1201	}
				1202
				1203	It is very profitable for this benchmark to turn the conditional stores to mc[k]
				1204	into a conditional move (select instr in IR) and allow the final store to do the
				1205	store. See GCC PR27313 for more details. Note that this is valid to xform even
				1206	with the new C++ memory model, since mc[k] is previously loaded and later
				1207	stored.
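
A source-level picture of the desired form, assuming mc does not overlap any of the input arrays (f_select and the test data are ours; the functions return void here because the excerpt never returns a value):

```c
#include <assert.h>

/* The loop as in the excerpt: up to three conditional stores to mc[k]. */
static void f_original(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim,
                       int *dpp, int *tpdm, int xmb, int *bp, int *ms) {
  int k, sc;
  for (k = 1; k <= M; k++) {
    mc[k] = mpp[k-1] + tpmm[k-1];
    if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
    mc[k] += ms[k];
  }
}

/* Select form: keep the running maximum in a local, store mc[k] once.
   ASSUMPTION: mc does not alias the arrays that are read. */
static void f_select(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim,
                     int *dpp, int *tpdm, int xmb, int *bp, int *ms) {
  for (int k = 1; k <= M; k++) {
    int best = mpp[k-1] + tpmm[k-1];
    int sc = ip[k-1] + tpim[k-1];
    best = sc > best ? sc : best;
    sc = dpp[k-1] + tpdm[k-1];
    best = sc > best ? sc : best;
    sc = xmb + bp[k];
    best = sc > best ? sc : best;
    mc[k] = best + ms[k];
  }
}

/* Runs both versions on the same made-up inputs and compares the outputs. */
static void check_hmmer(void) {
  enum { M = 4 };
  int mpp[M + 1] = {3, -1, 7, 0, 2}, tpmm[M + 1] = {1, 1, -9, 4, 0};
  int ip[M + 1] = {5, 0, 0, 9, 1}, tpim[M + 1] = {0, 2, -3, 1, 1};
  int dpp[M + 1] = {-4, 8, 1, 1, 6}, tpdm[M + 1] = {2, 2, 2, -7, 0};
  int bp[M + 1] = {0, 1, 6, -2, 30}, ms[M + 1] = {9, 1, -1, 5, 2};
  int mc1[M + 1] = {0}, mc2[M + 1] = {0};
  f_original(M, mc1, mpp, tpmm, ip, tpim, dpp, tpdm, 3, bp, ms);
  f_select(M, mc2, mpp, tpmm, ip, tpim, dpp, tpdm, 3, bp, ms);
  for (int k = 0; k <= M; k++)
    assert(mc1[k] == mc2[k]);
}
```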
Chris Lattner	58ccf88	2009-11-29 02:19:52 +0000	[diff] [blame]	1208
				1209	//===---------------------------------------------------------------------===//
				1210
				1211	[SCALAR PRE]
				1212	There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the
				1213	GCC testsuite.
Chris Lattner	da93063	2008-12-06 22:52:12 +0000	[diff] [blame]	1214
Chris Lattner	5d196e6	2008-12-15 08:32:28 +0000	[diff] [blame]	1215	//===---------------------------------------------------------------------===//
				1216
				1217	There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
Chris Lattner	58ccf88	2009-11-29 02:19:52 +0000	[diff] [blame]	1218	GCC testsuite. For example, we get the first example in predcom-1.c, but
				1219	miss the second one:
Chris Lattner	5d196e6	2008-12-15 08:32:28 +0000	[diff] [blame]	1220
Chris Lattner	58ccf88	2009-11-29 02:19:52 +0000	[diff] [blame]	1221	unsigned fib[1000];
				1222	unsigned avg[1000];
Chris Lattner	5d196e6	2008-12-15 08:32:28 +0000	[diff] [blame]	1223
Chris Lattner	58ccf88	2009-11-29 02:19:52 +0000	[diff] [blame]	1224	__attribute__ ((noinline))
				1225	void count_averages(int n) {
				1226	int i;
				1227	for (i = 1; i < n; i++)
				1228	avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;
				1229	}
Chris Lattner	5d196e6	2008-12-15 08:32:28 +0000	[diff] [blame]	1230
Chris Lattner	58ccf88	2009-11-29 02:19:52 +0000	[diff] [blame]	1231	which compiles into two loads instead of one in the loop.
Chris Lattner	5d196e6	2008-12-15 08:32:28 +0000	[diff] [blame]	1232
Chris Lattner	58ccf88	2009-11-29 02:19:52 +0000	[diff] [blame]	1233	predcom-2.c is the same as predcom-1.c
Chris Lattner	5d196e6	2008-12-15 08:32:28 +0000	[diff] [blame]	1234
Chris Lattner	5d196e6	2008-12-15 08:32:28 +0000	[diff] [blame]	1235	predcom-3.c is very similar but needs loads feeding each other instead of
				1236	store->load.
Chris Lattner	5d196e6	2008-12-15 08:32:28 +0000	[diff] [blame]	1237
				1238
				1239	//===---------------------------------------------------------------------===//
				1240
Chris Lattner	082da53	2010-01-23 17:59:23 +0000	[diff] [blame]	1241	[ALIAS ANALYSIS]
				1242
Chris Lattner	5d196e6	2008-12-15 08:32:28 +0000	[diff] [blame]	1243	Type based alias analysis:
Chris Lattner	da93063	2008-12-06 22:52:12 +0000	[diff] [blame]	1244	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705
				1245
Chris Lattner	082da53	2010-01-23 17:59:23 +0000	[diff] [blame]	1246	We should do better analysis of posix_memalign. At the least it should
				1247	no-capture its pointer argument, at best, we should know that the out-value
				1248	result doesn't point to anything (like malloc). One example of this is in
				1249	SingleSource/Benchmarks/Misc/dt.c
				1250
Chris Lattner	da93063	2008-12-06 22:52:12 +0000	[diff] [blame]	1251	//===---------------------------------------------------------------------===//
				1252
Chris Lattner	da93063	2008-12-06 22:52:12 +0000	[diff] [blame]	1253	Interesting missed case because of control flow flattening (should be 2 loads):
				1254	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
Chris Lattner	5d196e6	2008-12-15 08:32:28 +0000	[diff] [blame]	1255	With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
				1256	opt -mem2reg -gvn -instcombine | llvm-dis
Chris Lattner	58ccf88	2009-11-29 02:19:52 +0000	[diff] [blame]	1257	we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT
Chris Lattner	5d196e6	2008-12-15 08:32:28 +0000	[diff] [blame]	1258	VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS
Chris Lattner	da93063	2008-12-06 22:52:12 +0000	[diff] [blame]	1259
				1260	//===---------------------------------------------------------------------===//
				1261
				1262	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
				1263	We could eliminate the branch condition here, loading from null is undefined:
				1264
				1265	struct S { int w, x, y, z; };
				1266	struct T { int r; struct S s; };
				1267	void bar (struct S, int);
				1268	void foo (int a, struct T b)
				1269	{
				1270	struct S *c = 0;
				1271	if (a)
				1272	c = &b.s;
				1273	bar (*c, a);
				1274	}
				1275
				1276	//===---------------------------------------------------------------------===//
Chris Lattner	0cdc0bb	2008-12-02 06:32:34 +0000	[diff] [blame]	1277
Chris Lattner	8a35adf	2008-12-23 20:52:52 +0000	[diff] [blame]	1278	simplifylibcalls should do several optimizations for strspn/strcspn:
				1279
Chris Lattner	8a35adf	2008-12-23 20:52:52 +0000	[diff] [blame]	1280	strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):
				1281
				1282	size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
				1283	int __reject3) {
				1284	register size_t __result = 0;
				1285	while (__s[__result] != '\0' && __s[__result] != __reject1 &&
				1286	__s[__result] != __reject2 && __s[__result] != __reject3)
				1287	++__result;
				1288	return __result;
				1289	}
				1290
				1291	This should turn into a switch on the character. See PR3253 for some notes on
				1292	codegen.
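
For a concrete picture, here is what the switch form might look like for one fixed reject set. The function name and the choice of "," and ";" are hypothetical; a real expansion would be generated from whatever constant string the call site passes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical expansion of strcspn(s, ",;") as a switch on the character. */
static size_t strcspn_comma_semi(const char *s) {
  size_t result = 0;
  for (;;) {
    switch (s[result]) {
    case '\0': /* end of string */
    case ',':
    case ';':
      return result;
    default:
      ++result;
    }
  }
}

/* Compares against the library strcspn on a few sample strings. */
static void check_strcspn_switch(void) {
  const char *samples[] = {"", "abc", "a,b", ";x", "no delimiters here", "tail,"};
  for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
    assert(strcspn_comma_semi(samples[i]) == strcspn(samples[i], ",;"));
}
```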
				1293
				1294	456.hmmer apparently uses strcspn and strspn a lot. 471.omnetpp uses strspn.
				1295
				1296	//===---------------------------------------------------------------------===//
Chris Lattner	a414225	2008-12-31 00:54:13 +0000	[diff] [blame]	1297
				1298	"gas" uses this idiom:
				1299	else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
				1300	..
				1301	else if (strchr ("<>", *intel_parser.op_string))
				1302
				1303	Those should be turned into a switch.
				1304
				1305	//===---------------------------------------------------------------------===//
Chris Lattner	7cb3ae0	2009-01-08 06:52:57 +0000	[diff] [blame]	1306
				1307	252.eon contains this interesting code:
				1308
				1309	%3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
				1310	%3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
				1311	%strlen = call i32 @strlen(i8* %3072) ; uses = 1
				1312	%endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
				1313	call void @llvm.memcpy.i32(i8* %endptr,
				1314	i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
				1315	%3074 = call i32 @strlen(i8* %endptr) nounwind readonly
				1316
				1317	This is interesting for a couple of reasons.
				1318	
Benjamin Kramer	dfa40f8	2010-12-23 15:32:07 +0000	[diff] [blame]	1319	First, the strlen that follows the memcpy can be replaced with:
Chris Lattner	7cb3ae0	2009-01-08 06:52:57 +0000	[diff] [blame]	1320
				1321	%3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly
				1322
				1323	Because the destination was just copied into the specified memory buffer. This,
				1324	in turn, can be constant folded to "4".
				1325
				1326	In other code, it contains:
				1327
				1328	%endptr6978 = bitcast i8* %endptr69 to i32*
				1329	store i32 7107374, i32* %endptr6978, align 1
				1330	%3167 = call i32 @strlen(i8* %endptr69) nounwind readonly
				1331
				1332	Which could also be constant folded. Whatever is producing this should probably
				1333	be fixed to leave this as a memcpy from a string.
				1334
				1335	Further, eon also has an interesting partially redundant strlen call:
				1336
				1337	bb8: ; preds = %_ZN18eonImageCalculatorC1Ev.exit
				1338	%682 = getelementptr i8** %argv, i32 6 ; <i8**> [#uses=2]
				1339	%683 = load i8** %682, align 4 ; <i8*> [#uses=4]
				1340	%684 = load i8* %683, align 1 ; <i8> [#uses=1]
				1341	%685 = icmp eq i8 %684, 0 ; <i1> [#uses=1]
				1342	br i1 %685, label %bb10, label %bb9
				1343
				1344	bb9: ; preds = %bb8
				1345	%686 = call i32 @strlen(i8* %683) nounwind readonly
				1346	%687 = icmp ugt i32 %686, 254 ; <i1> [#uses=1]
				1347	br i1 %687, label %bb10, label %bb11
				1348
				1349	bb10: ; preds = %bb9, %bb8
				1350	%688 = call i32 @strlen(i8* %683) nounwind readonly
				1351
				1352	This could be eliminated by doing the strlen once in bb8, saving code size and
				1353	improving perf on the bb8->9->10 path.
				1354
				1355	//===---------------------------------------------------------------------===//
Chris Lattner	6c2ee50	2009-01-08 07:34:55 +0000	[diff] [blame]	1356
				1357	I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
				1358	which looks like:
				1359	%movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0
				1360
				1361
				1362	bb62: ; preds = %bb55, %bb53
				1363	%promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
				1364	%171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
				1365	%172 = add i32 %171, -1 ; <i32> [#uses=1]
				1366	%173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172
				1367
				1368	... no stores ...
				1369	br i1 %or.cond, label %bb65, label %bb72
				1370
				1371	bb65: ; preds = %bb62
				1372	store i8 0, i8* %173, align 1
				1373	br label %bb72
				1374
				1375	bb72: ; preds = %bb65, %bb62
				1376	%trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
				1377	%177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
				1378
				1379	Note that on the bb62->bb72 path, the %177 strlen call is partially
				1380	redundant with the %171 call. At worst, we could shove the %177 strlen call
				1381	up into the bb65 block, moving it out of the bb62->bb72 path. However, note
				1382	that bb65 stores to the string, zeroing out the last byte. This means that on
				1383	that path the value of %177 is actually just %171-1. A sub is cheaper than a
				1384	strlen!
				1385
				1386	This pattern repeats several times, basically doing:
				1387
				1388	A = strlen(P);
				1389	P[A-1] = 0;
				1390	B = strlen(P);
				1391	where it is "obvious" that B = A-1.
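
A minimal C sketch of the fold described above, assuming P is non-empty.
chop_naive and chop_folded are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive form: two strlen calls, as in the crafty IR. */
size_t chop_naive(char *p) {
    size_t a = strlen(p);
    assert(a > 0);        /* the pattern assumes a non-empty string */
    p[a - 1] = 0;         /* zero out the last byte */
    return strlen(p);     /* B, recomputed by scanning again */
}

/* Desired form: B is derived from A with a subtraction. */
size_t chop_folded(char *p) {
    size_t a = strlen(p);
    assert(a > 0);
    p[a - 1] = 0;
    return a - 1;         /* B == A - 1, no second scan */
}
```

Both functions leave the buffer in the same state and return the same length;
only the second avoids the rescan.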
				1392
				1393	//===---------------------------------------------------------------------===//
				1394
Chris Lattner	6c2ee50	2009-01-08 07:34:55 +0000	[diff] [blame]	1395	186.crafty has this interesting pattern with the "out.4543" variable:
				1396
				1397	call void @llvm.memcpy.i32(
				1398	i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
				1399	i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
				1400	%101 = call@printf(i8* ... @out.4543, i32 0, i32 0)) nounwind
				1401
				1402	It is basically doing:
				1403
				1404	memcpy(globalarray, "string");
				1405	printf(..., globalarray);
				1406
				1407	Because printf only reads the memory, forward substituting the string directly
				1408	into the printf eliminates the reads from globalarray.
				1409	Since this pattern occurs frequently in crafty (due to the "DisplayTime" and
				1410	other similar functions) there are many stores to "out". Once all the printfs
				1411	stop using "out", all that is left is the memcpy's into it. This should allow
				1412	globalopt to remove the "stored only" global.
				1413
				1414	//===---------------------------------------------------------------------===//
				1415
Dan Gohman	83d2e06	2009-01-20 01:07:33 +0000	[diff] [blame]	1416	This code:
				1417
				1418	define inreg i32 @foo(i8* inreg %p) nounwind {
				1419	%tmp0 = load i8* %p
				1420	%tmp1 = ashr i8 %tmp0, 5
				1421	%tmp2 = sext i8 %tmp1 to i32
				1422	ret i32 %tmp2
				1423	}
				1424
				1425	could be dagcombine'd to a sign-extending load with a shift.
				1426	For example, on x86 this currently gets this:
				1427
				1428	movb (%eax), %al
				1429	sarb $5, %al
				1430	movsbl %al, %eax
				1431
				1432	while it could get this:
				1433
				1434	movsbl (%eax), %eax
				1435	sarl $5, %eax
				1436
				1437	//===---------------------------------------------------------------------===//
Chris Lattner	705ac70	2009-01-22 07:16:03 +0000	[diff] [blame]	1438
				1439	GCC PR31029:
				1440
				1441	int test(int x) { return 1-x == x; } // --> return false
				1442	int test2(int x) { return 2-x == x; } // --> return x == 1 ?
				1443
				1444	Always foldable for odd constants; what is the rule for even?
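
One way to explore the even case is brute force over a narrow type. The sketch
below is an assumption-laden illustration, not the rule itself: it uses 8-bit
wraparound arithmetic, whereas the testcase uses signed int, where overflow is
undefined and the compiler may be entitled to the simpler "x == C/2" answer.
count_solutions is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Count the 8-bit values x with (c - x) == x under wraparound arithmetic. */
int count_solutions(uint8_t c) {
    int count = 0;
    for (unsigned x = 0; x < 256; ++x)
        if ((uint8_t)(c - x) == (uint8_t)x)
            ++count;
    return count;
}
```

Under wraparound, C - x == x means 2*x == C modulo 2^8. The left side is always
even, so odd C has no solutions; even C has two (C/2 and C/2 + 128).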
				1445
				1446	//===---------------------------------------------------------------------===//
				1447
Torok Edwin	3cedd4d	2009-01-24 19:30:25 +0000	[diff] [blame]	1448	PR 3381: GEP to field of size 0 inside a struct could be turned into GEP
				1449	for next field in struct (which is at same address).
				1450
				1451	For example: store of float into { {{}}, float } could be turned into a store to
				1452	the float directly.
				1453
Torok Edwin	87d5ca0	2009-02-20 18:42:06 +0000	[diff] [blame]	1454	//===---------------------------------------------------------------------===//
Nick Lewycky	5c10a3a	2009-02-25 06:52:48 +0000	[diff] [blame]	1455
Chris Lattner	17a999e	2009-05-11 17:41:40 +0000	[diff] [blame]	1456	The arg promotion pass should make use of nocapture to make its alias analysis
				1457	stuff much more precise.
				1458
				1459	//===---------------------------------------------------------------------===//
				1460
				1461	The following functions should be optimized to use a select instead of a
				1462	branch (from gcc PR40072):
				1463
				1464	char char_int(int m) {if(m>7) return 0; return m;}
				1465	int int_char(char m) {if(m>7) return 0; return m;}
				1466
				1467	//===---------------------------------------------------------------------===//
				1468
Bill Wendling	fd2730e	2009-10-27 22:48:31 +0000	[diff] [blame]	1469	int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }
				1470
				1471	Generates this:
				1472
				1473	define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
				1474	entry:
				1475	%0 = and i32 %a, 128 ; <i32> [#uses=1]
				1476	%1 = icmp eq i32 %0, 0 ; <i1> [#uses=1]
				1477	%2 = or i32 %b, 128 ; <i32> [#uses=1]
				1478	%3 = and i32 %b, -129 ; <i32> [#uses=1]
				1479	%b_addr.0 = select i1 %1, i32 %3, i32 %2 ; <i32> [#uses=1]
				1480	ret i32 %b_addr.0
				1481	}
				1482
				1483	However, it's functionally equivalent to:
				1484
				1485	b = (b & ~0x80) | (a & 0x80);
				1486
				1487	Which generates this:
				1488
				1489	define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
				1490	entry:
				1491	%0 = and i32 %b, -129 ; <i32> [#uses=1]
				1492	%1 = and i32 %a, 128 ; <i32> [#uses=1]
				1493	%2 = or i32 %0, %1 ; <i32> [#uses=1]
				1494	ret i32 %2
				1495	}
				1496
				1497	This can be generalized for other forms:
				1498
				1499	b = (b & ~0x80) | (a & 0x40) << 1;
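
A small C sketch that checks both the original equivalence and the shifted
generalization by brute force. All function names here are hypothetical, and
the branchy form of the shifted case is an assumed source pattern, not taken
from the note above:

```c
#include <assert.h>

/* Branchy form from the note: copy bit 7 of a into bit 7 of b. */
int func_branchy(int a, int b) {
    if (a & 0x80) b |= 0x80; else b &= ~0x80;
    return b;
}

/* Branch-free form. */
int func_merged(int a, int b) {
    return (b & ~0x80) | (a & 0x80);
}

/* Assumed generalization: copy bit 6 of a into bit 7 of b. */
int func_shifted_branchy(int a, int b) {
    if (a & 0x40) b |= 0x80; else b &= ~0x80;
    return b;
}

int func_shifted_merged(int a, int b) {
    return (b & ~0x80) | ((a & 0x40) << 1);
}

/* Count disagreements over all 9-bit values of a and b. */
int func_mismatches(void) {
    int bad = 0;
    for (int a = 0; a < 512; ++a)
        for (int b = 0; b < 512; ++b) {
            bad += func_branchy(a, b) != func_merged(a, b);
            bad += func_shifted_branchy(a, b) != func_shifted_merged(a, b);
        }
    return bad;
}
```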
				1500
				1501	//===---------------------------------------------------------------------===//
Bill Wendling	2e5198f	2009-10-27 23:30:07 +0000	[diff] [blame]	1502
				1503	These two functions produce different code. They shouldn't:
				1504
				1505	#include <stdint.h>
				1506
				1507	uint8_t p1(uint8_t b, uint8_t a) {
				1508	b = (b & ~0xc0) | (a & 0xc0);
				1509	return (b);
				1510	}
				1511
				1512	uint8_t p2(uint8_t b, uint8_t a) {
				1513	b = (b & ~0x40) | (a & 0x40);
				1514	b = (b & ~0x80) | (a & 0x80);
				1515	return (b);
				1516	}
				1517
				1518	define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
				1519	entry:
				1520	%0 = and i8 %b, 63 ; <i8> [#uses=1]
				1521	%1 = and i8 %a, -64 ; <i8> [#uses=1]
				1522	%2 = or i8 %1, %0 ; <i8> [#uses=1]
				1523	ret i8 %2
				1524	}
				1525
				1526	define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
				1527	entry:
				1528	%0 = and i8 %b, 63 ; <i8> [#uses=1]
				1529	%.masked = and i8 %a, 64 ; <i8> [#uses=1]
				1530	%1 = and i8 %a, -128 ; <i8> [#uses=1]
				1531	%2 = or i8 %1, %0 ; <i8> [#uses=1]
				1532	%3 = or i8 %2, %.masked ; <i8> [#uses=1]
				1533	ret i8 %3
				1534	}
				1535
				1536	//===---------------------------------------------------------------------===//
Chris Lattner	539bdf0	2009-11-11 17:51:27 +0000	[diff] [blame]	1537
				1538	IPSCCP does not currently propagate argument dependent constants through
				1539	functions where it does not see all of the callers. This includes functions
				1540	with normal external linkage as well as templates, C99 inline functions etc.
				1541	Specifically, it does nothing to:
				1542
				1543	define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
				1544	entry:
				1545	%0 = add nsw i32 %y, %z
				1546	%1 = mul i32 %0, %x
				1547	%2 = mul i32 %y, %z
				1548	%3 = add nsw i32 %1, %2
				1549	ret i32 %3
				1550	}
				1551
				1552	define i32 @test2() nounwind {
				1553	entry:
				1554	%0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
				1555	ret i32 %0
				1556	}
				1557
				1558	It would be interesting to extend IPSCCP to be able to handle simple cases like
				1559	this, where all of the arguments to a call are constant. Because IPSCCP runs
				1560	before inlining, trivial templates and inline functions are not yet inlined.
				1561	The results for a function + set of constant arguments should be memoized in a
				1562	map.
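
As a worked instance, here is a C rendering of @test and @test2 (a sketch; the
C signatures and test2_folded are assumptions for illustration). With the
constant arguments 1, 2, 4 the call evaluates to (2 + 4) * 1 + 2 * 4 = 14,
which is the value an extended IPSCCP could substitute for the call:

```c
#include <assert.h>

/* C rendering of @test. */
int test(int x, int y, int z) {
    return (y + z) * x + y * z;
}

/* @test2 as written: a call whose arguments are all constants. */
int test2(void) {
    return test(1, 2, 4);
}

/* What test2 could be reduced to once the call is evaluated. */
int test2_folded(void) {
    return 14;
}
```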
				1563
				1564	//===---------------------------------------------------------------------===//
Chris Lattner	7a09964	2009-11-11 17:54:02 +0000	[diff] [blame]	1565
				1566	The libcall constant folding stuff should be moved out of SimplifyLibcalls into
				1567	libanalysis' constantfolding logic. This would allow IPSCCP to be able to
				1568	handle simple things like this:
				1569
				1570	static int foo(const char *X) { return strlen(X); }
				1571	int bar() { return foo("abcd"); }
				1572
				1573	//===---------------------------------------------------------------------===//
Nick Lewycky	ef4ea9a	2009-11-15 17:51:23 +0000	[diff] [blame]	1574
Duncan Sands	c8493da	2010-01-06 15:37:47 +0000	[diff] [blame]	1575	functionattrs doesn't know much about memcpy/memset. This function should be
Duncan Sands	78376ad	2010-01-06 08:45:52 +0000	[diff] [blame]	1576	marked readnone rather than readonly, since it only twiddles local memory, but
				1577	functionattrs doesn't handle memset/memcpy/memmove aggressively:
Chris Lattner	f05330a	2009-12-03 07:43:46 +0000	[diff] [blame]	1578
				1579	struct X { int *p; int *q; };
				1580	int foo() {
				1581	int i = 0, j = 1;
				1582	struct X x, y;
				1583	int **p;
				1584	y.p = &i;
				1585	x.q = &j;
				1586	p = __builtin_memcpy (&x, &y, sizeof (int *));
				1587	return **p;
				1588	}
				1589
Chris Lattner	e5d5a41	2011-01-01 22:52:11 +0000	[diff] [blame]	1590	This can be seen at:
				1591	$ clang t.c -S -o - -mkernel -O0 -emit-llvm | opt -functionattrs -S
				1592
				1593
Chris Lattner	d1e4ee3	2009-12-03 07:41:54 +0000	[diff] [blame]	1594	//===---------------------------------------------------------------------===//
				1595
Eli Friedman	9ed49c5	2010-01-18 22:36:59 +0000	[diff] [blame]	1596	Missed instcombine transformation:
				1597	define i1 @a(i32 %x) nounwind readnone {
				1598	entry:
				1599	%cmp = icmp eq i32 %x, 30
				1600	%sub = add i32 %x, -30
				1601	%cmp2 = icmp ugt i32 %sub, 9
				1602	%or = or i1 %cmp, %cmp2
				1603	ret i1 %or
				1604	}
				1605	This should be optimized to a single compare. Testcase derived from gcc.
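
A sketch of one candidate single compare, checked by brute force. The original
is false exactly when x is in [31, 39], which is the same as (x - 31) <=u 8.
The names below are hypothetical, and this is not necessarily the form
instcombine would pick:

```c
#include <assert.h>
#include <stdint.h>

/* The IR above: x == 30, or (x - 30) >u 9. */
int a_original(uint32_t x) {
    return (x == 30) | ((x - 30) > 9);
}

/* Candidate single compare. */
int a_single(uint32_t x) {
    return (x - 31) > 8;
}

/* Count disagreements near zero and near the wraparound point. */
int a_mismatches(void) {
    int bad = 0;
    for (uint32_t x = 0; x < 4096; ++x)
        bad += a_original(x) != a_single(x);
    for (uint32_t i = 0; i < 64; ++i)
        bad += a_original(UINT32_MAX - i) != a_single(UINT32_MAX - i);
    return bad;
}
```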
				1606
				1607	//===---------------------------------------------------------------------===//
				1608
Eli Friedman	9ed49c5	2010-01-18 22:36:59 +0000	[diff] [blame]	1609	Missed instcombine or reassociate transformation:
				1610	int a(int a, int b) { return (a==12)&(b>47)&(b<58); }
				1611
				1612	The sgt and slt should be combined into a single comparison. Testcase derived
				1613	from gcc.
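
A sketch of the combined form, assuming the usual unsigned range-check trick:
b > 47 && b < 58 holds exactly when (unsigned)b - 48 <u 10. a_combined is a
hypothetical name.

```c
#include <assert.h>

/* Original testcase. */
int a_original(int a, int b) {
    return (a == 12) & (b > 47) & (b < 58);
}

/* sgt + slt merged into one unsigned compare. The cast happens before the
   subtraction so that no signed overflow can occur. */
int a_combined(int a, int b) {
    return (a == 12) & ((unsigned)b - 48u < 10u);
}
```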
				1614
				1615	//===---------------------------------------------------------------------===//
				1616
				1617	Missed instcombine transformation:
Chris Lattner	9165d9d	2010-11-21 07:05:31 +0000	[diff] [blame]	1618
				1619	%382 = srem i32 %tmp14.i, 64 ; [#uses=1]
				1620	%383 = zext i32 %382 to i64 ; [#uses=1]
				1621	%384 = shl i64 %381, %383 ; [#uses=1]
				1622	%385 = icmp slt i32 %tmp14.i, 64 ; [#uses=1]
				1623
Benjamin Kramer	94a622a	2010-11-23 20:33:57 +0000	[diff] [blame]	1624	The srem can be transformed to an and because if %tmp14.i is negative, the
				1625	shift is undefined. Testcase derived from 403.gcc.
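
The reasoning can be sketched in C (assuming two's complement ints;
srem_to_and_is_safe is a hypothetical helper). A negative remainder
zero-extends to a shift amount of at least 2^32 - 63, so the i64 shift is
undefined; in every other case the remainder equals n & 63:

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when replacing n % 64 with n & 63 cannot change a defined shift. */
int srem_to_and_is_safe(int32_t n) {
    int32_t rem = n % 64;      /* srem: result takes the sign of n */
    if (rem < 0)
        return 1;              /* zext of rem is >= 64: shift is undefined */
    return rem == (n & 63);    /* defined cases must agree exactly */
}
```

Note that negative multiples of 64 give a remainder of 0, which is a defined
shift amount, and n & 63 is also 0 there.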
Chris Lattner	9165d9d	2010-11-21 07:05:31 +0000	[diff] [blame]	1626
				1627	//===---------------------------------------------------------------------===//
				1628
				1629	This is a range comparison on a divided result (from 403.gcc):
				1630
				1631	%1337 = sdiv i32 %1336, 8 ; [#uses=1]
				1632	%.off.i208 = add i32 %1336, 7 ; [#uses=1]
				1633	%1338 = icmp ult i32 %.off.i208, 15 ; [#uses=1]
				1634
				1635	We already catch this (removing the sdiv) if there isn't an add; we should
				1636	handle the 'add' as well. This is a common idiom with its builtin_alloca code.
				1637	C testcase:
				1638
				1639	int a(int x) { return (unsigned)(x/16+7) < 15; }
				1640
				1641	Another similar case involves truncations on 64-bit targets:
				1642
				1643	%361 = sdiv i64 %.046, 8 ; [#uses=1]
				1644	%362 = trunc i64 %361 to i32 ; [#uses=2]
				1645	...
				1646	%367 = icmp eq i32 %362, 0 ; [#uses=1]
				1647
Eli Friedman	0de0b36	2010-01-31 04:55:32 +0000	[diff] [blame]	1648	//===---------------------------------------------------------------------===//
				1649
				1650	Missed instcombine/dagcombine transformation:
				1651	define void @lshift_lt(i8 zeroext %a) nounwind {
				1652	entry:
				1653	%conv = zext i8 %a to i32
				1654	%shl = shl i32 %conv, 3
				1655	%cmp = icmp ult i32 %shl, 33
				1656	br i1 %cmp, label %if.then, label %if.end
				1657
				1658	if.then:
				1659	tail call void @bar() nounwind
				1660	ret void
				1661
				1662	if.end:
				1663	ret void
				1664	}
				1665	declare void @bar() nounwind
				1666
				1667	The shift should be eliminated. Testcase derived from gcc.
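
A sketch of the compare without the shift: since a is zero-extended from 8
bits, (a << 3) <u 33 is the same as a <u 5. The function names are
hypothetical, and the exact constant a combine would choose is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Condition as written in the IR. */
int lshift_lt_original(uint8_t a) {
    return ((uint32_t)a << 3) < 33;
}

/* Same condition with the shift folded into the constant. */
int lshift_lt_reduced(uint8_t a) {
    return a < 5;
}

/* Count disagreements over every possible i8 input. */
int lshift_lt_mismatches(void) {
    int bad = 0;
    for (unsigned a = 0; a < 256; ++a)
        bad += lshift_lt_original((uint8_t)a) != lshift_lt_reduced((uint8_t)a);
    return bad;
}
```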
Eli Friedman	9ed49c5	2010-01-18 22:36:59 +0000	[diff] [blame]	1668
				1669	//===---------------------------------------------------------------------===//
Chris Lattner	187242b	2010-02-09 00:11:10 +0000	[diff] [blame]	1670
				1671	These compile into different code, one gets recognized as a switch and the
				1672	other doesn't due to phase ordering issues (PR6212):
				1673
				1674	int test1(int mainType, int subType) {
				1675	if (mainType == 7)
				1676	subType = 4;
				1677	else if (mainType == 9)
				1678	subType = 6;
				1679	else if (mainType == 11)
				1680	subType = 9;
				1681	return subType;
				1682	}
				1683
				1684	int test2(int mainType, int subType) {
				1685	if (mainType == 7)
				1686	subType = 4;
				1687	if (mainType == 9)
				1688	subType = 6;
				1689	if (mainType == 11)
				1690	subType = 9;
				1691	return subType;
				1692	}
				1693
				1694	//===---------------------------------------------------------------------===//
Chris Lattner	1f6689a	2010-03-10 21:42:42 +0000	[diff] [blame]	1695
				1696	The following test case (from PR6576):
				1697
				1698	define i32 @mul(i32 %a, i32 %b) nounwind readnone {
				1699	entry:
				1700	%cond1 = icmp eq i32 %b, 0 ; <i1> [#uses=1]
				1701	br i1 %cond1, label %exit, label %bb.nph
				1702	bb.nph: ; preds = %entry
				1703	%tmp = mul i32 %b, %a ; <i32> [#uses=1]
				1704	ret i32 %tmp
				1705	exit: ; preds = %entry
				1706	ret i32 0
				1707	}
				1708
				1709	could be reduced to:
				1710
				1711	define i32 @mul(i32 %a, i32 %b) nounwind readnone {
				1712	entry:
				1713	%tmp = mul i32 %b, %a
				1714	ret i32 %tmp
				1715	}
				1716
				1717	//===---------------------------------------------------------------------===//
				1718
Chris Lattner	cfc921c	2010-04-16 23:52:30 +0000	[diff] [blame]	1719	We should use DSE + llvm.lifetime.end to delete dead vtable pointer updates.
				1720	See GCC PR34949
				1721
Chris Lattner	4dc833c	2010-05-21 23:16:21 +0000	[diff] [blame]	1722	Another interesting case is that something related could be used for variables
				1723	that go const after their ctor has finished. In these cases, globalopt (which
				1724	can statically run the constructor) could mark the global const (so it gets put
				1725	in the readonly section). A testcase would be:
				1726
				1727	#include <complex>
				1728	using namespace std;
				1729	const complex<char> should_be_in_rodata (42,-42);
				1730	complex<char> should_be_in_data (42,-42);
				1731	complex<char> should_be_in_bss;
				1732
				1733	Where we currently evaluate the ctors but the globals don't become const because
				1734	the optimizer doesn't know they "become const" after the ctor is done. See
				1735	GCC PR4131 for more examples.
				1736
Chris Lattner	cfc921c	2010-04-16 23:52:30 +0000	[diff] [blame]	1737	//===---------------------------------------------------------------------===//
				1738
Dan Gohman	73c8145	2010-05-03 14:31:00 +0000	[diff] [blame]	1739	In this code:
				1740
				1741	long foo(long x) {
				1742	return x > 1 ? x : 1;
				1743	}
				1744
				1745	LLVM emits a comparison with 1 instead of 0. 0 would be equivalent
				1746	and cheaper on most targets.
				1747
				1748	LLVM prefers comparisons with zero over non-zero in general, but in this
				1749	case it chooses instead to keep the max operation obvious.
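
A sketch of the two forms (foo_zero is a hypothetical name). They can only
differ at x == 1, where the first takes the "else" arm and returns 1 and the
second returns x, which is also 1:

```c
#include <assert.h>

/* As written: compare against 1. */
long foo(long x) {
    return x > 1 ? x : 1;
}

/* Equivalent: compare against 0. */
long foo_zero(long x) {
    return x > 0 ? x : 1;
}
```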
				1750
				1751	//===---------------------------------------------------------------------===//
Eli Friedman	e17e4ae	2010-06-12 05:54:27 +0000	[diff] [blame]	1752
				1753	Take the following testcase on x86-64 (similar testcases exist for all targets
				1754	with addc/adde):
				1755
				1756	define void @a(i64* nocapture %s, i64* nocapture %t, i64 %a, i64 %b,
				1757	i64 %c) nounwind {
				1758	entry:
				1759	%0 = zext i64 %a to i128 ; <i128> [#uses=1]
				1760	%1 = zext i64 %b to i128 ; <i128> [#uses=1]
				1761	%2 = add i128 %1, %0 ; <i128> [#uses=2]
				1762	%3 = zext i64 %c to i128 ; <i128> [#uses=1]
				1763	%4 = shl i128 %3, 64 ; <i128> [#uses=1]
				1764	%5 = add i128 %4, %2 ; <i128> [#uses=1]
				1765	%6 = lshr i128 %5, 64 ; <i128> [#uses=1]
				1766	%7 = trunc i128 %6 to i64 ; <i64> [#uses=1]
				1767	store i64 %7, i64* %s, align 8
				1768	%8 = trunc i128 %2 to i64 ; <i64> [#uses=1]
				1769	store i64 %8, i64* %t, align 8
				1770	ret void
				1771	}
				1772
				1773	Generated code:
Eli Friedman	5f75515	2011-02-16 07:17:44 +0000	[diff] [blame]	1774	addq %rcx, %rdx
				1775	sbbq %rax, %rax
				1776	subq %rax, %r8
				1777	movq %r8, (%rdi)
				1778	movq %rdx, (%rsi)
				1779	ret
Eli Friedman	e17e4ae	2010-06-12 05:54:27 +0000	[diff] [blame]	1780
				1781	Expected code:
				1782	addq %rcx, %rdx
				1783	adcq $0, %r8
				1784	movq %r8, (%rdi)
				1785	movq %rdx, (%rsi)
				1786	ret
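
For reference, a portable C sketch of what the IR computes, written with an
explicit carry instead of i128. a_carry is a hypothetical name; the comments
map each step to the expected instructions above.

```c
#include <assert.h>
#include <stdint.h>

void a_carry(uint64_t *s, uint64_t *t, uint64_t a, uint64_t b, uint64_t c) {
    uint64_t lo = a + b;        /* addq: low 64 bits of a + b */
    uint64_t carry = lo < a;    /* carry out of the low add */
    *s = c + carry;             /* adcq $0: high word is c plus the carry */
    *t = lo;
}
```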
				1787
Eli Friedman	e17e4ae	2010-06-12 05:54:27 +0000	[diff] [blame]	1788	//===---------------------------------------------------------------------===//
Eli Friedman	836fdbc	2010-07-03 07:38:12 +0000	[diff] [blame]	1789
				1790	Switch lowering generates less than ideal code for the following switch:
				1791	define void @a(i32 %x) nounwind {
				1792	entry:
				1793	switch i32 %x, label %if.end [
				1794	i32 0, label %if.then
				1795	i32 1, label %if.then
				1796	i32 2, label %if.then
				1797	i32 3, label %if.then
				1798	i32 5, label %if.then
				1799	]
				1800	if.then:
				1801	tail call void @foo() nounwind
				1802	ret void
				1803	if.end:
				1804	ret void
				1805	}
				1806	declare void @foo()
				1807
				1808	Generated code on x86-64 (other platforms give similar results):
				1809	a:
				1810	cmpl $5, %edi
				1811	ja .LBB0_2
				1812	movl %edi, %eax
				1813	movl $47, %ecx
				1814	btq %rax, %rcx
				1815	jb .LBB0_3
				1816	.LBB0_2:
				1817	ret
				1818	.LBB0_3:
Eli Friedman	c8f5952	2010-07-03 08:43:32 +0000	[diff] [blame]	1819	jmp foo # TAILCALL
Eli Friedman	836fdbc	2010-07-03 07:38:12 +0000	[diff] [blame]	1820
				1821	The movl+movl+btq+jb could be simplified to a cmpl+jne.
				1822
Eli Friedman	c8f5952	2010-07-03 08:43:32 +0000	[diff] [blame]	1823	Or, if we wanted to be really clever, we could simplify the whole thing to
				1824	something like the following, which eliminates a branch:
				1825	xorl $1, %edi
				1826	cmpl $4, %edi
				1827	jbe .LBB0_2
				1828	ret
				1829	.LBB0_2:
				1830	jmp foo # TAILCALL
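
A C sketch of the three formulations, checked against each other. The names
are hypothetical. The mask 47 is 0b101111, i.e. bits 0, 1, 2, 3 and 5; the xor
maps {0, 1, 2, 3, 5} onto the contiguous range [0, 4], so foo is reached
exactly when (x ^ 1) <=u 4.

```c
#include <assert.h>
#include <stdint.h>

/* The switch itself: 1 when control reaches if.then. */
int in_set_switch(uint32_t x) {
    switch (x) {
    case 0: case 1: case 2: case 3: case 5: return 1;
    default: return 0;
    }
}

/* Range check plus bit test, as in the generated code. */
int in_set_bittest(uint32_t x) {
    return x <= 5 && ((47u >> x) & 1u);
}

/* Single compare after flipping the low bit. */
int in_set_xor(uint32_t x) {
    return (x ^ 1u) <= 4;
}

/* Count disagreements near zero and near UINT32_MAX. */
int in_set_mismatches(void) {
    int bad = 0;
    for (uint32_t x = 0; x < 1024; ++x) {
        bad += in_set_switch(x) != in_set_bittest(x);
        bad += in_set_switch(x) != in_set_xor(x);
    }
    for (uint32_t i = 0; i < 16; ++i)
        bad += in_set_switch(UINT32_MAX - i) != in_set_xor(UINT32_MAX - i);
    return bad;
}
```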
				1831
Eli Friedman	836fdbc	2010-07-03 07:38:12 +0000	[diff] [blame]	1832	//===---------------------------------------------------------------------===//
Chris Lattner	4d94e47	2010-11-09 19:37:28 +0000	[diff] [blame]	1833
Chris Lattner	932aab3	2010-11-11 18:23:57 +0000	[diff] [blame]	1834	We compile this:
				1835
				1836	int foo(int a) { return (a & (~15)) / 16; }
				1837
				1838	Into:
				1839
				1840	define i32 @foo(i32 %a) nounwind readnone ssp {
				1841	entry:
				1842	%and = and i32 %a, -16
				1843	%div = sdiv i32 %and, 16
				1844	ret i32 %div
				1845	}
				1846
				1847	but this code (X & -A)/A is X >> log2(A) when A is a power of 2, so this case
				1848	should be instcombined into just "a >> 4".
				1849
				1850	We do get this at the codegen level, so something knows about it, but
				1851	instcombine should catch it earlier:
				1852
				1853	_foo: ## @foo
				1854	## BB#0: ## %entry
				1855	movl %edi, %eax
				1856	sarl $4, %eax
				1857	ret
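
A quick C check of the identity for A = 16. This is a sketch that assumes >>
on a negative int is an arithmetic shift, which is implementation-defined in C
but matches the sarl above. foo_div and foo_shift are hypothetical names.

```c
#include <assert.h>

/* As written in the source. */
int foo_div(int a) {
    return (a & ~15) / 16;
}

/* Target form: the mask already rounded down to a multiple of 16, so the
   division is exact and equals an arithmetic shift. */
int foo_shift(int a) {
    return a >> 4;
}
```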
				1858
				1859	//===---------------------------------------------------------------------===//
				1860
Chris Lattner	14cb11d	2010-12-13 00:15:25 +0000	[diff] [blame]	1861	This code (from GCC PR28685):
				1862
				1863	int test(int a, int b) {
				1864	int lt = a < b;
				1865	int eq = a == b;
				1866	if (lt)
				1867	return 1;
				1868	return eq;
				1869	}
				1870
				1871	Is compiled to:
				1872
				1873	define i32 @test(i32 %a, i32 %b) nounwind readnone ssp {
				1874	entry:
				1875	%cmp = icmp slt i32 %a, %b
				1876	br i1 %cmp, label %return, label %if.end
				1877
				1878	if.end: ; preds = %entry
				1879	%cmp5 = icmp eq i32 %a, %b
				1880	%conv6 = zext i1 %cmp5 to i32
				1881	ret i32 %conv6
				1882
				1883	return: ; preds = %entry
				1884	ret i32 1
				1885	}
				1886
				1887	it could be:
				1888
				1889	define i32 @test__(i32 %a, i32 %b) nounwind readnone ssp {
				1890	entry:
				1891	%0 = icmp sle i32 %a, %b
				1892	%retval = zext i1 %0 to i32
				1893	ret i32 %retval
				1894	}
				1895
				1896	//===---------------------------------------------------------------------===//
Duncan Sands	772749a	2011-01-01 20:08:02 +0000	[diff] [blame]	1897
Benjamin Kramer	134cde9	2011-01-07 20:42:20 +0000	[diff] [blame]	1898	This code can be seen in viterbi:
				1899
				1900	%64 = call noalias i8* @malloc(i64 %62) nounwind
				1901	...
				1902	%67 = call i64 @llvm.objectsize.i64(i8* %64, i1 false) nounwind
				1903	%68 = call i8* @__memset_chk(i8* %64, i32 0, i64 %62, i64 %67) nounwind
				1904
				1905	llvm.objectsize.i64 should be taught about malloc/calloc, allowing it to
				1906	fold to %62. This is a security win (overflows of malloc will get caught)
				1907	and also a performance win by exposing more memsets to the optimizer.
				1908
				1909	This occurs several times in viterbi.
				1910
				1911	Note that this would change the semantics of @llvm.objectsize which by its
				1912	current definition always folds to a constant. We also should make sure that
				1913	we remove checking in code like
				1914
				1915	char *p = malloc(strlen(s)+1);
				1916	__strcpy_chk(p, s, __builtin_objectsize(p, 0));
				1917
				1918	//===---------------------------------------------------------------------===//
				1919
Chris Lattner	73552c2	2011-01-06 07:09:23 +0000	[diff] [blame]	1920	This code (from Benchmarks/Dhrystone/dry.c):
				1921
				1922	define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
				1923	entry:
				1924	%sext = shl i32 %0, 24
				1925	%conv = ashr i32 %sext, 24
				1926	%sext6 = shl i32 %1, 24
				1927	%conv4 = ashr i32 %sext6, 24
				1928	%cmp = icmp eq i32 %conv, %conv4
				1929	%. = select i1 %cmp, i32 10000, i32 0
				1930	ret i32 %.
				1931	}
				1932
				1933	Should be simplified into something like:
				1934
				1935	define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
				1936	entry:
				1937	%sext = shl i32 %0, 24
				1938	%conv = and i32 %sext, 0xFF000000
				1939	%sext6 = shl i32 %1, 24
				1940	%conv4 = and i32 %sext6, 0xFF000000
				1941	%cmp = icmp eq i32 %conv, %conv4
				1942	%. = select i1 %cmp, i32 10000, i32 0
				1943	ret i32 %.
				1944	}
				1945
				1946	and then to:
				1947
				1948	define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
				1949	entry:
				1950	%conv = and i32 %0, 0xFF
				1951	%conv4 = and i32 %1, 0xFF
				1952	%cmp = icmp eq i32 %conv, %conv4
				1953	%. = select i1 %cmp, i32 10000, i32 0
				1954	ret i32 %.
				1955	}
				1956	//===---------------------------------------------------------------------===//
Chris Lattner	6c3fc0a	2011-01-01 22:57:31 +0000	[diff] [blame]	1957
Benjamin Kramer	1e01ade	2011-01-06 17:35:50 +0000	[diff] [blame]	1958	clang -O3 currently compiles this code
				1959
				1960	int g(unsigned int a) {
				1961	unsigned int c[100];
				1962	c[10] = a;
				1963	c[11] = a;
				1964	unsigned int b = c[10] + c[11];
				1965	if(b > a*2) a = 4;
				1966	else a = 8;
				1967	return a + 7;
				1968	}
				1969
				1970	into
				1971
				1972	define i32 @g(i32 %a) nounwind readnone {
				1973	%add = shl i32 %a, 1
				1974	%mul = shl i32 %a, 1
				1975	%cmp = icmp ugt i32 %add, %mul
				1976	%a.addr.0 = select i1 %cmp, i32 11, i32 15
				1977	ret i32 %a.addr.0
				1978	}
				1979
				1980	The icmp should fold to false. This CSE opportunity is only available
				1981	after GVN and InstCombine have run.
				1982
				1983	//===---------------------------------------------------------------------===//
Chris Lattner	84184b7	2011-01-06 22:25:00 +0000	[diff] [blame]	1984
				1985	memcpyopt should turn this:
				1986
				1987	define i8* @test10(i32 %x) {
				1988	%alloc = call noalias i8* @malloc(i32 %x) nounwind
				1989	call void @llvm.memset.p0i8.i32(i8* %alloc, i8 0, i32 %x, i32 1, i1 false)
				1990	ret i8* %alloc
				1991	}
				1992
				1993	into a call to calloc. We should make sure that we analyze calloc as
				1994	aggressively as malloc though.
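
In C terms the transformation looks roughly like this. It is a sketch: the
null check is an addition for safety in the C version and is not in the IR,
and the function names and all_zero helper are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Before: malloc followed by a memset of the whole allocation. */
void *test10_original(size_t x) {
    void *p = malloc(x);
    if (p)                  /* not present in the IR above */
        memset(p, 0, x);
    return p;
}

/* After: a single calloc call. */
void *test10_calloc(size_t x) {
    return calloc(1, x);
}

/* Helper: 1 if the first n bytes at p are all zero. */
int all_zero(const void *p, size_t n) {
    const unsigned char *b = p;
    for (size_t i = 0; i < n; ++i)
        if (b[i] != 0)
            return 0;
    return 1;
}
```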
				1995
				1996	//===---------------------------------------------------------------------===//
Chandler Carruth	5d684c1	2011-01-09 01:32:55 +0000	[diff] [blame]	1997
Chris Lattner	78cdd2a	2011-01-10 21:01:17 +0000	[diff] [blame]	1998	clang -O3 doesn't optimize this:
Chandler Carruth	5d684c1	2011-01-09 01:32:55 +0000	[diff] [blame]	1999
				2000	void f1(int* begin, int* end) {
				2001	std::fill(begin, end, 0);
				2002	}
				2003
Chris Lattner	5b358c6	2011-01-09 23:48:41 +0000	[diff] [blame]	2004	into a memset. This is PR8942.
Chandler Carruth	5d684c1	2011-01-09 01:32:55 +0000	[diff] [blame]	2005
				2006	//===---------------------------------------------------------------------===//
Chandler Carruth	f326193	2011-01-09 09:58:33 +0000	[diff] [blame]	2007
				2008	clang -O3 -fno-exceptions currently compiles this code:
				2009
				2010	void f(int N) {
				2011	std::vector<int> v(N);
Chandler Carruth	43f6d1b	2011-01-09 10:10:59 +0000	[diff] [blame]	2012
				2013	extern void sink(void*); sink(&v);
Chandler Carruth	f326193	2011-01-09 09:58:33 +0000	[diff] [blame]	2014	}
				2015
				2016	into
				2017
				2018	define void @_Z1fi(i32 %N) nounwind {
				2019	entry:
				2020	%v2 = alloca [3 x i32*], align 8
				2021	%v2.sub = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 0
				2022	%tmpcast = bitcast [3 x i32*]* %v2 to %"class.std::vector"*
				2023	%conv = sext i32 %N to i64
				2024	store i32* null, i32** %v2.sub, align 8, !tbaa !0
				2025	%tmp3.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 1
				2026	store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
				2027	%tmp4.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 2
				2028	store i32* null, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
				2029	%cmp.i.i.i.i = icmp eq i32 %N, 0
				2030	br i1 %cmp.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i, label %cond.true.i.i.i.i
				2031
				2032	_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i: ; preds = %entry
				2033	store i32* null, i32** %v2.sub, align 8, !tbaa !0
				2034	store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
				2035	%add.ptr.i5.i.i = getelementptr inbounds i32* null, i64 %conv
				2036	store i32* %add.ptr.i5.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
				2037	br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit
				2038
				2039	cond.true.i.i.i.i: ; preds = %entry
				2040	%cmp.i.i.i.i.i = icmp slt i32 %N, 0
				2041	br i1 %cmp.i.i.i.i.i, label %if.then.i.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i
				2042
				2043	if.then.i.i.i.i.i: ; preds = %cond.true.i.i.i.i
				2044	call void @_ZSt17__throw_bad_allocv() noreturn nounwind
				2045	unreachable
				2046
				2047	_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i: ; preds = %cond.true.i.i.i.i
				2048	%mul.i.i.i.i.i = shl i64 %conv, 2
				2049	%call3.i.i.i.i.i = call noalias i8* @_Znwm(i64 %mul.i.i.i.i.i) nounwind
				2050	%0 = bitcast i8* %call3.i.i.i.i.i to i32*
				2051	store i32* %0, i32** %v2.sub, align 8, !tbaa !0
				2052	store i32* %0, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
				2053	%add.ptr.i.i.i = getelementptr inbounds i32* %0, i64 %conv
				2054	store i32* %add.ptr.i.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
				2055	call void @llvm.memset.p0i8.i64(i8* %call3.i.i.i.i.i, i8 0, i64 %mul.i.i.i.i.i, i32 4, i1 false)
				2056	br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit
				2057
				2058	This is just the code handling the construction of the vector. Most surprising here
Chris Lattner	c326ebd	2011-01-16 06:39:44 +0000	[diff] [blame]	2059	is the fact that all three null stores in %entry are dead (because we do no
				2060	cross-block DSE).
				2061
Chandler Carruth	f326193	2011-01-09 09:58:33 +0000	[diff] [blame]	2062	Also surprising is that %conv isn't simplified to 0 in %....exit.thread.i.i.
Chris Lattner	c326ebd	2011-01-16 06:39:44 +0000	[diff] [blame]	2063	This is because the client of LazyValueInfo doesn't simplify all instruction
				2064	operands, just selected ones.
Chandler Carruth	f326193	2011-01-09 09:58:33 +0000	[diff] [blame]	2065
				2066	//===---------------------------------------------------------------------===//
Chandler Carruth	ad6e1f0	2011-01-09 09:58:36 +0000	[diff] [blame]	2067
				2068	clang -O3 -fno-exceptions currently compiles this code:
				2069
Chandler Carruth	ef28abe	2011-01-16 01:40:23 +0000	[diff] [blame]	2070	void f(char* a, int n) {
				2071	__builtin_memset(a, 0, n);
				2072	for (int i = 0; i < n; ++i)
				2073	a[i] = 0;
Chandler Carruth	ad6e1f0	2011-01-09 09:58:36 +0000	[diff] [blame]	2074	}
				2075
Chandler Carruth	ef28abe	2011-01-16 01:40:23 +0000	[diff] [blame]	2076	into:
Chandler Carruth	ad6e1f0	2011-01-09 09:58:36 +0000	[diff] [blame]	2077
Chandler Carruth	ef28abe	2011-01-16 01:40:23 +0000	[diff] [blame]	2078	define void @_Z1fPci(i8* nocapture %a, i32 %n) nounwind {
				2079	entry:
				2080	%conv = sext i32 %n to i64
				2081	tail call void @llvm.memset.p0i8.i64(i8* %a, i8 0, i64 %conv, i32 1, i1 false)
				2082	%cmp8 = icmp sgt i32 %n, 0
				2083	br i1 %cmp8, label %for.body.lr.ph, label %for.end
Chandler Carruth	ad6e1f0	2011-01-09 09:58:36 +0000	[diff] [blame]	2084
Chandler Carruth	ef28abe	2011-01-16 01:40:23 +0000	[diff] [blame]	2085	for.body.lr.ph: ; preds = %entry
				2086	%tmp10 = add i32 %n, -1
				2087	%tmp11 = zext i32 %tmp10 to i64
				2088	%tmp12 = add i64 %tmp11, 1
				2089	call void @llvm.memset.p0i8.i64(i8* %a, i8 0, i64 %tmp12, i32 1, i1 false)
				2090	ret void
Chandler Carruth	ad6e1f0	2011-01-09 09:58:36 +0000	[diff] [blame]	2091
Chandler Carruth	ef28abe	2011-01-16 01:40:23 +0000	[diff] [blame]	2092	for.end: ; preds = %entry
				2093	ret void
				2094	}
				2095
				2096	This shouldn't need the ((zext (%n - 1)) + 1) game, and it should ideally fold
				2097	the two memset's together. The issue with %n seems to stem from poor handling
				2098	of the original loop.
Chandler Carruth	ad6e1f0	2011-01-09 09:58:36 +0000	[diff] [blame]	2099
Chris Lattner	c326ebd	2011-01-16 06:39:44 +0000	[diff] [blame]	2100	To simplify this, we need SCEV to know that "n != 0" because of the dominating
				2101	conditional. That would turn the second memset into a simple memset of 'n'.
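
The ideal end state, sketched in C under the assumption that n is
non-negative (a negative n would already make the first memset invalid).
f_original and f_folded are hypothetical names:

```c
#include <assert.h>
#include <string.h>

/* As written: a memset followed by a loop that zeroes the same bytes. */
void f_original(char *a, int n) {
    assert(n >= 0);
    memset(a, 0, (size_t)n);
    for (int i = 0; i < n; ++i)
        a[i] = 0;
}

/* Desired result: the loop becomes a memset of n bytes, which is then
   merged with the first one. */
void f_folded(char *a, int n) {
    assert(n >= 0);
    memset(a, 0, (size_t)n);
}
```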
				2102
Chandler Carruth	ad6e1f0	2011-01-09 09:58:36 +0000	[diff] [blame]	2103	//===---------------------------------------------------------------------===//
Chandler Carruth	82e6f6a	2011-01-09 11:29:57 +0000	[diff] [blame]	2104
				2105	clang -O3 -fno-exceptions currently compiles this code:
				2106
				2107	struct S {
				2108	unsigned short m1, m2;
				2109	unsigned char m3, m4;
				2110	};
				2111
				2112	void f(int N) {
				2113	std::vector<S> v(N);
				2114	extern void sink(void*); sink(&v);
				2115	}
				2116
				2117	into poor code for zero-initializing 'v' when N is >0. The problem is that
				2118	S is only 6 bytes, but each element is 8 byte-aligned. We generate a loop and
				2119	4 stores on each iteration. If the struct were 8 bytes, this gets turned into
				2120	a memset.
				2121
Chris Lattner	c326ebd	2011-01-16 06:39:44 +0000	[diff] [blame]	2122	In order to handle this we have to:
				2123	A) Teach clang to generate metadata for memsets of structs that have holes in
				2124	them.
				2125	B) Teach clang to use such a memset for zero init of this struct (since it has
				2126	a hole), instead of doing elementwise zeroing.
				2127
Chandler Carruth	82e6f6a	2011-01-09 11:29:57 +0000	[diff] [blame]	2128	//===---------------------------------------------------------------------===//
Chandler Carruth	0c68a668	2011-01-09 21:00:19 +0000	[diff] [blame]	2129
				2130	clang -O3 currently compiles this code:
				2131
				2132	extern const int magic;
				2133	double f() { return 0.0 * magic; }
				2134
				2135	into
				2136
				2137	@magic = external constant i32
				2138
				2139	define double @_Z1fv() nounwind readnone {
				2140	entry:
				2141	%tmp = load i32* @magic, align 4, !tbaa !0
				2142	%conv = sitofp i32 %tmp to double
				2143	%mul = fmul double %conv, 0.000000e+00
				2144	ret double %mul
				2145	}
				2146
Chris Lattner	eef1455	2011-01-10 00:33:01 +0000	[diff] [blame]	2147	We should be able to fold away this fmul to 0.0. More generally, fmul(x,0.0)
				2148	can be folded to 0.0 if we can prove that the LHS is not -0.0, not a NaN, and
				2149	not an INF. The CannotBeNegativeZero predicate in value tracking should be
				2150	extended to support general "fpclassify" operations that can return
				2151	yes/no/unknown for each of these predicates.
				2152
Chris Lattner	78cdd2a	2011-01-10 21:01:17 +0000	[diff] [blame]	2153	In this predicate, we know that uitofp is trivially never NaN or -0.0, and
Chris Lattner	eef1455	2011-01-10 00:33:01 +0000	[diff] [blame]	2154	we know that it isn't +/-Inf if the floating point type has enough exponent bits
				2155	to represent the largest integer value as < inf.
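
The three excluded cases can be seen directly in C. This is only an
illustrative sketch, assuming IEEE-754 doubles and no fast-math flags; the
helper name is hypothetical:

```c
#include <assert.h>
#include <math.h>

/* Returns 1 if x * 0.0 is exactly +0.0 (neither -0.0 nor NaN). */
int mul_by_zero_is_pos_zero(double x) {
  double r = x * 0.0;
  return r == 0.0 && !signbit(r);
}
```

With this helper, -0.0, NaN and +/-Inf all fail the check, while the result
of converting any 32-bit unsigned integer to double passes it.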
Chandler Carruth	0c68a668	2011-01-09 21:00:19 +0000	[diff] [blame]	2156
				2157	//===---------------------------------------------------------------------===//
Chandler Carruth	d011d53	2011-01-09 22:36:18 +0000	[diff] [blame]	2158
Chris Lattner	78cdd2a	2011-01-10 21:01:17 +0000	[diff] [blame]	2159	When optimizing a transformation that can change the sign of 0.0 (such as the
				2160	0.0*val -> 0.0 transformation above), it might be provable that the sign of the
				2161	expression doesn't matter. For example, by the above rules, we can't transform
				2162	fmul(sitofp(x), 0.0) into 0.0, because x might be -1 and the result of the
				2163	expression is defined to be -0.0.
				2164
				2165	If we look at the uses of the fmul for example, we might be able to prove that
				2166	all uses don't care about the sign of zero. For example, if we have:
				2167
				2168	fadd(fmul(sitofp(x), 0.0), 2.0)
				2169
				2170	Since we know that x+2.0 doesn't care about the sign of any zeros in x, we can
				2171	transform the fmul to 0.0, and then the fadd to 2.0.
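
A small C sketch of why the sign of the zero is unobservable here, assuming
IEEE-754 doubles (the helper names are hypothetical). It compares bit
patterns so that +0.0 and -0.0 are distinguished:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit pattern of a double, so that +0.0 and -0.0 compare as different. */
uint64_t bits_of(double d) {
  uint64_t u;
  memcpy(&u, &d, sizeof u);
  return u;
}

/* fadd(fmul(sitofp(x), 0.0), 2.0), written out in C. */
double unfolded(int x) { return (double)x * 0.0 + 2.0; }
```

For x == -1 the intermediate product is -0.0, yet the sum has the same bit
pattern as the constant 2.0, so folding the whole expression to 2.0 does not
change the result.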
				2172
				2173	//===---------------------------------------------------------------------===//
Chris Lattner	b9cdf39	2011-01-13 22:08:15 +0000	[diff] [blame]	2174
				2175	We should enhance memcpy/memmove/memset to allow a metadata node on them
				2176	indicating that some bytes of the transfer are undefined. This is useful for
Chris Lattner	b6c3aff	2011-01-13 22:11:56 +0000	[diff] [blame]	2177	frontends like clang when lowering struct copies, when some elements of the
Chris Lattner	b9cdf39	2011-01-13 22:08:15 +0000	[diff] [blame]	2178	struct are undefined. Consider something like this:
				2179
				2180	struct x {
				2181	char a;
				2182	int b[4];
				2183	};
				2184	void foo(struct x*P);
				2185	struct x testfunc() {
				2186	struct x V1, V2;
				2187	foo(&V1);
				2188	V2 = V1;
				2189
				2190	return V2;
				2191	}
				2192
				2193	We currently compile this to:
				2194	$ clang t.c -S -o - -O0 -emit-llvm | opt -scalarrepl -S
				2195
				2196
				2197	%struct.x = type { i8, [4 x i32] }
				2198
				2199	define void @testfunc(%struct.x* sret %agg.result) nounwind ssp {
				2200	entry:
				2201	%V1 = alloca %struct.x, align 4
				2202	call void @foo(%struct.x* %V1)
				2203	%tmp1 = bitcast %struct.x* %V1 to i8*
				2204	%0 = bitcast %struct.x* %V1 to i160*
				2205	%srcval1 = load i160* %0, align 4
				2206	%tmp2 = bitcast %struct.x* %agg.result to i8*
				2207	%1 = bitcast %struct.x* %agg.result to i160*
				2208	store i160 %srcval1, i160* %1, align 4
				2209	ret void
				2210	}
				2211
				2212	This happens because SRoA sees that the temp alloca is being memcpy'd into
				2213	and out of, and because it has holes, SRoA has to be conservative. If we knew
				2214	about the holes, then this could be much much better.
				2215
				2216	Having information about these holes would also improve memcpy (etc) lowering at
				2217	llc time when it gets inlined, because we can use smaller transfers. This also
				2218	avoids partial register stalls in some important cases.
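
To make the "hole" concrete, the following C sketch measures the padding in
struct x. The helper names are hypothetical, and the sizes depend on the
target ABI:

```c
#include <assert.h>
#include <stddef.h>

struct x {
  char a;
  int b[4];
};

/* Number of padding bytes between 'a' and 'b' (the "hole"). */
size_t hole_bytes(void) {
  return offsetof(struct x, b) - (offsetof(struct x, a) + sizeof(char));
}

/* Bytes that actually carry data; the rest of sizeof(struct x) is padding. */
size_t payload_bytes(void) { return sizeof(char) + 4 * sizeof(int); }
```

On a typical ABI with 4-byte, 4-byte-aligned int, this gives 3 hole bytes and
17 payload bytes out of the 20 bytes copied by the i160 load/store above.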
				2219
				2220	//===---------------------------------------------------------------------===//
Chris Lattner	28bf91f	2011-02-16 19:16:34 +0000	[diff] [blame^]	2221
				2222	Some missed instcombine xforms (from GCC PR14753):
				2223
				2224	void bar (void);
				2225
				2226	void mask_gt (unsigned int a) {
				2227	/* This is equivalent to a > 15. */
				2228	if ((a & ~7) > 8)
				2229	bar();
				2230	}
				2231
				2232	void neg_eq_cst(unsigned int a) {
				2233	if (-a == 123)
				2234	bar();
				2235	}
				2236
				2237	void minus_cst(unsigned int a) {
				2238	if (20 - a == 5)
				2239	bar();
				2240	}
				2241
				2242	void rotate_cst (unsigned a) {
				2243	a = (a << 10) | (a >> 22);
				2244	if (a == 123)
				2245	bar ();
				2246	}
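
For reference, here is one plausible set of simplified forms for the four
tests above, written as a C sketch that assumes a 32-bit unsigned int. The
*_simple functions are illustrative guesses at the target of each xform, not
necessarily what instcombine would emit:

```c
#include <assert.h>
#include <stdint.h>

/* Original conditions from the test cases above. */
int mask_gt_orig(uint32_t a)    { return (a & ~7u) > 8u; }
int neg_eq_cst_orig(uint32_t a) { return (0u - a) == 123u; }
int minus_cst_orig(uint32_t a)  { return 20u - a == 5u; }
int rotate_cst_orig(uint32_t a) { return ((a << 10) | (a >> 22)) == 123u; }

/* Candidate simplified forms: a single compare against a constant. */
int mask_gt_simple(uint32_t a)    { return a > 15u; }
int neg_eq_cst_simple(uint32_t a) { return a == 0u - 123u; }
int minus_cst_simple(uint32_t a)  { return a == 15u; }
/* Rotate the constant right by 10 instead of rotating a left by 10. */
int rotate_cst_simple(uint32_t a) { return a == ((123u >> 10) | (123u << 22)); }

/* Returns 1 if every original/simplified pair agrees on a. */
int forms_agree(uint32_t a) {
  return mask_gt_orig(a) == mask_gt_simple(a) &&
         neg_eq_cst_orig(a) == neg_eq_cst_simple(a) &&
         minus_cst_orig(a) == minus_cst_simple(a) &&
         rotate_cst_orig(a) == rotate_cst_simple(a);
}
```

Calling forms_agree on boundary values such as 8, 15, 16, 0xFFFFFF85 and
0x1EC00000 exercises each pair on both sides of its comparison.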
				2247
				2248	//===---------------------------------------------------------------------===//
				2249