Blame - lib/Target/README.txt - fp2-dev/platform/external/llvm

blob: 2b7e9ed4ffd20c46d90452c02f100fb15e3af620

Chris Lattner	086c014	2006-02-03 06:21:43 +0000	[diff] [blame]	1	Target Independent Opportunities:
				2
Chris Lattner	f308ea0	2006-09-28 06:01:17 +0000	[diff] [blame]	3	//===---------------------------------------------------------------------===//
				4
Chris Lattner	9b62b45	2006-11-14 01:57:53 +0000	[diff] [blame]	5	With the recent changes to make the implicit def/use set explicit in
				6	machineinstrs, we should change the target descriptions for 'call' instructions
				7	so that the .td files don't list all the call-clobbered registers as implicit
				8	defs. Instead, these should be added by the code generator (e.g. on the dag).
				9
				10	This has a number of uses:
				11
				12	1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
				13	for their different impdef sets.
				14	2. Targets with multiple calling convs (e.g. x86) which have different clobber
				15	sets don't need copies of call instructions.
				16	3. 'Interprocedural register allocation' can be done to reduce the clobber sets
				17	of calls.
				18
				19	//===---------------------------------------------------------------------===//
				20
Nate Begeman	81e8097	2006-03-17 01:40:33 +0000	[diff] [blame]	21	Make the PPC branch selector target independent.
				22
				23	//===---------------------------------------------------------------------===//
Chris Lattner	086c014	2006-02-03 06:21:43 +0000	[diff] [blame]	24
				25	Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
Chris Lattner	2dae65d	2008-12-10 01:30:48 +0000	[diff] [blame]	26	precision don't matter (ffastmath). Misc/mandel will like this. :) This isn't
				27	safe in general, even on darwin. See the libm implementation of hypot for
				28	examples (which special case when x/y are exactly zero to get signed zeros etc
				29	right).
Chris Lattner	086c014	2006-02-03 06:21:43 +0000	[diff] [blame]	30
Chris Lattner	086c014	2006-02-03 06:21:43 +0000	[diff] [blame]	31	//===---------------------------------------------------------------------===//
				32
				33	Solve this DAG isel folding deficiency:
				34
				35	int X, Y;
				36
				37	void fn1(void)
				38	{
				39	  X = X | (Y << 3);
				40	}
				41
				42	compiles to
				43
				44	fn1:
				45	movl Y, %eax
				46	shll $3, %eax
				47	orl X, %eax
				48	movl %eax, X
				49	ret
				50
				51	The problem is the store's chain operand is not the load X but rather
				52	a TokenFactor of the load X and load Y, which prevents the folding.
				53
				54	There are two ways to fix this:
				55
				56	1. The dag combiner can start using alias analysis to realize that X and Y
				57	don't alias, making the store to X not dependent on the load from Y.
				58	2. The generated isel could be made smarter in the case it can't
				59	disambiguate the pointers.
				60
				61	Number 1 is the preferred solution.
				62
Evan Cheng	e617b08	2006-03-13 23:19:10 +0000	[diff] [blame]	63	This has been "fixed" by a TableGen hack. But that is a short term workaround
				64	which will be removed once the proper fix is made.
				65
Chris Lattner	086c014	2006-02-03 06:21:43 +0000	[diff] [blame]	66	//===---------------------------------------------------------------------===//
				67
Chris Lattner	b27b69f	2006-03-04 01:19:34 +0000	[diff] [blame]	68	On targets with expensive 64-bit multiply, we could LSR this:
				69
				70	for (i = ...; ++i) {
				71	x = 1ULL << i;
				72
				73	into:
				74	long long tmp = 1;
				75	for (i = ...; ++i, tmp+=tmp)
				76	x = tmp;
				77
				78	This would be a win on ppc32, but not x86 or ppc64.
				79
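A minimal C sketch of the rewrite above, useful for checking that the two
loops agree. The function names are made up for this example, and n is assumed
to be at most 64 so that the shift amount stays in range.

```c
#include <stdint.h>

/* Shift form: a variable 64-bit shift on every iteration. */
void fill_shift(uint64_t *out, unsigned n) {
  for (unsigned i = 0; i < n; ++i)
    out[i] = 1ULL << i;
}

/* Strength-reduced form: the shift becomes a value that is doubled each time. */
void fill_doubling(uint64_t *out, unsigned n) {
  uint64_t tmp = 1;
  for (unsigned i = 0; i < n; ++i, tmp += tmp)
    out[i] = tmp;
}

/* Returns how many of the first n (<= 64) entries differ between the forms. */
unsigned lsr_mismatches(unsigned n) {
  uint64_t a[64], b[64];
  unsigned bad = 0;
  fill_shift(a, n);
  fill_doubling(b, n);
  for (unsigned i = 0; i < n; ++i)
    bad += a[i] != b[i];
  return bad;
}
```

The doubling form needs only a 64-bit add per iteration, which is cheap to
expand on a 32-bit target (add plus add-with-carry).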
Chris Lattner	ad01993	2006-03-04 08:44:51 +0000	[diff] [blame]	80	//===---------------------------------------------------------------------===//
Chris Lattner	5b0fe7d	2006-03-05 20:00:08 +0000	[diff] [blame]	81
				82	Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)
				83
				84	//===---------------------------------------------------------------------===//
Chris Lattner	549f27d2	2006-03-07 02:46:26 +0000	[diff] [blame]	85
Chris Lattner	c20995e	2006-03-11 20:17:08 +0000	[diff] [blame]	86	Reassociate should turn: X*X*X*X -> t=(X*X) (t*t) to eliminate a multiply.
				87
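As a small illustration in C (hypothetical function names, unsigned arithmetic
so that wraparound is well defined), the two forms compute the same value with
three and two multiplies respectively:

```c
/* Three multiplies, evaluated left to right. */
unsigned pow4_naive(unsigned x) { return x * x * x * x; }

/* Two multiplies: square once, then square the result. */
unsigned pow4_reassoc(unsigned x) {
  unsigned t = x * x;
  return t * t;
}
```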
				88	//===---------------------------------------------------------------------===//
				89
Chris Lattner	74cfb7d	2006-03-11 20:20:40 +0000	[diff] [blame]	90	Interesting? testcase for add/shift/mul reassoc:
				91
				92	int bar(int x, int y) {
				93	  return x*x*x+y+x*x*x*x*x*y*y*y*y;
				94	}
				95	int foo(int z, int n) {
				96	  return bar(z, n) + bar(2*z, 2*n);
				97	}
				98
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	99	Reassociate should handle the example in GCC PR16157.
				100
Chris Lattner	74cfb7d	2006-03-11 20:20:40 +0000	[diff] [blame]	101	//===---------------------------------------------------------------------===//
				102
Chris Lattner	82c78b2	2006-03-09 20:13:21 +0000	[diff] [blame]	103	These two functions should generate the same code on big-endian systems:
				104
				105	int g(int *j, int *l) { return memcmp(j,l,4); }
				106	int h(int *j, int *l) { return *j - *l; }
				107
				108	this could be done in SelectionDAGISel.cpp, along with other special cases,
				109	for 1,2,4,8 bytes.
				110
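A hedged sketch of the 4-byte case: memcmp orders buffers by their bytes taken
as unsigned values, most significant first, which is the same as an unsigned
comparison of big-endian loads. Only the sign of the result is specified, so
the sketch compares signs. load_be32, cmp4 and sign_of are names invented for
this example.

```c
#include <stdint.h>
#include <string.h>

/* Assemble four bytes into a 32-bit value, most significant byte first. */
static uint32_t load_be32(const unsigned char *p) {
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns -1, 0 or 1, with the same sign as memcmp(a, b, 4). */
int cmp4(const void *a, const void *b) {
  uint32_t x = load_be32((const unsigned char *)a);
  uint32_t y = load_be32((const unsigned char *)b);
  return (x > y) - (x < y);
}

/* Collapse any integer to -1, 0 or 1 so two results can be compared. */
int sign_of(int v) { return (v > 0) - (v < 0); }
```

Note that a plain signed subtraction only matches this when the values are
treated as unsigned and the difference does not overflow.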
				111	//===---------------------------------------------------------------------===//
				112
Chris Lattner	c04b423	2006-03-22 07:33:46 +0000	[diff] [blame]	113	It would be nice to revert this patch:
				114	http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html
				115
				116	And teach the dag combiner enough to simplify the code expanded before
				117	legalize. It seems plausible that this knowledge would let it simplify other
				118	stuff too.
				119
Chris Lattner	e6cd96d	2006-03-24 19:59:17 +0000	[diff] [blame]	120	//===---------------------------------------------------------------------===//
				121
Reid Spencer	ac9dcb9	2007-02-15 03:39:18 +0000	[diff] [blame]	122	For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
Evan Cheng	67d3d4c	2006-03-31 22:35:14 +0000	[diff] [blame]	123	to the type size. It works but can be overly conservative as the alignment of
Reid Spencer	ac9dcb9	2007-02-15 03:39:18 +0000	[diff] [blame]	124	specific vector types is target dependent.
Chris Lattner	eaa7c06	2006-04-01 04:08:29 +0000	[diff] [blame]	125
				126	//===---------------------------------------------------------------------===//
				127
				128	We should add 'unaligned load/store' nodes, and produce them from code like
				129	this:
				130
				131	v4sf example(float *P) {
				132	return (v4sf){P[0], P[1], P[2], P[3] };
				133	}
				134
				135	//===---------------------------------------------------------------------===//
				136
Chris Lattner	16abfdf	2006-05-18 18:26:13 +0000	[diff] [blame]	137	Add support for conditional increments, and other related patterns. Instead
				138	of:
				139
				140	movl 136(%esp), %eax
				141	cmpl $0, %eax
				142	je LBB16_2 #cond_next
				143	LBB16_1: #cond_true
				144	incl _foo
				145	LBB16_2: #cond_next
				146
				147	emit:
				148	movl _foo, %eax
				149	cmpl $1, %edi
				150	sbbl $-1, %eax
				151	movl %eax, _foo
				152
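At the source level the same idea looks roughly like this (a sketch with
hypothetical names): the 0/1 result of the comparison is added directly, which
is what the cmp/sbb pair above computes without a branch.

```c
/* Branching form: the increment is guarded by a conditional jump. */
int inc_if_branch(int counter, int cond) {
  if (cond)
    counter++;
  return counter;
}

/* Branch-free form: add the 0/1 result of the comparison directly. */
int inc_if_branchless(int counter, int cond) {
  return counter + (cond != 0);
}
```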
				153	//===---------------------------------------------------------------------===//
Chris Lattner	870cf1b	2006-05-19 20:45:08 +0000	[diff] [blame]	154
				155	Combine: a = sin(x), b = cos(x) into a,b = sincos(x).
				156
				157	Expand these to calls of sin/cos and stores:
				158	void sincos(double x, double *sin, double *cos);
				159	void sincosf(float x, float *sin, float *cos);
				160	void sincosl(long double x, long double *sin, long double *cos);
				161
				162	Doing so could allow SROA of the destination pointers. See also:
				163	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687
				164
Chris Lattner	2dae65d	2008-12-10 01:30:48 +0000	[diff] [blame]	165	This is now easily doable with MRVs. We could even make an intrinsic for this
				166	if anyone cared enough about sincos.
				167
Chris Lattner	870cf1b	2006-05-19 20:45:08 +0000	[diff] [blame]	168	//===---------------------------------------------------------------------===//
Chris Lattner	f00f68a	2006-05-19 21:01:38 +0000	[diff] [blame]	169
				170	Scalar Repl cannot currently promote this testcase to 'ret long cst':
				171
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	172	%struct.X = type { i32, i32 }
Chris Lattner	f00f68a	2006-05-19 21:01:38 +0000	[diff] [blame]	173	%struct.Y = type { %struct.X }
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	174
				175	define i64 @bar() {
				176	%retval = alloca %struct.Y, align 8
				177	%tmp12 = getelementptr %struct.Y* %retval, i32 0, i32 0, i32 0
				178	store i32 0, i32* %tmp12
				179	%tmp15 = getelementptr %struct.Y* %retval, i32 0, i32 0, i32 1
				180	store i32 1, i32* %tmp15
				181	%retval.upgrd.1 = bitcast %struct.Y* %retval to i64*
				182	%retval.upgrd.2 = load i64* %retval.upgrd.1
				183	ret i64 %retval.upgrd.2
Chris Lattner	f00f68a	2006-05-19 21:01:38 +0000	[diff] [blame]	184	}
				185
				186	it should be extended to do so.
				187
				188	//===---------------------------------------------------------------------===//
Chris Lattner	e8263e6	2006-05-21 03:57:07 +0000	[diff] [blame]	189
Chris Lattner	a5546fb	2006-12-11 00:44:03 +0000	[diff] [blame]	190	-scalarrepl should promote this to be a vector scalar.
				191
				192	%struct..0anon = type { <4 x float> }
				193
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	194	define void @test1(<4 x float> %V, float* %P) {
Chris Lattner	a5546fb	2006-12-11 00:44:03 +0000	[diff] [blame]	195	%u = alloca %struct..0anon, align 16
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	196	%tmp = getelementptr %struct..0anon* %u, i32 0, i32 0
Chris Lattner	a5546fb	2006-12-11 00:44:03 +0000	[diff] [blame]	197	store <4 x float> %V, <4 x float>* %tmp
				198	%tmp1 = bitcast %struct..0anon* %u to [4 x float]*
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	199	%tmp.upgrd.1 = getelementptr [4 x float]* %tmp1, i32 0, i32 1
				200	%tmp.upgrd.2 = load float* %tmp.upgrd.1
				201	%tmp3 = mul float %tmp.upgrd.2, 2.000000e+00
Chris Lattner	a5546fb	2006-12-11 00:44:03 +0000	[diff] [blame]	202	store float %tmp3, float* %P
				203	ret void
				204	}
				205
				206	//===---------------------------------------------------------------------===//
				207
Chris Lattner	e8263e6	2006-05-21 03:57:07 +0000	[diff] [blame]	208	Turn this into a single byte store with no load (the other 3 bytes are
				209	unmodified):
				210
				211	void %test(uint* %P) {
				212	%tmp = load uint* %P
				213	%tmp14 = or uint %tmp, 3305111552
				214	%tmp15 = and uint %tmp14, 3321888767
				215	store uint %tmp15, uint* %P
				216	ret void
				217	}
				218
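The two constants are 0xC5000000 and 0xC5FFFFFF, so the or/and pair forces the
top byte to 0xC5 and leaves the low three bytes alone. A value-level sketch in
C (function names are invented for this example):

```c
#include <stdint.h>

/* Word form from the IR: or with 0xC5000000, then and with 0xC5FFFFFF. */
uint32_t update_word(uint32_t w) {
  return (w | 0xC5000000u) & 0xC5FFFFFFu;
}

/* Equivalent byte form: keep the low 24 bits, overwrite the top byte. */
uint32_t update_byte(uint32_t w) {
  return (w & 0x00FFFFFFu) | 0xC5000000u;
}
```

Since the new top byte does not depend on the old one, the load is not needed
at all; which memory byte to store to depends on the target's endianness.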
Chris Lattner	9e18ef5	2006-05-30 21:29:15 +0000	[diff] [blame]	219	//===---------------------------------------------------------------------===//
				220
				221	dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.
				222
				223	Compile:
				224
				225	int bar(int x)
				226	{
				227	int t = __builtin_clz(x);
				228	return -(t>>5);
				229	}
				230
				231	to:
				232
				233	_bar: addic r3,r3,-1
				234	subfe r3,r3,r3
				235	blr
				236
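Why the fold holds: for 32-bit x, clz(x) is in the range 0..32, and only the
value 32 (x == 0) has bit 5 set. The sketch below uses a portable clz32 that
returns 32 for zero; that convention is an assumption of this example, since
__builtin_clz(0) itself is undefined in C.

```c
#include <stdint.h>

/* Portable count-leading-zeros; returns 32 for x == 0 (assumed convention). */
unsigned clz32(uint32_t x) {
  unsigned n = 0;
  if (x == 0)
    return 32;
  while (!(x & 0x80000000u)) {
    x <<= 1;
    ++n;
  }
  return n;
}

/* 1 when x == 0, otherwise 0. */
unsigned is_zero_via_clz(uint32_t x) { return clz32(x) >> 5; }
```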
Chris Lattner	cbce2f6	2006-09-15 20:31:36 +0000	[diff] [blame]	237	//===---------------------------------------------------------------------===//
				238
				239	Legalize should lower cttz like this:
				240	cttz(x) = popcnt((x-1) & ~x)
				241
				242	on targets that have popcnt but not cttz. itanium, what else?
Chris Lattner	9e18ef5	2006-05-30 21:29:15 +0000	[diff] [blame]	243
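The identity can be checked with a few lines of C. Note that (x-1) & ~x keeps
exactly the zero bits below the lowest set bit, so the popcount of that mask is
the number of trailing zeros. The helper names are hypothetical.

```c
#include <stdint.h>

/* Reference population count, clearing one set bit per iteration. */
unsigned popcount32(uint32_t x) {
  unsigned c = 0;
  for (; x; x &= x - 1)
    ++c;
  return c;
}

/* Reference trailing-zero count; returns 32 for x == 0. */
unsigned cttz32_loop(uint32_t x) {
  unsigned n = 0;
  if (x == 0)
    return 32;
  while (!(x & 1u)) {
    x >>= 1;
    ++n;
  }
  return n;
}

/* The popcount-based lowering from the text. */
unsigned cttz32_popcnt(uint32_t x) { return popcount32((x - 1) & ~x); }
```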
Chris Lattner	7ed96ab	2006-09-16 23:57:51 +0000	[diff] [blame]	244	//===---------------------------------------------------------------------===//
				245
				246	quantum_sigma_x in 462.libquantum contains the following loop:
				247
				248	for(i=0; i<reg->size; i++)
				249	{
				250	/* Flip the target bit of each basis state */
				251	reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
				252	}
				253
				254	Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be just
				255	so cool to turn it into something like:
				256
Chris Lattner	b33a42a	2006-09-18 04:54:35 +0000	[diff] [blame]	257	long long Res = ((MAX_UNSIGNED) 1 << target);
Chris Lattner	7ed96ab	2006-09-16 23:57:51 +0000	[diff] [blame]	258	if (target < 32) {
				259	for(i=0; i<reg->size; i++)
Chris Lattner	b33a42a	2006-09-18 04:54:35 +0000	[diff] [blame]	260	reg->node[i].state ^= Res & 0xFFFFFFFFULL;
Chris Lattner	7ed96ab	2006-09-16 23:57:51 +0000	[diff] [blame]	261	} else {
				262	for(i=0; i<reg->size; i++)
Chris Lattner	b33a42a	2006-09-18 04:54:35 +0000	[diff] [blame]	263	reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
Chris Lattner	7ed96ab	2006-09-16 23:57:51 +0000	[diff] [blame]	264	}
				265
				266	... which would only do one 32-bit XOR per loop iteration instead of two.
				267
				268	It would also be nice to recognize that reg->size doesn't alias reg->node[i], but
				269	alas...
				270
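A sketch of the unswitched loop with the 64-bit state modeled as two 32-bit
halves, the way a 32-bit target would see it. The struct and function names are
invented for this example, and target is assumed to be below 64.

```c
#include <stdint.h>

/* A 64-bit state split into the two words a 32-bit target operates on. */
struct state64 { uint32_t lo, hi; };

/* Generic form: both halves are XORed on every iteration. */
void flip_generic(struct state64 *s, int n, unsigned target) {
  uint64_t mask = (uint64_t)1 << target;
  for (int i = 0; i < n; i++) {
    s[i].lo ^= (uint32_t)mask;
    s[i].hi ^= (uint32_t)(mask >> 32);
  }
}

/* Unswitched form: test target once, then touch only the half that changes. */
void flip_unswitched(struct state64 *s, int n, unsigned target) {
  if (target < 32) {
    uint32_t m = (uint32_t)1 << target;
    for (int i = 0; i < n; i++)
      s[i].lo ^= m;
  } else {
    uint32_t m = (uint32_t)1 << (target - 32);
    for (int i = 0; i < n; i++)
      s[i].hi ^= m;
  }
}
```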
				271	//===---------------------------------------------------------------------===//
Chris Lattner	fb981f3	2006-09-25 17:12:14 +0000	[diff] [blame]	272
Chris Lattner	b1ac769	2008-10-05 02:16:12 +0000	[diff] [blame]	273	This isn't recognized as bswap by instcombine (yes, it really is bswap):
Chris Lattner	f9bae43	2006-12-08 02:01:32 +0000	[diff] [blame]	274
				275	unsigned long reverse(unsigned v) {
				276	  unsigned t;
				277	  t = v ^ ((v << 16) | (v >> 16));
				278	  t &= ~0xff0000;
				279	  v = (v << 24) | (v >> 8);
				280	  return v ^ (t >> 8);
				281	}
				282
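To see that the idiom really is a byte swap, it can be compared against a
straightforward implementation. This sketch uses uint32_t so the width is
fixed; bswap32_ref is a name made up for this example.

```c
#include <stdint.h>

/* The idiom from the text, on a fixed-width type. */
uint32_t reverse_idiom(uint32_t v) {
  uint32_t t = v ^ ((v << 16) | (v >> 16));
  t &= ~(uint32_t)0xff0000;
  v = (v << 24) | (v >> 8);
  return v ^ (t >> 8);
}

/* Straightforward byte swap for comparison. */
uint32_t bswap32_ref(uint32_t v) {
  return (v << 24) | ((v & 0xff00u) << 8) | ((v >> 8) & 0xff00u) | (v >> 24);
}
```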
Chris Lattner	fb981f3	2006-09-25 17:12:14 +0000	[diff] [blame]	283	//===---------------------------------------------------------------------===//
				284
Chris Lattner	f4fee2a	2008-10-15 16:02:15 +0000	[diff] [blame]	285	These idioms should be recognized as popcount (see PR1488):
				286
				287	unsigned countbits_slow(unsigned v) {
				288	unsigned c;
				289	for (c = 0; v; v >>= 1)
				290	c += v & 1;
				291	return c;
				292	}
				293	unsigned countbits_fast(unsigned v){
				294	unsigned c;
				295	for (c = 0; v; c++)
				296	v &= v - 1; // clear the least significant bit set
				297	return c;
				298	}
				299
				300	typedef unsigned long long BITBOARD;
				301	int PopCnt(register BITBOARD a) {
				302	register int c=0;
				303	while(a) {
				304	c++;
				305	a &= a - 1;
				306	}
				307	return c;
				308	}
				309	unsigned int popcount(unsigned int input) {
				310	unsigned int count = 0;
				311	for (unsigned int i = 0; i < 4 * 8; i++)
				312	count += (input >> i) & 1;
				313	return count;
				314	}
				315
				316	//===---------------------------------------------------------------------===//
				317
Chris Lattner	fb981f3	2006-09-25 17:12:14 +0000	[diff] [blame]	318	These should turn into single 16-bit (unaligned?) loads on little/big endian
				319	processors.
				320
				321	unsigned short read_16_le(const unsigned char *adr) {
				322	  return adr[0] | (adr[1] << 8);
				323	}
				324	unsigned short read_16_be(const unsigned char *adr) {
				325	  return (adr[0] << 8) | adr[1];
				326	}
				327
				328	//===---------------------------------------------------------------------===//
Chris Lattner	cf10391	2006-10-24 16:12:47 +0000	[diff] [blame]	329
Reid Spencer	1628cec	2006-10-26 06:15:43 +0000	[diff] [blame]	330	-instcombine should handle this transform:
Reid Spencer	e4d87aa	2006-12-23 06:05:41 +0000	[diff] [blame]	331	icmp pred (sdiv X / C1 ), C2
Reid Spencer	1628cec	2006-10-26 06:15:43 +0000	[diff] [blame]	332	when X, C1, and C2 are unsigned. Similarly for udiv and signed operands.
				333
				334	Currently InstCombine avoids this transform but will do it when the signs of
				335	the operands and the sign of the divide match. See the FIXME in
				336	InstructionCombining.cpp in the visitSetCondInst method after the switch case
				337	for Instruction::UDiv (around line 4447) for more details.
				338
				339	The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
				340	this construct.
Chris Lattner	d7c628d	2006-11-03 22:27:39 +0000	[diff] [blame]	341
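One way such a fold can look for the unsigned equality case is a range check:
X / C1 == C2 exactly when C1*C2 <= X < C1*(C2+1). This is an illustrative
sketch, not necessarily what InstCombine does; the function names are
hypothetical, and it assumes C1 != 0 and that C1*(C2+1) does not overflow.

```c
/* Original comparison: divide, then compare for equality. */
int div_eq(unsigned x, unsigned c1, unsigned c2) { return x / c1 == c2; }

/* Range form: one subtract and one unsigned compare, no divide. */
int div_eq_range(unsigned x, unsigned c1, unsigned c2) {
  unsigned lo = c1 * c2;
  return x - lo < c1; /* wraps to a large value when x < lo */
}

/* Count disagreements between the two forms for x in [0, limit). */
int count_mismatches(unsigned c1, unsigned c2, unsigned limit) {
  int bad = 0;
  for (unsigned x = 0; x < limit; ++x)
    bad += div_eq(x, c1, c2) != div_eq_range(x, c1, c2);
  return bad;
}
```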
				342	//===---------------------------------------------------------------------===//
				343
Chris Lattner	578d2df	2006-11-10 00:23:26 +0000	[diff] [blame]	344	viterbi speeds up significantly if the various "history" related copy loops
				345	are turned into memcpy calls at the source level. We need a "loops to memcpy"
				346	pass.
				347
				348	//===---------------------------------------------------------------------===//
Nick Lewycky	bf63734	2006-11-13 00:23:28 +0000	[diff] [blame]	349
Chris Lattner	03a6d96	2007-01-16 06:39:48 +0000	[diff] [blame]	350	Consider:
				351
				352	typedef unsigned U32;
				353	typedef unsigned long long U64;
				354	int test (U32 *inst, U64 *regs) {
				355	U64 effective_addr2;
				356	U32 temp = *inst;
				357	int r1 = (temp >> 20) & 0xf;
				358	int b2 = (temp >> 16) & 0xf;
				359	effective_addr2 = temp & 0xfff;
				360	if (b2) effective_addr2 += regs[b2];
				361	b2 = (temp >> 12) & 0xf;
				362	if (b2) effective_addr2 += regs[b2];
				363	effective_addr2 &= regs[4];
				364	if ((effective_addr2 & 3) == 0)
				365	return 1;
				366	return 0;
				367	}
				368
				369	Note that only the low 2 bits of effective_addr2 are used. On 32-bit systems,
				370	we don't eliminate the computation of the top half of effective_addr2 because
				371	we don't have whole-function selection dags. On x86, this means we use one
				372	extra register for the function when effective_addr2 is declared as U64 than
				373	when it is declared U32.
				374
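The reason the top half is dead: add and and never move information from high
bits down to low bits, so truncating every operand to 32 bits gives the same
low two bits. A reduced sketch (hypothetical names, with the register loads
passed in as plain arguments):

```c
#include <stdint.h>

/* Wide form: accumulate in 64 bits, then test the low two bits. */
int aligned_wide(uint32_t base, uint64_t r1, uint64_t r2, uint64_t mask) {
  uint64_t addr = base & 0xfff;
  addr += r1;
  addr += r2;
  addr &= mask;
  return (addr & 3) == 0;
}

/* Narrow form: the same computation on the low 32 bits only. */
int aligned_narrow(uint32_t base, uint64_t r1, uint64_t r2, uint64_t mask) {
  uint32_t addr = base & 0xfff;
  addr += (uint32_t)r1;
  addr += (uint32_t)r2;
  addr &= (uint32_t)mask;
  return (addr & 3) == 0;
}
```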
				375	//===---------------------------------------------------------------------===//
				376
Chris Lattner	36e37d2	2007-02-13 21:44:43 +0000	[diff] [blame]	377	Promote for i32 bswap can use i64 bswap + shr. Useful on targets with 64-bit
				378	regs and bswap, like itanium.
				379
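A small C sketch of the promotion (function names are made up): zero-extend the
32-bit value, swap all eight bytes, and shift the result down by 32.

```c
#include <stdint.h>

/* Plain 64-bit byte swap: bytes, then 16-bit pairs, then 32-bit halves. */
uint64_t bswap64(uint64_t v) {
  v = ((v & 0x00ff00ff00ff00ffULL) << 8) | ((v >> 8) & 0x00ff00ff00ff00ffULL);
  v = ((v & 0x0000ffff0000ffffULL) << 16) | ((v >> 16) & 0x0000ffff0000ffffULL);
  return (v << 32) | (v >> 32);
}

/* 32-bit swap built from the 64-bit one: the swapped bytes land in the
   upper half, so shift them back down. */
uint32_t bswap32_via64(uint32_t v) {
  return (uint32_t)(bswap64((uint64_t)v) >> 32);
}
```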
				380	//===---------------------------------------------------------------------===//
Chris Lattner	1a77a55	2007-03-24 06:01:32 +0000	[diff] [blame]	381
				382	LSR should know what GPR types a target has. This code:
				383
				384	volatile short X, Y; // globals
				385
				386	void foo(int N) {
				387	int i;
				388	for (i = 0; i < N; i++) { X = i; Y = i*4; }
				389	}
				390
				391	produces two identical IV's (after promotion) on PPC/ARM:
				392
				393	LBB1_1: @bb.preheader
				394	mov r3, #0
				395	mov r2, r3
				396	mov r1, r3
				397	LBB1_2: @bb
				398	ldr r12, LCPI1_0
				399	ldr r12, [r12]
				400	strh r2, [r12]
				401	ldr r12, LCPI1_1
				402	ldr r12, [r12]
				403	strh r3, [r12]
				404	add r1, r1, #1 <- [0,+,1]
				405	add r3, r3, #4
				406	add r2, r2, #1 <- [0,+,1]
				407	cmp r1, r0
				408	bne LBB1_2 @bb
				409
				410
				411	//===---------------------------------------------------------------------===//
				412
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	413	Tail call elim should be more aggressive, checking to see if the call is
				414	followed by an uncond branch to an exit block.
				415
				416	; This testcase is due to tail-duplication not wanting to copy the return
				417	; instruction into the terminating blocks because there was other code
				418	; optimized out of the function after the taildup happened.
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	419	; RUN: llvm-as < %s \| opt -tailcallelim \| llvm-dis \| not grep call
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	420
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	421	define i32 @t4(i32 %a) {
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	422	entry:
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	423	%tmp.1 = and i32 %a, 1 ; <i32> [#uses=1]
				424	%tmp.2 = icmp ne i32 %tmp.1, 0 ; <i1> [#uses=1]
				425	br i1 %tmp.2, label %then.0, label %else.0
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	426
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	427	then.0: ; preds = %entry
				428	%tmp.5 = add i32 %a, -1 ; <i32> [#uses=1]
				429	%tmp.3 = call i32 @t4( i32 %tmp.5 ) ; <i32> [#uses=1]
				430	br label %return
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	431
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	432	else.0: ; preds = %entry
				433	%tmp.7 = icmp ne i32 %a, 0 ; <i1> [#uses=1]
				434	br i1 %tmp.7, label %then.1, label %return
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	435
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	436	then.1: ; preds = %else.0
				437	%tmp.11 = add i32 %a, -2 ; <i32> [#uses=1]
				438	%tmp.9 = call i32 @t4( i32 %tmp.11 ) ; <i32> [#uses=1]
				439	br label %return
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	440
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	441	return: ; preds = %then.1, %else.0, %then.0
				442	%result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	443	[ %tmp.9, %then.1 ]
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	444	ret i32 %result.0
Chris Lattner	5e14b0d	2007-05-05 22:29:06 +0000	[diff] [blame]	445	}
Chris Lattner	f110a2b	2007-05-05 22:44:08 +0000	[diff] [blame]	446
				447	//===---------------------------------------------------------------------===//
				448
Chris Lattner	e1bb6ab	2007-10-03 06:10:59 +0000	[diff] [blame]	449	Tail recursion elimination is not transforming this function, because it is
				450	returning n, which fails the isDynamicConstant check in the accumulator
				451	recursion checks.
				452
				453	long long fib(const long long n) {
				454	switch(n) {
				455	case 0:
				456	case 1:
				457	return n;
				458	default:
				459	return fib(n-1) + fib(n-2);
				460	}
				461	}
				462
				463	//===---------------------------------------------------------------------===//
				464
Chris Lattner	c90b866	2008-08-10 00:47:21 +0000	[diff] [blame]	465	Tail recursion elimination should handle:
				466
				467	int pow2m1(int n) {
				468	if (n == 0)
				469	return 0;
				470	return 2 * pow2m1 (n - 1) + 1;
				471	}
				472
				473	Also, multiplies can be turned into SHL's, so they should be handled as if
				474	they were associative. "return foo() << 1" can be tail recursion eliminated.
				475
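A sketch of what the eliminated form could look like, using two accumulators
for the pending "2 * (...) + 1" wrappers. This illustrates the idea only and is
not the pass's actual output; pow2m1_loop is a hypothetical name, and n is
assumed to be in the range 0..30 so nothing overflows.

```c
/* Recursive form from the text. */
int pow2m1(int n) {
  if (n == 0)
    return 0;
  return 2 * pow2m1(n - 1) + 1;
}

/* Loop form. Invariant: original result == mul * pow2m1(n) + add. */
int pow2m1_loop(int n) {
  int mul = 1, add = 0;
  while (n != 0) {
    add += mul; /* the "+ 1" of this level, scaled by the outer multiplies */
    mul *= 2;   /* the "2 *" of this level */
    n--;
  }
  return add; /* base case contributes mul * 0 */
}
```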
				476	//===---------------------------------------------------------------------===//
				477
Chris Lattner	f110a2b	2007-05-05 22:44:08 +0000	[diff] [blame]	478	Argument promotion should promote arguments for recursive functions, like
				479	this:
				480
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	481	; RUN: llvm-as < %s \| opt -argpromotion \| llvm-dis \| grep x.val
Chris Lattner	f110a2b	2007-05-05 22:44:08 +0000	[diff] [blame]	482
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	483	define internal i32 @foo(i32* %x) {
Chris Lattner	f110a2b	2007-05-05 22:44:08 +0000	[diff] [blame]	484	entry:
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	485	%tmp = load i32* %x ; <i32> [#uses=0]
				486	%tmp.foo = call i32 @foo( i32* %x ) ; <i32> [#uses=1]
				487	ret i32 %tmp.foo
Chris Lattner	f110a2b	2007-05-05 22:44:08 +0000	[diff] [blame]	488	}
				489
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	490	define i32 @bar(i32* %x) {
Chris Lattner	f110a2b	2007-05-05 22:44:08 +0000	[diff] [blame]	491	entry:
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	492	%tmp3 = call i32 @foo( i32* %x ) ; <i32> [#uses=1]
				493	ret i32 %tmp3
Chris Lattner	f110a2b	2007-05-05 22:44:08 +0000	[diff] [blame]	494	}
				495
Chris Lattner	81f2d71	2007-12-05 23:05:06 +0000	[diff] [blame]	496	//===---------------------------------------------------------------------===//
Chris Lattner	166a268	2007-12-28 04:42:05 +0000	[diff] [blame]	497
				498	"basicaa" should know how to look through "or" instructions that act like add
				499	instructions. For example in this code, the x*4+1 is turned into x*4 | 1, and
				500	basicaa can't analyze the array subscript, leading to duplicated loads in the
				501	generated code:
				502
				503	void test(int X, int Y, int a[]) {
				504	int i;
				505	for (i=2; i<1000; i+=4) {
				506	a[i+0] = a[i-1+0]*a[i-2+0];
				507	a[i+1] = a[i-1+1]*a[i-2+1];
				508	a[i+2] = a[i-1+2]*a[i-2+2];
				509	a[i+3] = a[i-1+3]*a[i-2+3];
				510	}
				511	}
				512
Chris Lattner	2dae65d	2008-12-10 01:30:48 +0000	[diff] [blame]	513	BasicAA also doesn't do this for add. It needs to know that &A[i+1] != &A[i].
				514
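The or/add equivalence that basicaa would need to see through is easy to state
in C: a value multiplied by 4 has its two low bits clear, so OR-ing in 1 cannot
carry and gives the same result as adding 1. The names are hypothetical.

```c
/* Index as the optimizer rewrites it: low bits are known zero, so use OR. */
unsigned idx_or(unsigned x) { return (x * 4) | 1; }

/* Index as written in the source. */
unsigned idx_add(unsigned x) { return (x * 4) + 1; }
```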
Chris Lattner	a1643ba	2007-12-28 22:30:05 +0000	[diff] [blame]	515	//===---------------------------------------------------------------------===//
Chris Lattner	166a268	2007-12-28 04:42:05 +0000	[diff] [blame]	516
Chris Lattner	a1643ba	2007-12-28 22:30:05 +0000	[diff] [blame]	517	We should investigate an instruction sinking pass. Consider this silly
				518	example in pic mode:
				519
				520	#include <assert.h>
				521	void foo(int x) {
				522	assert(x);
				523	//...
				524	}
				525
				526	we compile this to:
				527	_foo:
				528	subl $28, %esp
				529	call "L1$pb"
				530	"L1$pb":
				531	popl %eax
				532	cmpl $0, 32(%esp)
				533	je LBB1_2 # cond_true
				534	LBB1_1: # return
				535	# ...
				536	addl $28, %esp
				537	ret
				538	LBB1_2: # cond_true
				539	...
				540
				541	The PIC base computation (call+popl) is only used on one path through the
				542	code, but is currently always computed in the entry block. It would be
				543	better to sink the picbase computation down into the block for the
				544	assertion, as it is the only one that uses it. This happens for a lot of
				545	code with early outs.
				546
Chris Lattner	92c06a0	2007-12-29 01:05:01 +0000	[diff] [blame]	547	Another example is loads of arguments, which are usually emitted into the
				548	entry block on targets like x86. If not used in all paths through a
				549	function, they should be sunk into the ones that do.
				550
Chris Lattner	a1643ba	2007-12-28 22:30:05 +0000	[diff] [blame]	551	In this case, whole-function-isel would also handle this.
Chris Lattner	166a268	2007-12-28 04:42:05 +0000	[diff] [blame]	552
				553	//===---------------------------------------------------------------------===//
Chris Lattner	b304194	2008-01-07 21:38:14 +0000	[diff] [blame]	554
				555	Investigate lowering of sparse switch statements into perfect hash tables:
				556	http://burtleburtle.net/bob/hash/perfect.html
				557
				558	//===---------------------------------------------------------------------===//
Chris Lattner	f61b63e	2008-01-09 00:17:57 +0000	[diff] [blame]	559
				560	We should turn things like "load+fabs+store" and "load+fneg+store" into the
				561	corresponding integer operations. On a yonah, this loop:
				562
				563	double a[256];
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	564	void foo() {
				565	int i, b;
				566	for (b = 0; b < 10000000; b++)
				567	for (i = 0; i < 256; i++)
				568	a[i] = -a[i];
				569	}
Chris Lattner	f61b63e	2008-01-09 00:17:57 +0000	[diff] [blame]	570
				571	is twice as slow as this loop:
				572
				573	long long a[256];
Chris Lattner	7c4e9a4	2008-02-18 18:46:39 +0000	[diff] [blame]	574	void foo() {
				575	int i, b;
				576	for (b = 0; b < 10000000; b++)
				577	for (i = 0; i < 256; i++)
				578	a[i] ^= (1ULL << 63);
				579	}
Chris Lattner	f61b63e	2008-01-09 00:17:57 +0000	[diff] [blame]	580
				581	and I suspect other processors are similar. On X86 in particular this is a
				582	big win because doing this with integers allows the use of read/modify/write
				583	instructions.
				584
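For reference, the integer forms of fneg and fabs can be written in C as below.
This is a sketch that assumes IEEE-754 doubles that are 64 bits wide; the
function names are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Negate a double by flipping its sign bit through an integer copy. */
double fneg_via_int(double d) {
  uint64_t bits;
  memcpy(&bits, &d, sizeof bits);
  bits ^= (uint64_t)1 << 63;
  memcpy(&d, &bits, sizeof bits);
  return d;
}

/* Absolute value by clearing the sign bit. */
double fabs_via_int(double d) {
  uint64_t bits;
  memcpy(&bits, &d, sizeof bits);
  bits &= ~((uint64_t)1 << 63);
  memcpy(&d, &bits, sizeof bits);
  return d;
}
```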
				585	//===---------------------------------------------------------------------===//
Chris Lattner	8372601	2008-01-10 18:25:41 +0000	[diff] [blame]	586
				587	DAG Combiner should try to combine small loads into larger loads when
				588	profitable. For example, we compile this C++ example:
				589
				590	struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
				591	extern THotKey m_HotKey;
				592	THotKey GetHotKey () { return m_HotKey; }
				593
				594	into (-O3 -fno-exceptions -static -fomit-frame-pointer):
				595
				596	__Z9GetHotKeyv:
				597	pushl %esi
				598	movl 8(%esp), %eax
				599	movb _m_HotKey+3, %cl
				600	movb _m_HotKey+4, %dl
				601	movb _m_HotKey+2, %ch
				602	movw _m_HotKey, %si
				603	movw %si, (%eax)
				604	movb %ch, 2(%eax)
				605	movb %cl, 3(%eax)
				606	movb %dl, 4(%eax)
				607	popl %esi
				608	ret $4
				609
				610	GCC produces:
				611
				612	__Z9GetHotKeyv:
				613	movl _m_HotKey, %edx
				614	movl 4(%esp), %eax
				615	movl %edx, (%eax)
				616	movzwl _m_HotKey+4, %edx
				617	movw %dx, 4(%eax)
				618	ret $4
				619
				620	The LLVM IR contains the needed alignment info, so we should be able to
				621	merge the loads and stores into 4-byte loads:
				622
				623	%struct.THotKey = type { i16, i8, i8, i8 }
				624	define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
				625	...
				626	%tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
				627	%tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
				628	%tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
				629	%tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2
				630
				631	Alternatively, we should use a small amount of base-offset alias analysis
				632	to make it so the scheduler doesn't need to hold all the loads in regs at
				633	once.
				634
				635	//===---------------------------------------------------------------------===//
Chris Lattner	497b7e9	2008-01-11 06:17:47 +0000	[diff] [blame]	636
				637	We should extend parameter attributes to capture more information about
				638	pointer parameters for alias analysis. Some ideas:
				639
				640	1. Add a "nocapture" attribute, which indicates that the callee does not store
				641	the address of the parameter into a global or any other memory location
				642	visible to the callee. This can be used to make basicaa and other analyses
				643	more powerful. It is true for things like memcpy, strcat, and many other
				644	things, including structs passed by value, most C++ references, etc.
				645	2. Generalize readonly to be set on parameters. This is important mod/ref
				646	info for the function, which is important for basicaa and others. It can
				647	also be used by the inliner to avoid inserting a memcpy for byval
				648	arguments when the function is inlined.
				649
				650	These attributes can be inferred by various analysis passes such as the
Chris Lattner	65844fb	2008-01-12 18:58:46 +0000	[diff] [blame]	651	globalsmodrefaa pass. Note that getting #2 right is actually really tricky.
				652	Consider this code:
				653
				654	struct S; S G;
				655	void caller(S byvalarg) { G.field = 1; ... }
				656	void callee() { caller(G); }
				657
				658	The fact that the caller does not modify byval arg is not enough; we need
				659	to know that it doesn't modify G either. This is very tricky.
Chris Lattner	497b7e9	2008-01-11 06:17:47 +0000	[diff] [blame]	660
				661	//===---------------------------------------------------------------------===//
Nate Begeman	e9fe65c	2008-02-18 18:39:23 +0000	[diff] [blame]	662
				663	We should add an FRINT node to the DAG to model targets that have legal
				664	implementations of ceil/floor/rint.
Chris Lattner	48840f8	2008-02-28 05:34:27 +0000	[diff] [blame]	665
				666	//===---------------------------------------------------------------------===//
				667
Chris Lattner	e29536c	2008-02-28 17:21:27 +0000	[diff] [blame]	668	This GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34043
				669	contains a testcase that compiles down to:
				670
				671	%struct.XMM128 = type { <4 x float> }
				672	..
				673	%src = alloca %struct.XMM128
				674	..
				675	%tmp6263 = bitcast %struct.XMM128* %src to <2 x i64>*
				676	%tmp65 = getelementptr %struct.XMM128* %src, i32 0, i32 0
				677	store <2 x i64> %tmp5899, <2 x i64>* %tmp6263, align 16
				678	%tmp66 = load <4 x float>* %tmp65, align 16
				679	%tmp71 = add <4 x float> %tmp66, %tmp66
				680
				681	If the mid-level optimizer turned the bitcast of pointer + store of tmp5899
				682	into a bitcast of the vector value and a store to the pointer, then the
				683	store->load could be easily removed.
				684
				685	//===---------------------------------------------------------------------===//
				686
Chris Lattner	48840f8	2008-02-28 05:34:27 +0000	[diff] [blame]	687	Consider:
				688
				689	int test() {
				690	long long input[8] = {1,1,1,1,1,1,1,1};
				691	foo(input);
				692	}
				693
				694	We currently compile this into a memcpy from a global array since the
				695	initializer is fairly large and not memset'able. This is good, but the memcpy
				696	gets lowered to load/stores in the code generator. This is also ok, except
				697	that the codegen lowering for memcpy doesn't handle the case when the source
				698	is a constant global. This gives us atrocious code like this:
				699
				700	call "L1$pb"
				701	"L1$pb":
				702	popl %eax
				703	movl _C.0.1444-"L1$pb"+32(%eax), %ecx
				704	movl %ecx, 40(%esp)
				705	movl _C.0.1444-"L1$pb"+20(%eax), %ecx
				706	movl %ecx, 28(%esp)
				707	movl _C.0.1444-"L1$pb"+36(%eax), %ecx
				708	movl %ecx, 44(%esp)
				709	movl _C.0.1444-"L1$pb"+44(%eax), %ecx
				710	movl %ecx, 52(%esp)
				711	movl _C.0.1444-"L1$pb"+40(%eax), %ecx
				712	movl %ecx, 48(%esp)
				713	movl _C.0.1444-"L1$pb"+12(%eax), %ecx
				714	movl %ecx, 20(%esp)
				715	movl _C.0.1444-"L1$pb"+4(%eax), %ecx
				716	...
				717
				718	instead of:
				719	movl $1, 16(%esp)
				720	movl $0, 20(%esp)
				721	movl $1, 24(%esp)
				722	movl $0, 28(%esp)
				723	movl $1, 32(%esp)
				724	movl $0, 36(%esp)
				725	...
				726
				727	//===---------------------------------------------------------------------===//
Chris Lattner	a11deb0	2008-03-02 02:51:40 +0000	[diff] [blame]	728
				729	http://llvm.org/PR717:
				730
				731	The following code should compile into "ret int undef". Instead, LLVM
				732	produces "ret int 0":
				733
				734	int f() {
				735	int x = 4;
				736	int y;
				737	if (x == 3) y = 0;
				738	return y;
				739	}
				740
				741	//===---------------------------------------------------------------------===//
Chris Lattner	53b7277	2008-03-02 19:29:42 +0000	[diff] [blame]	742
				743	The loop unroller should partially unroll loops (instead of peeling them)
				744	when code growth isn't too bad and when an unroll count allows simplification
				745	of some code within the loop. One trivial example is:
				746
				747	#include <stdio.h>
				748	int main() {
				749	int nRet = 17;
				750	int nLoop;
				751	for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
				752	if ( nLoop & 1 )
				753	nRet += 2;
				754	else
				755	nRet -= 1;
				756	}
				757	return nRet;
				758	}
				759
				760	Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
				761	reduction in code size. The resultant code would then also be suitable for
				762	exit value computation.
				763
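A hand-unrolled sketch of the idea, not what the unroller emits today. The
function names are made up for illustration. After unrolling by 2, nLoop is
even at the top of every iteration, so each copy of the body keeps exactly one
arm of the branch:

```c
#include <assert.h>

/* The loop from the example above, unchanged. */
static int rolled(void) {
  int nRet = 17;
  for (int nLoop = 0; nLoop < 1000; nLoop++) {
    if (nLoop & 1)
      nRet += 2;
    else
      nRet -= 1;
  }
  return nRet;
}

/* Unrolled by 2: the '&1' test is known in each copy and disappears. */
static int unrolled_by_2(void) {
  int nRet = 17;
  for (int nLoop = 0; nLoop < 1000; nLoop += 2) {
    nRet -= 1; /* copy for even nLoop */
    nRet += 2; /* copy for odd nLoop + 1 */
  }
  return nRet;
}

/* Returns 1 when both forms agree. */
static int check_unroll(void) {
  assert(rolled() == unrolled_by_2());
  return 1;
}
```

Each unrolled iteration adds 1, which is what makes the exit value (17 + 500)
computable without running the loop.
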
				764	//===---------------------------------------------------------------------===//
Chris Lattner	349155b	2008-03-17 01:47:51 +0000	[diff] [blame]	765
				766	We miss a bunch of rotate opportunities on various targets, including ppc, x86,
				767	etc. On X86, we miss a bunch of 'rotate by variable' cases because the rotate
				768	matching code in dag combine doesn't look through truncates aggressively
				769	enough. Here are some testcases reduced from GCC PR17886:
				770
				771	unsigned long long f(unsigned long long x, int y) {
				772	return (x << y) | (x >> 64-y);
				773	}
				774	unsigned f2(unsigned x, int y){
				775	return (x << y) | (x >> 32-y);
				776	}
				777	unsigned long long f3(unsigned long long x){
				778	int y = 9;
				779	return (x << y) | (x >> 64-y);
				780	}
				781	unsigned f4(unsigned x){
				782	int y = 10;
				783	return (x << y) | (x >> 32-y);
				784	}
				785	unsigned long long f5(unsigned long long x, unsigned long long y) {
				786	return (x << 8) | ((y >> 48) & 0xffull);
				787	}
				788	unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
				789	switch(z) {
				790	case 1:
				791	return (x << 8) | ((y >> 48) & 0xffull);
				792	case 2:
				793	return (x << 16) | ((y >> 40) & 0xffffull);
				794	case 3:
				795	return (x << 24) | ((y >> 32) & 0xffffffull);
				796	case 4:
				797	return (x << 32) | ((y >> 24) & 0xffffffffull);
				798	default:
				799	return (x << 40) | ((y >> 16) & 0xffffffffffull);
				800	}
				801	}
				802
Dan Gohman	cb747c5	2008-10-17 21:39:27 +0000	[diff] [blame]	803	On X86-64, we only handle f2/f3/f4 right. On x86-32, a few of these
Chris Lattner	349155b	2008-03-17 01:47:51 +0000	[diff] [blame]	804	generate truly horrible code, instead of using shld and friends. On
				805	ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
				806	badness. PPC64 misses f, f5 and f6. CellSPU aborts in isel.
				807
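For reference, a sketch of the 32-bit rotate these idioms are meant to become.
The masked form rotl32 is an assumption of this sketch, not something from the
PR; it is also defined for y == 0, where the f2 form shifts by 32:

```c
#include <assert.h>
#include <stdint.h>

/* f2 from the testcases above; only defined for 1 <= y <= 31. */
static uint32_t f2(uint32_t x, int y) {
  return (x << y) | (x >> (32 - y));
}

/* Masked rotate-left; both shift amounts stay in 0..31. */
static uint32_t rotl32(uint32_t x, unsigned y) {
  return (x << (y & 31)) | (x >> ((32 - y) & 31));
}

/* Returns 1 when the two forms agree wherever f2 is defined. */
static int check_rotl32(void) {
  const uint32_t samples[] = {0u, 1u, 0x80000001u, 0xdeadbeefu, 0xffffffffu};
  for (unsigned s = 0; s < sizeof samples / sizeof samples[0]; s++) {
    assert(rotl32(samples[s], 0) == samples[s]);
    for (int y = 1; y < 32; y++)
      assert(rotl32(samples[s], (unsigned)y) == f2(samples[s], y));
  }
  return 1;
}
```
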
				808	//===---------------------------------------------------------------------===//
Chris Lattner	f70107f	2008-03-20 04:46:13 +0000	[diff] [blame]	809
				810	We do a number of simplifications in simplify libcalls to strength reduce
				811	standard library functions, but we don't currently merge them together. For
				812	example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy. This can only
				813	be done safely if "b" isn't modified between the strlen and memcpy of course.
				814
				815	//===---------------------------------------------------------------------===//
				816
Chris Lattner	b578310	2008-05-17 15:37:38 +0000	[diff] [blame]	817	We should be able to evaluate this loop:
				818
				819	int test(int x_offs) {
				820	while (x_offs > 4)
				821	x_offs -= 4;
				822	return x_offs;
				823	}
				824
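One closed form the loop could be evaluated to, as a sketch. The name
test_closed_form is made up, and this is not a claim about what the exit value
computation would actually produce:

```c
#include <assert.h>

/* The loop from the example above. */
static int test(int x_offs) {
  while (x_offs > 4)
    x_offs -= 4;
  return x_offs;
}

/* Loop-free equivalent: values above 4 land in 1..4 with the same
   residue mod 4; values at or below 4 are returned unchanged. */
static int test_closed_form(int x_offs) {
  if (x_offs <= 4)
    return x_offs;
  return (x_offs - 1) % 4 + 1;
}

/* Returns 1 when both forms agree on a range of inputs. */
static int check_closed_form(void) {
  for (int x = -16; x <= 5000; x++)
    assert(test(x) == test_closed_form(x));
  return 1;
}
```
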
				825	//===---------------------------------------------------------------------===//
Chris Lattner	10c5d36	2008-07-14 00:19:59 +0000	[diff] [blame]	826
				827	Reassociate should turn things like:
				828
				829	int factorial(int X) {
				830	return X*X*X*X*X*X*X*X;
				831	}
				832
				833	into llvm.powi calls, allowing the code generator to produce balanced
				834	multiplication trees.
				835
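A sketch of what a balanced tree buys, assuming the product above is the
eighth power of X. It uses unsigned arithmetic (an assumption of the sketch) so
that wraparound is defined and the two forms can be compared exactly:

```c
#include <assert.h>

/* Left-to-right chain: seven dependent multiplies. */
static unsigned pow8_chain(unsigned X) {
  return X * X * X * X * X * X * X * X;
}

/* Balanced tree: three multiplies, each a squaring. */
static unsigned pow8_balanced(unsigned X) {
  unsigned X2 = X * X;
  unsigned X4 = X2 * X2;
  return X4 * X4;
}

/* Returns 1 when both forms agree, including on wrapping inputs. */
static int check_pow8(void) {
  for (unsigned X = 0; X < 2000u; X++)
    assert(pow8_chain(X) == pow8_balanced(X));
  return 1;
}
```
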
				836	//===---------------------------------------------------------------------===//
				837
Chris Lattner	26e150f	2008-08-10 01:14:08 +0000	[diff] [blame]	838	We generate a horrible libcall for llvm.powi. For example, we compile:
				839
				840	#include <cmath>
				841	double f(double a) { return std::pow(a, 4); }
				842
				843	into:
				844
				845	__Z1fd:
				846	subl $12, %esp
				847	movsd 16(%esp), %xmm0
				848	movsd %xmm0, (%esp)
				849	movl $4, 8(%esp)
				850	call L___powidf2$stub
				851	addl $12, %esp
				852	ret
				853
				854	GCC produces:
				855
				856	__Z1fd:
				857	subl $12, %esp
				858	movsd 16(%esp), %xmm0
				859	mulsd %xmm0, %xmm0
				860	mulsd %xmm0, %xmm0
				861	movsd %xmm0, (%esp)
				862	fldl (%esp)
				863	addl $12, %esp
				864	ret
				865
				866	//===---------------------------------------------------------------------===//
				867
				868	We compile this program: (from GCC PR11680)
				869	http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487
				870
				871	Into code that runs the same speed in fast/slow modes, but both modes run 2x
				872	slower than when compiled with GCC (either 4.0 or 4.2):
				873
				874	$ llvm-g++ perf.cpp -O3 -fno-exceptions
				875	$ time ./a.out fast
				876	1.821u 0.003s 0:01.82 100.0% 0+0k 0+0io 0pf+0w
				877
				878	$ g++ perf.cpp -O3 -fno-exceptions
				879	$ time ./a.out fast
				880	0.821u 0.001s 0:00.82 100.0% 0+0k 0+0io 0pf+0w
				881
				882	It looks like we are making the same inlining decisions, so this may be raw
				883	codegen badness or something else (haven't investigated).
				884
				885	//===---------------------------------------------------------------------===//
				886
				887	We miss some instcombines for stuff like this:
				888	void bar (void);
				889	void foo (unsigned int a) {
				890	/* This one is equivalent to a >= (3 << 2). */
				891	if ((a >> 2) >= 3)
				892	bar ();
				893	}
				894
				895	A few other related ones are in GCC PR14753.
				896
				897	//===---------------------------------------------------------------------===//
				898
				899	Divisibility by constant can be simplified (according to GCC PR12849) from
				900	being a mulhi to being a mul lo (cheaper). Testcase:
				901
				902	void bar(unsigned n) {
				903	if (n % 3 == 0)
				904	true();
				905	}
				906
				907	I think this basically amounts to a dag combine to simplify comparisons against
				908	multiply hi's into a comparison against the mullo.
				909
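A sketch of the mul-lo form for the n % 3 case, using the standard
multiplicative-inverse test. The constants assume 32-bit unsigned arithmetic:
0xAAAAAAAB is the inverse of 3 modulo 2^32 and 0x55555555 is 0xFFFFFFFF / 3.
The function names are made up:

```c
#include <assert.h>
#include <stdint.h>

/* Reference form, as in the testcase above. */
static int divisible_by_3_rem(uint32_t n) {
  return n % 3 == 0;
}

/* Low-half multiply plus one unsigned compare; no mulhi needed. */
static int divisible_by_3_mullo(uint32_t n) {
  uint32_t lo = (uint32_t)(n * UINT32_C(0xAAAAAAAB));
  return lo <= UINT32_C(0x55555555);
}

/* Returns 1 when both forms agree at both ends of the 32-bit range. */
static int check_divisible_by_3(void) {
  for (uint32_t k = 0; k < 100000u; k++) {
    uint32_t hi = UINT32_MAX - k;
    assert(divisible_by_3_rem(k) == divisible_by_3_mullo(k));
    assert(divisible_by_3_rem(hi) == divisible_by_3_mullo(hi));
  }
  return 1;
}
```
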
				910	//===---------------------------------------------------------------------===//
Chris Lattner	23f35bc	2008-08-19 06:22:16 +0000	[diff] [blame]	911
				912	SROA is not promoting the union on the stack in this example; we should end
				913	up with no allocas.
				914
				915	union vec2d {
				916	double e[2];
				917	double v __attribute__((vector_size(16)));
				918	};
				919	typedef union vec2d vec2d;
				920
				921	static vec2d a={{1,2}}, b={{3,4}};
				922
				923	vec2d foo () {
				924	return (vec2d){ .v = a.v + b.v * (vec2d){{5,5}}.v };
				925	}
				926
				927	//===---------------------------------------------------------------------===//
Chris Lattner	b7fe708	2008-10-15 05:53:25 +0000	[diff] [blame]	928
Chris Lattner	db03983	2008-10-15 16:06:03 +0000	[diff] [blame]	929	Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
				930	bunch of other stuff from this example (see PR1604):
				931
				932	#include <cstdio>
				933	struct test {
				934	int val;
				935	virtual ~test() {}
				936	};
				937
				938	int main() {
				939	test t;
				940	std::scanf("%d", &t.val);
				941	std::printf("%d\n", t.val);
				942	}
				943
				944	//===---------------------------------------------------------------------===//
				945
Chris Lattner	3b364cb	2008-10-15 16:33:52 +0000	[diff] [blame]	946	Instcombine will merge comparisons like (x >= 10) && (x < 20) by producing (x -
				947	10) u< 10, but only when the comparisons have matching sign.
				948
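The matching-sign case written out in C, as a sketch with made-up function
names. Subtracting the lower bound in unsigned arithmetic wraps everything
below 10 to a huge value, so one unsigned compare covers both bounds:

```c
#include <assert.h>
#include <limits.h>

/* Two signed comparisons, as in the source form. */
static int in_range_two_compares(int x) {
  return x >= 10 && x < 20;
}

/* One unsigned comparison: (x - 10) u< 10. */
static int in_range_one_compare(int x) {
  return (unsigned)x - 10u < 10u;
}

/* Returns 1 when both forms agree, including at the int extremes. */
static int check_range_merge(void) {
  for (int x = -1000; x <= 1000; x++)
    assert(in_range_two_compares(x) == in_range_one_compare(x));
  assert(in_range_one_compare(INT_MIN) == 0);
  assert(in_range_one_compare(INT_MAX) == 0);
  return 1;
}
```
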
				949	This could be converted with a similar technique. (PR1941)
				950
				951	define i1 @test(i8 %x) {
				952	%A = icmp uge i8 %x, 5
				953	%B = icmp slt i8 %x, 20
				954	%C = and i1 %A, %B
				955	ret i1 %C
				956	}
				957
				958	//===---------------------------------------------------------------------===//
Nick Lewycky	df563ca	2008-11-27 22:12:22 +0000	[diff] [blame]	959
Nick Lewycky	d2f0db1	2008-11-27 22:41:45 +0000	[diff] [blame]	960	These functions perform the same computation, but produce different assembly.
Nick Lewycky	df563ca	2008-11-27 22:12:22 +0000	[diff] [blame]	961
				962	define i8 @select(i8 %x) readnone nounwind {
				963	%A = icmp ult i8 %x, 250
				964	%B = select i1 %A, i8 0, i8 1
				965	ret i8 %B
				966	}
				967
				968	define i8 @addshr(i8 %x) readnone nounwind {
				969	%A = zext i8 %x to i9
				970	%B = add i9 %A, 6 ;; 256 - 250 == 6
				971	%C = lshr i9 %B, 8
				972	%D = trunc i9 %C to i8
				973	ret i8 %D
				974	}
				975
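The same two computations in C, checked over all 256 inputs. This is a sketch:
the i9 arithmetic is modelled with unsigned int, which is an assumption of the
sketch rather than anything in the IR above.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors @select: 0 below 250, otherwise 1. */
static uint8_t select_form(uint8_t x) {
  return x < 250 ? 0 : 1;
}

/* Mirrors @addshr: the carry out of bit 7 of x + 6 is the answer. */
static uint8_t addshr_form(uint8_t x) {
  return (uint8_t)(((unsigned)x + 6u) >> 8);
}

/* Returns 1 when the two forms agree for every 8-bit input. */
static int check_select_addshr(void) {
  for (unsigned v = 0; v < 256u; v++)
    assert(select_form((uint8_t)v) == addshr_form((uint8_t)v));
  return 1;
}
```
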
				976	//===---------------------------------------------------------------------===//
Eli Friedman	4e16b29	2008-11-30 07:36:04 +0000	[diff] [blame]	977
				978	From gcc bug 24696:
				979	int
				980	f (unsigned long a, unsigned long b, unsigned long c)
				981	{
				982	return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
				983	}
				984	int
				985	f (unsigned long a, unsigned long b, unsigned long c)
				986	{
				987	return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
				988	}
				989	Both should combine to ((a|b) & (c-1)) != 0. Currently not optimized with
				990	"clang -emit-llvm-bc | opt -std-compile-opts".
				991
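A quick exhaustive check of the target form over small inputs, as a sketch.
The three function names are made up so that all versions can live in one
file:

```c
#include <assert.h>

/* First source form: logical or of two masked tests. */
static int f_logical_or(unsigned long a, unsigned long b, unsigned long c) {
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}

/* Second source form: bitwise or of two masked tests. */
static int f_bitwise_or(unsigned long a, unsigned long b, unsigned long c) {
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}

/* Target form: one mask, one compare. */
static int f_combined(unsigned long a, unsigned long b, unsigned long c) {
  return ((a | b) & (c - 1)) != 0;
}

/* Returns 1 when all three agree; c == 0 makes the mask all ones. */
static int check_combined(void) {
  for (unsigned long a = 0; a < 16; a++)
    for (unsigned long b = 0; b < 16; b++)
      for (unsigned long c = 0; c < 18; c++) {
        assert(f_logical_or(a, b, c) == f_combined(a, b, c));
        assert(f_bitwise_or(a, b, c) == f_combined(a, b, c));
      }
  return 1;
}
```
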
				992	//===---------------------------------------------------------------------===//
				993
				994	From GCC Bug 20192:
				995	#define PMD_MASK (~((1UL << 23) - 1))
				996	void clear_pmd_range(unsigned long start, unsigned long end)
				997	{
				998	if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
				999	f();
				1000	}
				1001	The expression should optimize to something like
				1002	"!((start|end)&~PMD_MASK)". Currently not optimized with "clang
				1003	-emit-llvm-bc | opt -std-compile-opts".
				1004
				1005	//===---------------------------------------------------------------------===//
				1006
				1007	From GCC Bug 15241:
				1008	unsigned int
				1009	foo (unsigned int a, unsigned int b)
				1010	{
				1011	if (a <= 7 && b <= 7)
				1012	baz ();
				1013	}
				1014	Should combine to "(a|b) <= 7". Currently not optimized with "clang
				1015	-emit-llvm-bc | opt -std-compile-opts".
				1016
				1017	//===---------------------------------------------------------------------===//
				1018
				1019	From GCC Bug 3756:
				1020	int
				1021	pn (int n)
				1022	{
				1023	return (n >= 0 ? 1 : -1);
				1024	}
				1025	Should combine to (n >> 31) | 1. Currently not optimized with "clang
				1026	-emit-llvm-bc | opt -std-compile-opts | llc".
				1027
				1028	//===---------------------------------------------------------------------===//
				1029
				1030	From GCC Bug 28685:
				1031	int test(int a, int b)
				1032	{
				1033	int lt = a < b;
				1034	int eq = a == b;
				1035
				1036	return (lt || eq);
				1037	}
				1038	Should combine to "a <= b". Currently not optimized with "clang
				1039	-emit-llvm-bc | opt -std-compile-opts | llc".
				1040
				1041	//===---------------------------------------------------------------------===//
				1042
				1043	void a(int variable)
				1044	{
				1045	if (variable == 4 || variable == 6)
				1046	bar();
				1047	}
				1048	This should optimize to "if ((variable | 2) == 6)". Currently not
				1049	optimized with "clang -emit-llvm-bc | opt -std-compile-opts | llc".
				1050
				1051	//===---------------------------------------------------------------------===//
				1052
				1053	unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return
				1054	i;}
				1055	unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
				1056	These should combine to the same thing. Currently, the first function
				1057	produces better code on X86.
				1058
				1059	//===---------------------------------------------------------------------===//
				1060
Eli Friedman	4e16b29	2008-11-30 07:36:04 +0000	[diff] [blame]	1061	From GCC Bug 15784:
				1062	#define abs(x) x>0?x:-x
				1063	int f(int x, int y)
				1064	{
				1065	return (abs(x)) >= 0;
				1066	}
				1067	This should optimize to x != INT_MIN. (With -fwrapv.) Currently not
				1068	optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
				1069
				1070	//===---------------------------------------------------------------------===//
				1071
				1072	From GCC Bug 14753:
				1073	void
				1074	rotate_cst (unsigned int a)
				1075	{
				1076	a = (a << 10) | (a >> 22);
				1077	if (a == 123)
				1078	bar ();
				1079	}
				1080	void
				1081	minus_cst (unsigned int a)
				1082	{
				1083	unsigned int tem;
				1084
				1085	tem = 20 - a;
				1086	if (tem == 5)
				1087	bar ();
				1088	}
				1089	void
				1090	mask_gt (unsigned int a)
				1091	{
				1092	/* This is equivalent to a > 15. */
				1093	if ((a & ~7) > 8)
				1094	bar ();
				1095	}
				1096	void
				1097	rshift_gt (unsigned int a)
				1098	{
				1099	/* This is equivalent to a > 23. */
				1100	if ((a >> 2) > 5)
				1101	bar ();
				1102	}
				1103	All should simplify to a single comparison. All of these are
				1104	currently not optimized with "clang -emit-llvm-bc | opt
				1105	-std-compile-opts".
				1106
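The single comparisons these four should become, as a sketch. The functions
return the condition instead of calling bar(), and the *_single names are made
up; both are assumptions of the sketch. The rotate constant is 123 rotated
right by 10:

```c
#include <assert.h>
#include <stdint.h>

static int rotate_cst(uint32_t a) { a = (a << 10) | (a >> 22); return a == 123; }
static int minus_cst(uint32_t a)  { uint32_t tem = 20 - a; return tem == 5; }
static int mask_gt(uint32_t a)    { return (a & ~UINT32_C(7)) > 8; }
static int rshift_gt(uint32_t a)  { return (a >> 2) > 5; }

/* Undo the rotate on the constant instead of rotating a. */
static int rotate_cst_single(uint32_t a) {
  return a == ((UINT32_C(123) >> 10) | (UINT32_C(123) << 22));
}
static int minus_cst_single(uint32_t a) { return a == 15; }
static int mask_gt_single(uint32_t a)   { return a > 15; }
static int rshift_gt_single(uint32_t a) { return a > 23; }

/* Returns 1 when every pair agrees on the sampled inputs. */
static int check_single_compares(void) {
  const uint32_t rot = (UINT32_C(123) >> 10) | (UINT32_C(123) << 22);
  assert(rotate_cst(rot) == 1 && rotate_cst_single(rot) == 1);
  for (uint32_t k = 0; k < 4096u; k++) {
    uint32_t hi = UINT32_MAX - k;
    assert(rotate_cst(k) == rotate_cst_single(k));
    assert(minus_cst(k) == minus_cst_single(k));
    assert(mask_gt(k) == mask_gt_single(k));
    assert(rshift_gt(k) == rshift_gt_single(k));
    assert(minus_cst(hi) == minus_cst_single(hi));
    assert(mask_gt(hi) == mask_gt_single(hi));
    assert(rshift_gt(hi) == rshift_gt_single(hi));
  }
  return 1;
}
```
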
				1107	//===---------------------------------------------------------------------===//
				1108
				1109	From GCC Bug 32605:
				1110	int c(int* x) {return (char)x+2 == (char)x;}
				1111	Should combine to 0. Currently not optimized with "clang
				1112	-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).
				1113
				1114	//===---------------------------------------------------------------------===//
				1115
				1116	int a(unsigned char* b) {return *b > 99;}
				1117	There's an unnecessary zext in the generated code with "clang
				1118	-emit-llvm-bc | opt -std-compile-opts".
				1119
				1120	//===---------------------------------------------------------------------===//
				1121
Eli Friedman	4e16b29	2008-11-30 07:36:04 +0000	[diff] [blame]	1122	int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
				1123	Should be combined to "((b >> 1) | b) & 1". Currently not optimized
				1124	with "clang -emit-llvm-bc | opt -std-compile-opts".
				1125
				1126	//===---------------------------------------------------------------------===//
				1127
				1128	unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
				1129	Should combine to "x | (y & 3)". Currently not optimized with "clang
				1130	-emit-llvm-bc | opt -std-compile-opts".
				1131
				1132	//===---------------------------------------------------------------------===//
				1133
				1134	unsigned a(unsigned a) {return ((a | 1) & 3) | (a & -4);}
				1135	Should combine to "a | 1". Currently not optimized with "clang
				1136	-emit-llvm-bc | opt -std-compile-opts".
				1137
				1138	//===---------------------------------------------------------------------===//
				1139
Eli Friedman	4e16b29	2008-11-30 07:36:04 +0000	[diff] [blame]	1140	int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
				1141	Should fold to "(~a & c) | (a & b)". Currently not optimized with
				1142	"clang -emit-llvm-bc | opt -std-compile-opts".
				1143
				1144	//===---------------------------------------------------------------------===//
				1145
				1146	int a(int a,int b) {return (~(a|b))|a;}
				1147	Should fold to "a|~b". Currently not optimized with "clang
				1148	-emit-llvm-bc | opt -std-compile-opts".
				1149
				1150	//===---------------------------------------------------------------------===//
				1151
				1152	int a(int a, int b) {return (a&&b) || (a&&!b);}
				1153	Should fold to "a". Currently not optimized with "clang -emit-llvm-bc
				1154	| opt -std-compile-opts".
				1155
				1156	//===---------------------------------------------------------------------===//
				1157
				1158	int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
				1159	Should fold to "a ? b : c", or at least something sane. Currently not
				1160	optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
				1161
				1162	//===---------------------------------------------------------------------===//
				1163
				1164	int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
				1165	Should fold to a && (b || c). Currently not optimized with "clang
				1166	-emit-llvm-bc | opt -std-compile-opts".
				1167
				1168	//===---------------------------------------------------------------------===//
				1169
				1170	int a(int x) {return x | ((x & 8) ^ 8);}
				1171	Should combine to x | 8. Currently not optimized with "clang
				1172	-emit-llvm-bc | opt -std-compile-opts".
				1173
				1174	//===---------------------------------------------------------------------===//
				1175
				1176	int a(int x) {return x ^ ((x & 8) ^ 8);}
				1177	Should also combine to x | 8. Currently not optimized with "clang
				1178	-emit-llvm-bc | opt -std-compile-opts".
				1179
				1180	//===---------------------------------------------------------------------===//
				1181
				1182	int a(int x) {return (x & 8) == 0 ? -1 : -9;}
				1183	Should combine to (x | -9) ^ 8. Currently not optimized with "clang
				1184	-emit-llvm-bc | opt -std-compile-opts".
				1185
				1186	//===---------------------------------------------------------------------===//
				1187
				1188	int a(int x) {return (x & 8) == 0 ? -9 : -1;}
				1189	Should combine to x | -9. Currently not optimized with "clang
				1190	-emit-llvm-bc | opt -std-compile-opts".
				1191
				1192	//===---------------------------------------------------------------------===//
				1193
				1194	int a(int x) {return ((x | -9) ^ 8) & x;}
				1195	Should combine to x & -9. Currently not optimized with "clang
				1196	-emit-llvm-bc | opt -std-compile-opts".
				1197
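The five bit-3 identities in this run of entries can be checked together.
This is a sketch with a made-up helper name, and it assumes two's complement
int, so -9 is all ones except bit 3:

```c
#include <assert.h>

/* Returns 1 when all five (x & 8) identities hold for this x. */
static int bit3_identities_hold(int x) {
  int ok = 1;
  ok &= (x | ((x & 8) ^ 8)) == (x | 8);
  ok &= (x ^ ((x & 8) ^ 8)) == (x | 8);
  ok &= ((x & 8) == 0 ? -1 : -9) == ((x | -9) ^ 8);
  ok &= ((x & 8) == 0 ? -9 : -1) == (x | -9);
  ok &= (((x | -9) ^ 8) & x) == (x & -9);
  return ok;
}

/* Returns 1 when the identities hold across a range of inputs. */
static int check_bit3_identities(void) {
  for (int x = -2048; x <= 2048; x++)
    assert(bit3_identities_hold(x) == 1);
  return 1;
}
```
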
				1198	//===---------------------------------------------------------------------===//
				1199
				1200	unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
				1201	Should combine to "a * 0x88888888 >> 31". Currently not optimized
				1202	with "clang -emit-llvm-bc | opt -std-compile-opts".
				1203
				1204	//===---------------------------------------------------------------------===//
				1205
				1206	unsigned a(char* x) {if ((*x & 32) == 0) return b();}
				1207	There's an unnecessary zext in the generated code with "clang
				1208	-emit-llvm-bc | opt -std-compile-opts".
				1209
				1210	//===---------------------------------------------------------------------===//
				1211
				1212	unsigned a(unsigned long long x) {return 40 * (x >> 1);}
				1213	Should combine to "20 * (((unsigned)x) & -2)". Currently not
				1214	optimized with "clang -emit-llvm-bc | opt -std-compile-opts".
				1215
				1216	//===---------------------------------------------------------------------===//
Bill Wendling	3bdcda8	2008-12-02 05:12:47 +0000	[diff] [blame]	1217
				1218	We would like to do the following transform in the instcombiner:
				1219
				1220	-X/C -> X/-C
				1221
				1222	However, this isn't valid if (-X) overflows. We can implement this when we
				1223	have the concept of a "C signed subtraction" operator that is undefined
				1224	on overflow.
				1225
				1226	//===---------------------------------------------------------------------===//
Chris Lattner	88d84b2	2008-12-02 06:32:34 +0000	[diff] [blame]	1227
				1228	This was noticed in the entryblock for grokdeclarator in 403.gcc:
				1229
				1230	%tmp = icmp eq i32 %decl_context, 4
				1231	%decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
				1232	%tmp1 = icmp eq i32 %decl_context_addr.0, 1
				1233	%decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0
				1234
				1235	tmp1 should be simplified to something like:
				1236	(!tmp && decl_context == 1)
				1237
				1238	This allows recursive simplifications, tmp1 is used all over the place in
				1239	the function, e.g. by:
				1240
				1241	%tmp23 = icmp eq i32 %decl_context_addr.1, 0 ; <i1> [#uses=1]
				1242	%tmp24 = xor i1 %tmp1, true ; <i1> [#uses=1]
				1243	%or.cond8 = and i1 %tmp23, %tmp24 ; <i1> [#uses=1]
				1244
				1245	later.
				1246
Chris Lattner	78a7e7c	2008-12-06 19:28:22 +0000	[diff] [blame]	1247	//===---------------------------------------------------------------------===//
				1248
				1249	Store sinking: This code:
				1250
				1251	void f (int n, int *cond, int *res) {
				1252	int i;
				1253	*res = 0;
				1254	for (i = 0; i < n; i++)
				1255	if (*cond)
				1256	*res ^= 234; /* (*) */
				1257	}
				1258
				1259	On this function GVN hoists the fully redundant value of *res, but nothing
				1260	moves the store out. This gives us this code:
				1261
				1262	bb: ; preds = %bb2, %entry
				1263	%.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
				1264	%i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
				1265	%1 = load i32* %cond, align 4
				1266	%2 = icmp eq i32 %1, 0
				1267	br i1 %2, label %bb2, label %bb1
				1268
				1269	bb1: ; preds = %bb
				1270	%3 = xor i32 %.rle, 234
				1271	store i32 %3, i32* %res, align 4
				1272	br label %bb2
				1273
				1274	bb2: ; preds = %bb, %bb1
				1275	%.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
				1276	%indvar.next = add i32 %i.05, 1
				1277	%exitcond = icmp eq i32 %indvar.next, %n
				1278	br i1 %exitcond, label %return, label %bb
				1279
				1280	DSE should sink partially dead stores to get the store out of the loop.
				1281
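What the sunk form would look like at the source level, as a sketch. The name
f_sunk is made up, and cond and res are assumed to be int pointers. The running
value stays in a local and is stored once after the loop:

```c
#include <assert.h>

/* Store to *res on every iteration where *cond is set. */
static void f(int n, int *cond, int *res) {
  int i;
  *res = 0;
  for (i = 0; i < n; i++)
    if (*cond)
      *res ^= 234;
}

/* Store sunk out of the loop. The initial store is kept so that a
   load of *cond that aliases *res still sees 0, as in f(). */
static void f_sunk(int n, int *cond, int *res) {
  int tmp = 0;
  *res = 0;
  for (int i = 0; i < n; i++)
    if (*cond)
      tmp ^= 234;
  *res = tmp;
}

/* Returns 1 when both forms leave the same value in *res. */
static int check_store_sinking(void) {
  for (int c = 0; c <= 1; c++)
    for (int n = 0; n < 8; n++) {
      int cond = c, r1 = -1, r2 = -1;
      f(n, &cond, &r1);
      f_sunk(n, &cond, &r2);
      assert(r1 == r2);
    }
  return 1;
}
```
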
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1282	Here's another partial dead case:
				1283	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395
				1284
Chris Lattner	78a7e7c	2008-12-06 19:28:22 +0000	[diff] [blame]	1285	//===---------------------------------------------------------------------===//
				1286
				1287	Scalar PRE hoists the mul in the common block up to the else:
				1288
				1289	int test (int a, int b, int c, int g) {
				1290	int d, e;
				1291	if (a)
				1292	d = b * c;
				1293	else
				1294	d = b - c;
				1295	e = b * c + g;
				1296	return d + e;
				1297	}
				1298
				1299	It would be better to do the mul once to reduce codesize above the if.
				1300	This is GCC PR38204.
				1301
				1302	//===---------------------------------------------------------------------===//
				1303
				1304	GCC PR37810 is an interesting case where we should sink load/store reload
				1305	into the if block and outside the loop, so we don't reload/store it on the
				1306	non-call path.
				1307
				1308	for () {
				1309	*P += 1;
				1310	if ()
				1311	call();
				1312	else
				1313	...
				1314	->
				1315	tmp = *P
				1316	for () {
				1317	tmp += 1;
				1318	if () {
				1319	*P = tmp;
				1320	call();
				1321	tmp = *P;
				1322	} else ...
				1323	}
				1324	*P = tmp;
				1325
Chris Lattner	8f416f3	2008-12-15 07:49:24 +0000	[diff] [blame]	1326	We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
				1327	we don't sink the store. We need partially dead store sinking.
				1328
Chris Lattner	78a7e7c	2008-12-06 19:28:22 +0000	[diff] [blame]	1329	//===---------------------------------------------------------------------===//
				1330
Chris Lattner	8f416f3	2008-12-15 07:49:24 +0000	[diff] [blame]	1331	[PHI TRANSLATE GEPs]
				1332
Chris Lattner	78a7e7c	2008-12-06 19:28:22 +0000	[diff] [blame]	1333	GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
				1334	leading to excess stack traffic. This could be handled by GVN with some crazy
				1335	symbolic phi translation. The code we get looks like (g is on the stack):
				1336
				1337	bb2: ; preds = %bb1
				1338	..
				1339	%9 = getelementptr %struct.f* %g, i32 0, i32 0
				1340	store i32 %8, i32* %9, align 4
					br label %bb3
				1341
				1342	bb3: ; preds = %bb1, %bb2, %bb
				1343	%c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
				1344	%b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
				1345	%10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
				1346	%11 = load i32* %10, align 4
				1347
				1348	%11 is fully redundant, and in BB2 it should have the value %8.
				1349
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1350	GCC PR33344 is a similar case.
				1351
Chris Lattner	78a7e7c	2008-12-06 19:28:22 +0000	[diff] [blame]	1352	//===---------------------------------------------------------------------===//
				1353
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1354	There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
				1355	GCC testsuite. There are many pre testcases as ssa-pre-*.c
				1356
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1357	//===---------------------------------------------------------------------===//
				1358
				1359	There are some interesting cases in testsuite/gcc.dg/tree-ssa/predcom* in the
				1360	GCC testsuite. For example, predcom-1.c is:
				1361
				1362	for (i = 2; i < 1000; i++)
				1363	fib[i] = (fib[i-1] + fib[i - 2]) & 0xffff;
				1364
				1365	which compiles into:
				1366
				1367	bb1: ; preds = %bb1, %bb1.thread
				1368	%indvar = phi i32 [ 0, %bb1.thread ], [ %0, %bb1 ]
				1369	%i.0.reg2mem.0 = add i32 %indvar, 2
				1370	%0 = add i32 %indvar, 1 ; <i32> [#uses=3]
				1371	%1 = getelementptr [1000 x i32]* @fib, i32 0, i32 %0
				1372	%2 = load i32* %1, align 4 ; <i32> [#uses=1]
				1373	%3 = getelementptr [1000 x i32]* @fib, i32 0, i32 %indvar
				1374	%4 = load i32* %3, align 4 ; <i32> [#uses=1]
				1375	%5 = add i32 %4, %2 ; <i32> [#uses=1]
				1376	%6 = and i32 %5, 65535 ; <i32> [#uses=1]
				1377	%7 = getelementptr [1000 x i32]* @fib, i32 0, i32 %i.0.reg2mem.0
				1378	store i32 %6, i32* %7, align 4
				1379	%exitcond = icmp eq i32 %0, 998 ; <i1> [#uses=1]
				1380	br i1 %exitcond, label %return, label %bb1
				1381
				1382	This is basically:
				1383	LOAD fib[i+1]
				1384	LOAD fib[i]
				1385	STORE fib[i+2]
				1386
				1387	instead of handling this as a loop or other xform, all we'd need to do is teach
				1388	load PRE to phi translate the %0 add (i+1) into the predecessor as (i'+1+1) =
				1389	(i'+2) (where i' is the previous iteration of i). This would find the store
				1390	which feeds it.
				1391
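The effect we want, written by hand as a sketch. The array size, element type
and function names are assumptions of the sketch. Once the store from the
previous iteration is found to feed the load, both loads become values carried
around the loop:

```c
#include <assert.h>

#define N 1000

/* As in predcom-1.c above: two loads and one store per iteration. */
static void fib_loads(unsigned fib[N]) {
  for (int i = 2; i < N; i++)
    fib[i] = (fib[i - 1] + fib[i - 2]) & 0xffff;
}

/* Loads replaced by carried scalars: one store per iteration. */
static void fib_carried(unsigned fib[N]) {
  unsigned prev2 = fib[0], prev1 = fib[1];
  for (int i = 2; i < N; i++) {
    unsigned cur = (prev1 + prev2) & 0xffff;
    fib[i] = cur;
    prev2 = prev1;
    prev1 = cur;
  }
}

/* Returns 1 when both loops fill the array identically. */
static int check_predcom(void) {
  static unsigned a[N], b[N];
  a[0] = b[0] = 0;
  a[1] = b[1] = 1;
  fib_loads(a);
  fib_carried(b);
  for (int i = 0; i < N; i++)
    assert(a[i] == b[i]);
  return 1;
}
```
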
				1392	predcom-2.c is apparently the same as predcom-1.c
				1393	predcom-3.c is very similar but needs loads feeding each other instead of
				1394	store->load.
				1395	predcom-4.c seems the same as the rest.
				1396
				1397
				1398	//===---------------------------------------------------------------------===//
				1399
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1400	Other simple load PRE cases:
Chris Lattner	8f416f3	2008-12-15 07:49:24 +0000	[diff] [blame]	1401	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35287 [LPRE crit edge splitting]
				1402
				1403	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34677 (licm does this, LPRE crit edge)
				1404	llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as | opt -mem2reg -simplifycfg -gvn | llvm-dis
				1405
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1406	//===---------------------------------------------------------------------===//
				1407
				1408	Type based alias analysis:
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1409	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705
				1410
				1411	//===---------------------------------------------------------------------===//
				1412
				1413	When GVN/PRE finds a store of float* to a must-alias pointer when expecting
				1414	an int*, it should turn it into a bitcast. This is a nice generalization of
Chris Lattner	630c99f	2008-12-07 00:15:10 +0000	[diff] [blame]	1415	the SROA hack that would apply to other cases, e.g.:
				1416
				1417	int foo(int C, int *P, float X) {
				1418	if (C) {
				1419	bar();
				1420	*P = 42;
				1421	} else
				1422	*(float*)P = X;
				1423
				1424	return *P;
				1425	}
				1426
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1427
				1428	One example (that requires crazy phi translation) is:
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1429	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16799 [BITCAST PHI TRANS]
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1430
				1431	//===---------------------------------------------------------------------===//
				1432
				1433	A/B get pinned to the stack because we turn an if/then into a select instead
				1434	of PRE'ing the load/store. This may be fixable in instcombine:
				1435	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892
				1436
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1437
				1438
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1439	Interesting missed case because of control flow flattening (should be 2 loads):
				1440	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
Chris Lattner	582048d	2008-12-15 08:32:28 +0000	[diff] [blame]	1441	With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
				1442	opt -mem2reg -gvn -instcombine | llvm-dis
				1443	we miss it because we need 1) GEP PHI TRAN, 2) CRIT EDGE 3) MULTIPLE DIFFERENT
				1444	VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS
Chris Lattner	6a09a74	2008-12-06 22:52:12 +0000	[diff] [blame]	1445
				1446	//===---------------------------------------------------------------------===//
				1447
				1448	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
				1449	We could eliminate the branch condition here, loading from null is undefined:
				1450
				1451	struct S { int w, x, y, z; };
				1452	struct T { int r; struct S s; };
				1453	void bar (struct S, int);
				1454	void foo (int a, struct T b)
				1455	{
				1456	struct S *c = 0;
				1457	if (a)
				1458	c = &b.s;
				1459	bar (*c, a);
				1460	}
				1461
				1462	//===---------------------------------------------------------------------===//
Chris Lattner	88d84b2	2008-12-02 06:32:34 +0000	[diff] [blame]	1463