Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)  This
isn't safe in general, even on darwin.  See the libm implementation of hypot
for examples (which special case when x/y are exactly zero to get signed zeros
etc right).

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack. But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
   x = 1ULL << i;
}

into:
 long long tmp = 1;
 for (i = ...; ++i, tmp+=tmp)
   x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn: X*X*X*X -> t=(X*X) (t*t) to eliminate a multiply.

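In C terms, a hand-reassociated sketch of the same transform:

int pow4(int x) { return x*x*x*x; }               /* 3 multiplies */
int pow4_fast(int x) { int t = x*x; return t*t; } /* 2 multiplies */
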
//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

Reassociate should handle the example in GCC PR16157.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j,int *l) {  return memcmp(j,l,4);  }
int h(int *j, int *l) {  return *j - *l; }

this could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works but can be overly conservative as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should add 'unaligned load/store' nodes, and produce them from code like
this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2      #cond_next
LBB16_1:        #cond_true
        incl _foo
LBB16_2:        #cond_next

emit:
        movl    _foo, %eax
        cmpl    $1, %edi
        sbbl    $-1, %eax
        movl    %eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs.  We could even make an intrinsic for this
if anyone cared enough about sincos.

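A source-level sketch of the combine (sincos here is the GNU-style extension
declared above; its availability is target/libm dependent):

#include <math.h>
void both(double x, double *a, double *b) {
  *a = sin(x);
  *b = cos(x);
  /* -> sincos(x, a, b);  one libcall instead of two */
}
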
//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

void %test(uint* %P) {
        %tmp = load uint* %P
        %tmp14 = or uint %tmp, 3305111552
        %tmp15 = and uint %tmp14, 3321888767
        store uint %tmp15, uint* %P
        ret void
}

//===---------------------------------------------------------------------===//

dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.

Compile:

int bar(int x)
{
  int t = __builtin_clz(x);
  return -(t>>5);
}

to:

_bar:   addic r3,r3,-1
        subfe r3,r3,r3
        blr

//===---------------------------------------------------------------------===//

Legalize should lower cttz like this:
  cttz(x) = popcnt((x-1) & ~x)

on targets that have popcnt but not cttz.  Itanium, what else?

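A C sketch of the identity ((x-1) & ~x sets exactly the trailing-zero bit
positions; __builtin_popcount stands in for the POPCNT node):

unsigned cttz32(unsigned x) {   /* assumes x != 0 */
  return __builtin_popcount((x - 1) & ~x);
}
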
//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
        {
          /* Flip the target bit of each basis state */
          reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }

Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be just
so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias reg->node[i],
but alas...

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

typedef unsigned long long BITBOARD;
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & 1;
  return count;
}

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

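A concrete instance of the missing fold (unsigned operands, so the udiv case):

int f(unsigned x) { return x / 10 > 5; }   /* should fold to: return x > 59; */
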
//===---------------------------------------------------------------------===//

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level.  We need a "loops to memcpy"
pass.

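The minimal shape of a loop such a pass would target (names hypothetical; the
rewrite shown in the comment is the intent, not an existing pass):

void copy_history(int *dst, const int *src, int n) {
  for (int i = 0; i < n; i++)      /* unit stride, no overlap */
    dst[i] = src[i];
  /* -> memcpy(dst, src, n * sizeof(int)); */
}
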
//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

//===---------------------------------------------------------------------===//

Promote for i32 bswap can use i64 bswap + shr.  Useful on targets with 64-bit
regs and bswap, like itanium.

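In C, assuming a compiler that provides __builtin_bswap64 as a stand-in for
the i64 BSWAP node:

unsigned bswap32_via_64(unsigned x) {
  return (unsigned)(__builtin_bswap64(x) >> 32);
}
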
//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two identical IV's (after promotion) on PPC/ARM:

LBB1_1: @bb.preheader
        mov r3, #0
        mov r2, r3
        mov r1, r3
LBB1_2: @bb
        ldr r12, LCPI1_0
        ldr r12, [r12]
        strh r2, [r12]
        ldr r12, LCPI1_1
        ldr r12, [r12]
        strh r3, [r12]
        add r1, r1, #1  <- [0,+,1]
        add r3, r3, #4
        add r2, r2, #1  <- [0,+,1]
        cmp r1, r0
        bne LBB1_2      @bb

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
        %tmp.1 = and i32 %a, 1          ; <i32> [#uses=1]
        %tmp.2 = icmp ne i32 %tmp.1, 0          ; <i1> [#uses=1]
        br i1 %tmp.2, label %then.0, label %else.0

then.0:         ; preds = %entry
        %tmp.5 = add i32 %a, -1         ; <i32> [#uses=1]
        %tmp.3 = call i32 @t4( i32 %tmp.5 )             ; <i32> [#uses=1]
        br label %return

else.0:         ; preds = %entry
        %tmp.7 = icmp ne i32 %a, 0              ; <i1> [#uses=1]
        br i1 %tmp.7, label %then.1, label %return

then.1:         ; preds = %else.0
        %tmp.11 = add i32 %a, -2                ; <i32> [#uses=1]
        %tmp.9 = call i32 @t4( i32 %tmp.11 )            ; <i32> [#uses=1]
        br label %return

return:         ; preds = %then.1, %else.0, %then.0
        %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                            [ %tmp.9, %then.1 ]
        ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination is not transforming this function, because it is
returning n, which fails the isDynamicConstant check in the accumulator
recursion checks.

long long fib(const long long n) {
  switch(n) {
  case 0:
  case 1:
    return n;
  default:
    return fib(n-1) + fib(n-2);
  }
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
 if (n == 0)
   return 0;
 return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative.  "return foo() << 1" can be tail recursion eliminated.

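For pow2m1 the accumulator form TRE should produce looks roughly like this
hand-written sketch (assuming n >= 0, as the recursion does):

int pow2m1_loop(int n) {
  int acc = 0;
  for (; n != 0; --n)
    acc = 2 * acc + 1;   /* builds 2^n - 1 */
  return acc;
}
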
//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
        %tmp = load i32* %x             ; <i32> [#uses=0]
        %tmp.foo = call i32 @foo( i32* %x )             ; <i32> [#uses=1]
        ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
        %tmp3 = call i32 @foo( i32* %x )                ; <i32> [#uses=1]
        ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

"basicaa" should know how to look through "or" instructions that act like add
instructions.  For example in this code, the x*4+1 is turned into x*4 | 1, and
basicaa can't analyze the array subscript, leading to duplicated loads in the
generated code:

void test(int X, int Y, int a[]) {
  int i;
  for (i=2; i<1000; i+=4) {
    a[i+0] = a[i-1+0]*a[i-2+0];
    a[i+1] = a[i-1+1]*a[i-2+1];
    a[i+2] = a[i-1+2]*a[i-2+2];
    a[i+3] = a[i-1+3]*a[i-2+3];
  }
}

BasicAA also doesn't do this for add.  It needs to know that &A[i+1] != &A[i].

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
        subl    $28, %esp
        call    "L1$pb"
"L1$pb":
        popl    %eax
        cmpl    $0, 32(%esp)
        je      LBB1_2  # cond_true
LBB1_1: # return
        # ...
        addl    $28, %esp
        ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

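One possible shape of the lowered code (cases and values invented for
illustration; a real implementation would search for a cheap collision-free
hash -- here x % 7 happens to be one for these keys):

int dispatch(unsigned x) {
  /* switch (x) { case 5: return 10; case 1000: return 20;
                  case 123456: return 30; default: return -1; } */
  static const unsigned key[7] = { 0, 0, 0, 0, 123456, 5, 1000 };
  static const int      val[7] = { -1, -1, -1, -1, 30, 10, 20 };
  unsigned h = x % 7;                /* collision-free over the case keys */
  return key[h] == x ? val[h] : -1;  /* verify key, else default */
}
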
//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations.  On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar.  On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable.  For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:
        pushl   %esi
        movl    8(%esp), %eax
        movb    _m_HotKey+3, %cl
        movb    _m_HotKey+4, %dl
        movb    _m_HotKey+2, %ch
        movw    _m_HotKey, %si
        movw    %si, (%eax)
        movb    %ch, 2(%eax)
        movb    %cl, 3(%eax)
        movb    %dl, 4(%eax)
        popl    %esi
        ret     $4

GCC produces:

__Z9GetHotKeyv:
        movl    _m_HotKey, %edx
        movl    4(%esp), %eax
        movl    %edx, (%eax)
        movzwl  _m_HotKey+4, %edx
        movw    %dx, 4(%eax)
        ret     $4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

        %struct.THotKey = type { i16, i8, i8, i8 }
define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
...
        %tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
        %tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
        %tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
        %tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

This GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34043
contains a testcase that compiles down to:

        %struct.XMM128 = type { <4 x float> }
..
        %src = alloca %struct.XMM128
..
        %tmp6263 = bitcast %struct.XMM128* %src to <2 x i64>*
        %tmp65 = getelementptr %struct.XMM128* %src, i32 0, i32 0
        store <2 x i64> %tmp5899, <2 x i64>* %tmp6263, align 16
        %tmp66 = load <4 x float>* %tmp65, align 16
        %tmp71 = add <4 x float> %tmp66, %tmp66

If the mid-level optimizer turned the bitcast of pointer + store of tmp5899
into a bitcast of the vector value and a store to the pointer, then the
store->load could be easily removed.

//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,1,1,1,1,1,1,1};
  foo(input);
}

We currently compile this into a memcpy from a global array since the
initializer is fairly large and not memset'able.  This is good, but the memcpy
gets lowered to load/stores in the code generator.  This is also ok, except
that the codegen lowering for memcpy doesn't handle the case when the source
is a constant global.  This gives us atrocious code like this:

        call    "L1$pb"
"L1$pb":
        popl    %eax
        movl    _C.0.1444-"L1$pb"+32(%eax), %ecx
        movl    %ecx, 40(%esp)
        movl    _C.0.1444-"L1$pb"+20(%eax), %ecx
        movl    %ecx, 28(%esp)
        movl    _C.0.1444-"L1$pb"+36(%eax), %ecx
        movl    %ecx, 44(%esp)
        movl    _C.0.1444-"L1$pb"+44(%eax), %ecx
        movl    %ecx, 52(%esp)
        movl    _C.0.1444-"L1$pb"+40(%eax), %ecx
        movl    %ecx, 48(%esp)
        movl    _C.0.1444-"L1$pb"+12(%eax), %ecx
        movl    %ecx, 20(%esp)
        movl    _C.0.1444-"L1$pb"+4(%eax), %ecx
...

instead of:
        movl    $1, 16(%esp)
        movl    $0, 20(%esp)
        movl    $1, 24(%esp)
        movl    $0, 28(%esp)
        movl    $1, 32(%esp)
        movl    $0, 36(%esp)
        ...

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef". Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop.  One trivial example is:

#include <stdio.h>
int main() {
    int nRet = 17;
    int nLoop;
    for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
        if ( nLoop & 1 )
            nRet += 2;
        else
            nRet -= 1;
    }
    return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size.  The resultant code would then also be suitable for
exit value computation.

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc, x86,
etc.  On X86, we miss a bunch of 'rotate by variable' cases because the rotate
matching code in dag combine doesn't look through truncates aggressively
enough.  Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y){
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x){
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x){
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f2/f3/f4 right.  On x86-32, a few of these
generate truly horrible code, instead of using shld and friends.  On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness.  PPC64 misses f, f5 and f6.  CellSPU aborts in isel.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together.  For
example, it is useful to merge memcpy(a,b,strlen(b)+1) -> strcpy(a,b).  This
can only be done safely if "b" isn't modified between the strlen and memcpy of
course.

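A source-level sketch of the merged form (valid only when nothing writes b
in between):

#include <string.h>
void f(char *a, const char *b) {
  memcpy(a, b, strlen(b) + 1);   /* copies the terminating NUL too */
  /* -> strcpy(a, b);  one pass over the string instead of two */
}
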
//===---------------------------------------------------------------------===//

We should be able to evaluate this loop:

int test(int x_offs) {
        while (x_offs > 4)
           x_offs -= 4;
        return x_offs;
}

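The closed form SCEV should find, hand-derived (the loop only runs while
x_offs > 4, and leaves the value in [1,4] with the same residue mod 4):

int test_closed(int x_offs) {
  return x_offs > 4 ? (x_offs - 1) % 4 + 1 : x_offs;
}
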
//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
 return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

//===---------------------------------------------------------------------===//

We generate a horrible libcall for llvm.powi.  For example, we compile:

#include <cmath>
double f(double a) { return std::pow(a, 4); }

into:

__Z1fd:
        subl    $12, %esp
        movsd   16(%esp), %xmm0
        movsd   %xmm0, (%esp)
        movl    $4, 8(%esp)
        call    L___powidf2$stub
        addl    $12, %esp
        ret

GCC produces:

__Z1fd:
        subl    $12, %esp
        movsd   16(%esp), %xmm0
        mulsd   %xmm0, %xmm0
        mulsd   %xmm0, %xmm0
        movsd   %xmm0, (%esp)
        fldl    (%esp)
        addl    $12, %esp
        ret

//===---------------------------------------------------------------------===//

We compile this program: (from GCC PR11680)
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

Into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%    0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%    0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2).  */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper).  Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

I think this basically amounts to a dag combine to simplify comparisons against
multiply hi's into a comparison against the mullo.

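The mullo form for the n % 3 == 0 case above (the Granlund/Montgomery trick:
0xAAAAAAAB is the inverse of 3 mod 2^32, since 3 * 0xAAAAAAAB == 2^33 + 1,
and 0x55555555 is (2^32 - 1) / 3):

int divisible_by_3(unsigned n) {
  return n * 0xAAAAAAABu <= 0x55555555u;   /* multiply-low + compare only */
}
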
//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

Instcombine will merge comparisons like (x >= 10) && (x < 20) by producing (x -
10) u< 10, but only when the comparisons have matching sign.

This could be converted with a similar technique. (PR1941)

define i1 @test(i8 %x) {
  %A = icmp uge i8 %x, 5
  %B = icmp slt i8 %x, 20
  %C = and i1 %A, %B
  ret i1 %C
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6       ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0.  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK    (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
   if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
       f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 15241:
unsigned int
foo (unsigned int a, unsigned int b)
{
 if (a <= 7 && b <= 7)
   baz ();
}
Should combine to "(a|b) <= 7".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 3756:
int
pn (int n)
{
    return (n >= 0 ? 1 : -1);
}
Should combine to (n >> 31) | 1.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

From GCC Bug 28685:
int test(int a, int b)
{
  int lt = a < b;
  int eq = a == b;

  return (lt || eq);
}
Should combine to "a <= b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

void a(int variable)
{
    if (variable == 4 || variable == 6)
        bar();
}
This should optimize to "if ((variable | 2) == 6)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts | llc".

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing.  Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
 return (abs(x)) >= 0;
}
This should optimize to x != INT_MIN. (With -fwrapv.)  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
 a = (a << 10) | (a >> 22);
 if (a == 123)
   bar ();
}
void
minus_cst (unsigned int a)
{
 unsigned int tem;

 tem = 20 - a;
 if (tem == 5)
   bar ();
}
void
mask_gt (unsigned int a)
{
 /* This is equivalent to a > 15.  */
 if ((a & ~7) > 8)
   bar ();
}
void
rshift_gt (unsigned int a)
{
 /* This is equivalent to a > 23.  */
 if ((a >> 2) > 5)
   bar ();
}
All should simplify to a single comparison.  All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned char* b) {return *b > 99;}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return ((a | 1) & 3) | (a & -4);}
Should combine to "a | 1".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)".  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a". Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane.  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c).  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -1 : -9;}
Should combine to (x | -9) ^ 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return (x & 8) == 0 ? -9 : -1;}
Should combine to x | -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

We would like to do the following transform in the instcombiner:

  -X/C -> X/-C

However, this isn't valid if (-X) overflows: for X = INT_MIN, -X wraps back to
INT_MIN, so -X/C and X/-C produce results of opposite sign.  We can implement
this when we have the concept of a "C signed subtraction" operator which is
undefined on overflow.

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

        %tmp = icmp eq i32 %decl_context, 4
        %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
        %tmp1 = icmp eq i32 %decl_context_addr.0, 1
        %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp || decl_context == 1)

This allows recursive simplifications, tmp1 is used all over the place in
the function, e.g. by:

        %tmp23 = icmp eq i32 %decl_context_addr.1, 0            ; <i1> [#uses=1]
        %tmp24 = xor i1 %tmp1, true             ; <i1> [#uses=1]
        %or.cond8 = and i1 %tmp23, %tmp24               ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

Store sinking: This code:

void f (int n, int *cond, int *res) {
    int i;
    *res = 0;
    for (i = 0; i < n; i++)
        if (*cond)
            *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out.  This gives us this code:

bb:             ; preds = %bb2, %entry
        %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
        %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
        %1 = load i32* %cond, align 4
        %2 = icmp eq i32 %1, 0
        br i1 %2, label %bb2, label %bb1

bb1:            ; preds = %bb
        %3 = xor i32 %.rle, 234
        store i32 %3, i32* %res, align 4
        br label %bb2

bb2:            ; preds = %bb, %bb1
        %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
        %indvar.next = add i32 %i.05, 1
        %exitcond = icmp eq i32 %indvar.next, %n
        br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partial dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.

//===---------------------------------------------------------------------===//

GCC PR37810 is an interesting case where we should sink load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
}
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store.  We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[PHI TRANSLATE GEPs]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack
leading to excess stack traffic. This could be handled by GVN with some crazy
symbolic phi translation.  The code we get looks like (g is on the stack):

bb2:            ; preds = %bb1
..
        %9 = getelementptr %struct.f* %g, i32 0, i32 0
        store i32 %8, i32* %9, align 4
        br label %bb3

bb3:            ; preds = %bb1, %bb2, %bb
        %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
        %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
        %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
        %11 = load i32* %10, align 4

%11 is fully redundant, and in BB2 it should have the value %8.

GCC PR33344 is a similar case.

//===---------------------------------------------------------------------===//

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite, as well as many PRE testcases named ssa-pre-*.c.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite.  For example, predcom-1.c is:

 for (i = 2; i < 1000; i++)
    fib[i] = (fib[i-1] + fib[i - 2]) & 0xffff;

which compiles into:

bb1:            ; preds = %bb1, %bb1.thread
        %indvar = phi i32 [ 0, %bb1.thread ], [ %0, %bb1 ]
        %i.0.reg2mem.0 = add i32 %indvar, 2
        %0 = add i32 %indvar, 1         ; <i32> [#uses=3]
        %1 = getelementptr [1000 x i32]* @fib, i32 0, i32 %0
        %2 = load i32* %1, align 4              ; <i32> [#uses=1]
        %3 = getelementptr [1000 x i32]* @fib, i32 0, i32 %indvar
        %4 = load i32* %3, align 4              ; <i32> [#uses=1]
        %5 = add i32 %4, %2             ; <i32> [#uses=1]
        %6 = and i32 %5, 65535          ; <i32> [#uses=1]
        %7 = getelementptr [1000 x i32]* @fib, i32 0, i32 %i.0.reg2mem.0
        store i32 %6, i32* %7, align 4
        %exitcond = icmp eq i32 %0, 998         ; <i1> [#uses=1]
        br i1 %exitcond, label %return, label %bb1

This is basically:
  LOAD fib[i+1]
  LOAD fib[i]
  STORE fib[i+2]

instead of handling this as a loop or other xform, all we'd need to do is teach
load PRE to phi translate the %0 add (i+1) into the predecessor as (i'+1+1) =
(i'+2) (where i' is the previous iteration of i).  This would find the store
which feeds it.

predcom-2.c is apparently the same as predcom-1.c
predcom-3.c is very similar but needs loads feeding each other instead of
store->load.
predcom-4.c seems the same as the rest.

//===---------------------------------------------------------------------===//

Other simple load PRE cases:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35287 [LPRE crit edge splitting]

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34677 (licm does this, LPRE crit edge)
  llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as | opt -mem2reg -simplifycfg -gvn | llvm-dis

//===---------------------------------------------------------------------===//

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

//===---------------------------------------------------------------------===//

When GVN/PRE finds a store of float* to a must aliases pointer when expecting
an int*, it should turn it into a bitcast.  This is a nice generalization of
the SROA hack that would apply to other cases, e.g.:

int foo(int C, int *P, float X) {
  if (C) {
    bar();
    *P = 42;
  } else
    *(float*)P = X;

   return *P;
}

One example (that requires crazy phi translation) is:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16799 [BITCAST PHI TRANS]

//===---------------------------------------------------------------------===//

A/B get pinned to the stack because we turn an if/then into a select instead
of PRE'ing the load/store.  This may be fixable in instcombine:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37892

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
      opt -mem2reg -gvn -instcombine | llvm-dis
we miss it because we need 1) GEP PHI TRAN, 2) CRIT EDGE 3) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, loading from null is undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "") -> strlen(x)
strcspn("", x) -> 0
strspn("", x) -> 0
strspn(x, "") -> strlen(x)
strspn(x, "a") -> strchr(x, 'a')-x

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character.  See PR3253 for some notes on
codegen.

456.hmmer apparently uses strcspn and strspn a lot.  471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
  else if (strchr ("<>", *intel_parser.op_string)

Those should be turned into a switch.

//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

        %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)    ; uses = 1
        %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
        call void @llvm.memcpy.i32(i8* %endptr,
          i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
        %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple reasons.  First, in this:

        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)

The strlen could be replaced with: %strlen = sub %3072, %3073, because the
strcpy call returns a pointer to the end of the string.  Based on that, the
endptr GEP just becomes equal to 3073, which eliminates a strlen call and GEP.

Second, the memcpy+strlen strlen can be replaced with:

        %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

Because the destination was just copied into the specified memory buffer.  This,
in turn, can be constant folded to "4".

In other code, it contains:

        %endptr6978 = bitcast i8* %endptr69 to i32*
        store i32 7107374, i32* %endptr6978, align 1
        %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded.  Whatever is producing this should probably
be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit
        %682 = getelementptr i8** %argv, i32 6          ; <i8**> [#uses=2]
        %683 = load i8** %682, align 4          ; <i8*> [#uses=4]
        %684 = load i8* %683, align 1           ; <i8> [#uses=1]
        %685 = icmp eq i8 %684, 0               ; <i1> [#uses=1]
        br i1 %685, label %bb10, label %bb9

bb9:            ; preds = %bb8
        %686 = call i32 @strlen(i8* %683) nounwind readonly
        %687 = icmp ugt i32 %686, 254           ; <i1> [#uses=1]
        br i1 %687, label %bb10, label %bb11

bb10:           ; preds = %bb9, %bb8
        %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

//===---------------------------------------------------------------------===//

I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
which looks like:
        %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0


bb62:           ; preds = %bb55, %bb53
        %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
        %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
        %172 = add i32 %171, -1         ; <i32> [#uses=1]
        %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172

...  no stores ...
       br i1 %or.cond, label %bb65, label %bb72

bb65:           ; preds = %bb62
        store i8 0, i8* %173, align 1
        br label %bb72

bb72:           ; preds = %bb65, %bb62
        %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
        %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1

Note that on the bb62->bb72 path, that the %177 strlen call is partially
redundant with the %171 call.  At worst, we could shove the %177 strlen call
up into the bb65 block moving it out of the bb62->bb72 path.   However, note
that bb65 stores to the string, zeroing out the last byte.  This means that on
that path the value of %177 is actually just %171-1.  A sub is cheaper than a
strlen!

This pattern repeats several times, basically doing:

  A = strlen(P);
  P[A-1] = 0;
  B = strlen(P);
  where it is "obvious" that B = A-1.

//===---------------------------------------------------------------------===//

186.crafty contains this interesting pattern:

%77 = call i8* @strstr(i8* getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0),
                       i8* %30)
%phitmp648 = icmp eq i8* %77, getelementptr ([6 x i8]* @"\01LC5", i32 0, i32 0)
br i1 %phitmp648, label %bb70, label %bb76

bb70:           ; preds = %OptionMatch.exit91, %bb69
  %78 = call i32 @strlen(i8* %30) nounwind readonly align 1     ; <i32> [#uses=1]

This is basically:
  cststr = "abcdef";
  if (strstr(cststr, P) == cststr) {
     x = strlen(P);
     ...

The strstr call would be significantly cheaper written as:

cststr = "abcdef";
if (memcmp(P, str, strlen(P)))
  x = strlen(P);

This is memcmp+strlen instead of strstr.  This also makes the strlen fully
redundant.

//===---------------------------------------------------------------------===//

186.crafty also contains this code:

%1906 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1907 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1906
%1908 = call i8* @strcpy(i8* %1907, i8* %1905) nounwind align 1
%1909 = call i32 @strlen(i8* getelementptr ([32 x i8]* @pgn_event, i32 0,i32 0))
%1910 = getelementptr [32 x i8]* @pgn_event, i32 0, i32 %1909

The last strlen is computable as 1908-@pgn_event, which means 1910=1908.

//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
        i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
        i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call @printf(i8* ...   @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

  memcpy(globalarray, "string");
  printf(...,  globalarray);

Anyway, by knowing that printf just reads the memory and forward substituting
the string directly into the printf, this eliminates reads from globalarray.
Since this pattern occurs frequently in crafty (due to the "DisplayTime" and
other similar functions) there are many stores to "out".  Once all the printfs
stop using "out", all that is left is the memcpy's into it.  This should allow
globalopt to remove the "stored only" global.

//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
  %tmp0 = load i8* %p
  %tmp1 = ashr i8 %tmp0, 5
  %tmp2 = sext i8 %tmp1 to i32
  ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load with a shift.
For example, on x86 this currently gets this:

        movb    (%eax), %al
        sarb    $5, %al
        movsbl  %al, %eax

while it could get this:

        movsbl  (%eax), %eax
        sarl    $5, %eax

//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; }     // --> return false
int test2(int x) { return 2-x == x; }    // --> return x == 1 ?

Always foldable for odd constants, what is the rule for even?

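The rule, reasoning in wrapping two's-complement (what the IR icmp sees, even
though signed overflow is undefined in C itself): C-x == x is 2*x == C
(mod 2^32); the left side is always even, so odd C folds to false, and even C
has exactly two solutions, x == C/2 and x == C/2 + 2^31.  A brute-force check
of that claim at i8 width:

#include <stdio.h>
int main(void) {
  for (int C = 0; C < 256; C++)
    for (int x = 0; x < 256; x++)
      if (((C - x) & 255) == x &&
          ((C & 1) || (x != C / 2 && x != C / 2 + 128)))
        printf("counterexample: C=%d x=%d\n", C, x);  /* never fires */
  return 0;
}
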
//===---------------------------------------------------------------------===//

PR 3381: GEP to field of size 0 inside a struct could be turned into GEP
for next field in struct (which is at same address).

For example: store of float into { {{}}, float } could be turned into a store
to the float directly.

//===---------------------------------------------------------------------===//

#include <math.h>
double foo(double a) {    return sin(a); }

This compiles into this on x86-64 Linux:
foo:
        subq    $8, %rsp
        call    sin
        addq    $8, %rsp
        ret
vs:

foo:
        jmp sin

//===---------------------------------------------------------------------===//

The arg promotion pass should make use of nocapture to make its alias analysis
stuff much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int int_char(char m) {if(m>7) return 0; return m;}

//===---------------------------------------------------------------------===//

Instcombine should replace the load with a constant in:

  static const char x[4] = {'a', 'b', 'c', 'd'};

  unsigned int y(void) {
    return *(unsigned int *)x;
  }

It currently only does this transformation when the size of the constant
is the same as the size of the integer (so, try x[5]) and the last byte
is a null (making it a C string). There's no need for these restrictions.

//===---------------------------------------------------------------------===//

InstCombine's "turn load from constant into constant" optimization should be
more aggressive in the presence of bitcasts.  For example, because of unions,
this code:

union vec2d {
    double e[2];
    double v __attribute__((vector_size(16)));
};
typedef union vec2d vec2d;

static vec2d a={{1,2}}, b={{3,4}};

vec2d foo () {
    return (vec2d){ .v = a.v + b.v * (vec2d){{5,5}}.v };
}

Compiles into:

@a = internal constant %0 { [2 x double]
           [double 1.000000e+00, double 2.000000e+00] }, align 16
@b = internal constant %0 { [2 x double]
           [double 3.000000e+00, double 4.000000e+00] }, align 16
...
define void @foo(%struct.vec2d* noalias nocapture sret %agg.result) nounwind {
entry:
        %0 = load <2 x double>* getelementptr (%struct.vec2d*
           bitcast (%0* @a to %struct.vec2d*), i32 0, i32 0), align 16
        %1 = load <2 x double>* getelementptr (%struct.vec2d*
           bitcast (%0* @b to %struct.vec2d*), i32 0, i32 0), align 16

Instcombine should be able to optimize away the loads (and thus the globals).

//===---------------------------------------------------------------------===//