Target Independent Opportunities:

//===---------------------------------------------------------------------===//

We should make the various targets' "IMPLICIT_DEF" instructions be a single
target-independent opcode like TargetInstrInfo::INLINEASM.  This would allow
us to eliminate the TargetInstrDesc::isImplicitDef() method, and would allow
us to avoid having to define this for every target for every register class.

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent.

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)
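
A source-level sketch of the intended expansion (illustrative only; it drops
the overflow/underflow protection hypot normally provides, which is why it is
only legal when precision and errno don't matter):

#include <math.h>

double fast_hypot(double x, double y) {
  /* sqrt here maps onto llvm.sqrt; no hypot libcall is emitted. */
  return sqrt(x * x + y * y);
}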

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
	movl Y, %eax
	shll $3, %eax
	orl X, %eax
	movl %eax, X
	ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack.  But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
   x = 1ULL << i;

into:
 long long tmp = 1;
 for (i = ...; ++i, tmp+=tmp)
   x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)
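
In C terms, this is a sign test of a loaded word; only the byte holding the
sign bit actually needs to be loaded (which byte that is depends on the
target's endianness).  A sketch of the source pattern:

int is_negative(const int *P) {
  return *P < 0;   /* answerable by loading just the sign-carrying byte */
}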

//===---------------------------------------------------------------------===//

Reassociate should turn: X*X*X*X -> t=(X*X) (t*t) to eliminate a multiply.

//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

Reassociate should handle the example in GCC PR16157.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j,int *l) {  return memcmp(j,l,4);  }
int h(int *j, int *l) {  return *j - *l; }

this could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works but can be overly conservative as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should add 'unaligned load/store' nodes, and produce them from code like
this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

	movl 136(%esp), %eax
	cmpl $0, %eax
	je LBB16_2	#cond_next
LBB16_1:	#cond_true
	incl _foo
LBB16_2:	#cond_next

emit:
	movl _foo, %eax
	cmpl $1, %edi
	sbbl $-1, %eax
	movl %eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687
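
A sketch of the source pattern the combine would catch (illustrative function,
not from a real benchmark):

#include <math.h>

void polar_to_cartesian(double r, double theta, double *x, double *y) {
  /* Two libcalls today; ideally a single sincos(theta, &s, &c). */
  *x = r * cos(theta);
  *y = r * sin(theta);
}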

//===---------------------------------------------------------------------===//

Scalar Repl cannot currently promote this testcase to 'ret long cst':

  %struct.X = type { i32, i32 }
  %struct.Y = type { %struct.X }

define i64 @bar() {
  %retval = alloca %struct.Y, align 8
  %tmp12 = getelementptr %struct.Y* %retval, i32 0, i32 0, i32 0
  store i32 0, i32* %tmp12
  %tmp15 = getelementptr %struct.Y* %retval, i32 0, i32 0, i32 1
  store i32 1, i32* %tmp15
  %retval.upgrd.1 = bitcast %struct.Y* %retval to i64*
  %retval.upgrd.2 = load i64* %retval.upgrd.1
  ret i64 %retval.upgrd.2
}

it should be extended to do so.

//===---------------------------------------------------------------------===//

-scalarrepl should promote this to be a vector scalar.

  %struct..0anon = type { <4 x float> }

define void @test1(<4 x float> %V, float* %P) {
  %u = alloca %struct..0anon, align 16
  %tmp = getelementptr %struct..0anon* %u, i32 0, i32 0
  store <4 x float> %V, <4 x float>* %tmp
  %tmp1 = bitcast %struct..0anon* %u to [4 x float]*
  %tmp.upgrd.1 = getelementptr [4 x float]* %tmp1, i32 0, i32 1
  %tmp.upgrd.2 = load float* %tmp.upgrd.1
  %tmp3 = mul float %tmp.upgrd.2, 2.000000e+00
  store float %tmp3, float* %P
  ret void
}

//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

void %test(uint* %P) {
  %tmp = load uint* %P
  %tmp14 = or uint %tmp, 3305111552
  %tmp15 = and uint %tmp14, 3321888767
  store uint %tmp15, uint* %P
  ret void
}

//===---------------------------------------------------------------------===//

dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.

Compile:

int bar(int x)
{
  int t = __builtin_clz(x);
  return -(t>>5);
}

to:

_bar:   addic r3,r3,-1
        subfe r3,r3,r3
        blr

//===---------------------------------------------------------------------===//

Legalize should lower cttz like this:
  cttz(x) = popcnt((x-1) & ~x)

on targets that have popcnt but not cttz (itanium, what else?).  ctlz can be
handled similarly by smearing the leading one bit rightward and popcounting
the complement.
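
A C sketch of both popcount-based lowerings, assuming a 32-bit operand and
using the GCC builtin purely as a stand-in for a hardware popcnt:

#include <stdint.h>

static unsigned popcount32(uint32_t x) { return (unsigned)__builtin_popcount(x); }

/* cttz(x), x != 0: the bits below the lowest set bit, counted. */
unsigned cttz32(uint32_t x) {
  return popcount32((x - 1) & ~x);
}

/* ctlz(x), x != 0: smear the leading one bit right, count the zeros above it. */
unsigned ctlz32(uint32_t x) {
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  return popcount32(~x);
}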

//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
	{
	  /* Flip the target bit of each basis state */
	  reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
	}

Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be just
so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias reg->node[i],
but alas...

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine:

unsigned int swap_32(unsigned int v) {
  v = ((v & 0x00ff00ffU) << 8)  | ((v & 0xff00ff00U) >> 8);
  v = ((v & 0x0000ffffU) << 16) | ((v & 0xffff0000U) >> 16);
  return v;
}

Nor is this (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.
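
A minimal C example of the construct (hypothetical, just to show the shape):

int in_bucket(unsigned x) {
  /* Folding the divide into the compare turns this into a simple
     range check on x, with no division at all. */
  return x / 10 == 3;
}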

//===---------------------------------------------------------------------===//

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level.  We need a "loops to memcpy"
pass.
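
The shape of loop such a pass would need to recognize (a sketch, not the
actual viterbi source):

void copy_history(char *dst, const char *src, int n) {
  int i;
  /* Element-wise copy of contiguous memory: really memcpy(dst, src, n). */
  for (i = 0; i < n; i++)
    dst[i] = src[i];
}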

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

//===---------------------------------------------------------------------===//

Promote for i32 bswap can use i64 bswap + shr.  Useful on targets with 64-bit
regs and bswap, like itanium.
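
The identity in question, sketched with the GCC builtins standing in for the
target instructions:

#include <stdint.h>

uint32_t bswap32_via_64(uint32_t x) {
  /* Zero-extend, byte-swap the 64-bit value, then shift the result back down. */
  return (uint32_t)(__builtin_bswap64((uint64_t)x) >> 32);
}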

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two identical IV's (after promotion) on PPC/ARM:

LBB1_1: @bb.preheader
        mov r3, #0
        mov r2, r3
        mov r1, r3
LBB1_2: @bb
        ldr r12, LCPI1_0
        ldr r12, [r12]
        strh r2, [r12]
        ldr r12, LCPI1_1
        ldr r12, [r12]
        strh r3, [r12]
        add r1, r1, #1  <- [0,+,1]
        add r3, r3, #4
        add r2, r2, #1  <- [0,+,1]
        cmp r1, r0
        bne LBB1_2      @bb


//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
	%tmp.1 = and i32 %a, 1		; <i32> [#uses=1]
	%tmp.2 = icmp ne i32 %tmp.1, 0		; <i1> [#uses=1]
	br i1 %tmp.2, label %then.0, label %else.0

then.0:		; preds = %entry
	%tmp.5 = add i32 %a, -1		; <i32> [#uses=1]
	%tmp.3 = call i32 @t4( i32 %tmp.5 )		; <i32> [#uses=1]
	br label %return

else.0:		; preds = %entry
	%tmp.7 = icmp ne i32 %a, 0		; <i1> [#uses=1]
	br i1 %tmp.7, label %then.1, label %return

then.1:		; preds = %else.0
	%tmp.11 = add i32 %a, -2		; <i32> [#uses=1]
	%tmp.9 = call i32 @t4( i32 %tmp.11 )		; <i32> [#uses=1]
	br label %return

return:		; preds = %then.1, %else.0, %then.0
	%result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
	                    [ %tmp.9, %then.1 ]
	ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination is not transforming this function, because it is
returning n, which fails the isDynamicConstant check in the accumulator
recursion checks.

long long fib(const long long n) {
  switch(n) {
    case 0:
    case 1:
      return n;
    default:
      return fib(n-1) + fib(n-2);
  }
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
 if (n == 0)
   return 0;
 return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative.  "return foo() << 1" can be tail recursion eliminated.
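
A hand-derived sketch of the accumulator form TRE would need to build for
pow2m1 (maintaining result = mul*pow2m1(n) + add as the recursion unwinds):

int pow2m1_iter(int n) {
  int mul = 1, add = 0;
  while (n != 0) {
    add += mul;   /* the "+ 1", scaled by the accumulated multiplier */
    mul *= 2;     /* the "2 *" (really a shl) folded into the multiplier */
    --n;
  }
  return add;     /* base case contributes mul*0 */
}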

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
	%tmp = load i32* %x		; <i32> [#uses=0]
	%tmp.foo = call i32 @foo( i32* %x )		; <i32> [#uses=1]
	ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
	%tmp3 = call i32 @foo( i32* %x )		; <i32> [#uses=1]
	ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

"basicaa" should know how to look through "or" instructions that act like add
instructions.  For example in this code, the x*4+1 is turned into x*4 | 1, and
basicaa can't analyze the array subscript, leading to duplicated loads in the
generated code:

void test(int X, int Y, int a[]) {
  int i;
  for (i=2; i<1000; i+=4) {
    a[i+0] = a[i-1+0]*a[i-2+0];
    a[i+1] = a[i-1+1]*a[i-2+1];
    a[i+2] = a[i-1+2]*a[i-2+2];
    a[i+3] = a[i-1+3]*a[i-2+3];
  }
}

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
	subl	$28, %esp
	call	"L1$pb"
"L1$pb":
	popl	%eax
	cmpl	$0, 32(%esp)
	je	LBB1_2	# cond_true
LBB1_1:	# return
	# ...
	addl	$28, %esp
	ret
LBB1_2:	# cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html
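
The kind of switch this targets, sketched (today the scattered case values end
up as a jump table or a tree of compares):

int classify(unsigned tag) {
  switch (tag) {            /* sparse, widely scattered case values */
  case 17:      return 1;
  case 4096:    return 2;
  case 80000:   return 3;
  case 1000000: return 4;
  default:      return 0;
  }
}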

//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations.  On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
  for (i = 0; i < 256; i++)
    a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
  for (i = 0; i < 256; i++)
    a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar.  On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable.  For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:
	pushl	%esi
	movl	8(%esp), %eax
	movb	_m_HotKey+3, %cl
	movb	_m_HotKey+4, %dl
	movb	_m_HotKey+2, %ch
	movw	_m_HotKey, %si
	movw	%si, (%eax)
	movb	%ch, 2(%eax)
	movb	%cl, 3(%eax)
	movb	%dl, 4(%eax)
	popl	%esi
	ret	$4

GCC produces:

__Z9GetHotKeyv:
	movl	_m_HotKey, %edx
	movl	4(%esp), %eax
	movl	%edx, (%eax)
	movzwl	_m_HotKey+4, %edx
	movw	%dx, 4(%eax)
	ret	$4

The LLVM IR contains the needed alignment info, so we should be able to
merge the loads and stores into 4-byte loads:

	%struct.THotKey = type { i16, i8, i8, i8 }
define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
...
	%tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
	%tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
	%tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
	%tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2

Alternatively, we should use a small amount of base-offset alias analysis
to make it so the scheduler doesn't need to hold all the loads in regs at
once.

//===---------------------------------------------------------------------===//

We should extend parameter attributes to capture more information about
pointer parameters for alias analysis.  Some ideas:

1. Add a "nocapture" attribute, which indicates that the callee does not store
   the address of the parameter into a global or any other memory location
   visible to the callee.  This can be used to make basicaa and other analyses
   more powerful.  It is true for things like memcpy, strcat, and many other
   things, including structs passed by value, most C++ references, etc.
2. Generalize readonly to be set on parameters.  This is important mod/ref
   info for the function, which is important for basicaa and others.  It can
   also be used by the inliner to avoid inserting a memcpy for byval
   arguments when the function is inlined.

These properties can be inferred by various analysis passes such as the
globalsmodrefaa pass.  Note that getting #2 right is actually really tricky.
Consider this code:

struct S;  S G;
void caller(S byvalarg) { G.field = 1; ... }
void callee() { caller(G); }

The fact that the caller does not modify byval arg is not enough; we need
to know that it doesn't modify G either.  This is very tricky.

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

This GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34043
contains a testcase that compiles down to:

	%struct.XMM128 = type { <4 x float> }
..
	%src = alloca %struct.XMM128
..
	%tmp6263 = bitcast %struct.XMM128* %src to <2 x i64>*
	%tmp65 = getelementptr %struct.XMM128* %src, i32 0, i32 0
	store <2 x i64> %tmp5899, <2 x i64>* %tmp6263, align 16
	%tmp66 = load <4 x float>* %tmp65, align 16
	%tmp71 = add <4 x float> %tmp66, %tmp66

If the mid-level optimizer turned the bitcast of pointer + store of tmp5899
into a bitcast of the vector value and a store to the pointer, then the
store->load could be easily removed.

//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,1,1,1,1,1,1,1};
  foo(input);
}

We currently compile this into a memcpy from a global array since the
initializer is fairly large and not memset'able.  This is good, but the memcpy
gets lowered to load/stores in the code generator.  This is also ok, except
that the codegen lowering for memcpy doesn't handle the case when the source
is a constant global.  This gives us atrocious code like this:

	call	"L1$pb"
"L1$pb":
	popl	%eax
	movl	_C.0.1444-"L1$pb"+32(%eax), %ecx
	movl	%ecx, 40(%esp)
	movl	_C.0.1444-"L1$pb"+20(%eax), %ecx
	movl	%ecx, 28(%esp)
	movl	_C.0.1444-"L1$pb"+36(%eax), %ecx
	movl	%ecx, 44(%esp)
	movl	_C.0.1444-"L1$pb"+44(%eax), %ecx
	movl	%ecx, 52(%esp)
	movl	_C.0.1444-"L1$pb"+40(%eax), %ecx
	movl	%ecx, 48(%esp)
	movl	_C.0.1444-"L1$pb"+12(%eax), %ecx
	movl	%ecx, 20(%esp)
	movl	_C.0.1444-"L1$pb"+4(%eax), %ecx
...

instead of:
	movl	$1, 16(%esp)
	movl	$0, 20(%esp)
	movl	$1, 24(%esp)
	movl	$0, 28(%esp)
	movl	$1, 32(%esp)
	movl	$0, 36(%esp)
	...

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef".  Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop.  One trivial example is:

#include <stdio.h>
int main() {
    int nRet = 17;
    int nLoop;
    for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
        if ( nLoop & 1 )
            nRet += 2;
        else
            nRet -= 1;
    }
    return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size.  The resultant code would then also be suitable for
exit value computation.

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc, x86,
etc.  On X86, we miss a bunch of 'rotate by variable' cases because the rotate
matching code in dag combine doesn't look through truncates aggressively
enough.  Here are some testcases reduced from GCC PR17886:

unsigned long long f(unsigned long long x, int y) {
  return (x << y) | (x >> 64-y);
}
unsigned f2(unsigned x, int y){
  return (x << y) | (x >> 32-y);
}
unsigned long long f3(unsigned long long x){
  int y = 9;
  return (x << y) | (x >> 64-y);
}
unsigned f4(unsigned x){
  int y = 10;
  return (x << y) | (x >> 32-y);
}
unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

On X86-64, we only handle f3/f4 right.  On x86-32, several of these
generate truly horrible code, instead of using shld and friends.  On
ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
badness.  PPC64 misses f, f5 and f6.  CellSPU aborts in isel.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together.  For
example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy.  This can only
be done safely if "b" isn't modified between the strlen and memcpy of course.
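
A sketch of the pattern (note the copy length has to cover the NUL terminator
for a strict strcpy replacement):

#include <string.h>

void dup_string(char *a, const char *b) {
  size_t n = strlen(b);
  /* Nothing writes b between the strlen and the memcpy, so this is
     equivalent to strcpy(a, b). */
  memcpy(a, b, n + 1);
}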

//===---------------------------------------------------------------------===//

We should be able to evaluate this loop:

int test(int x_offs) {
  while (x_offs > 4)
     x_offs -= 4;
  return x_offs;
}

//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
 return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

//===---------------------------------------------------------------------===//

We generate a horrible libcall for llvm.powi.  For example, we compile:

#include <cmath>
double f(double a) { return std::pow(a, 4); }

into:

__Z1fd:
	subl	$12, %esp
	movsd	16(%esp), %xmm0
	movsd	%xmm0, (%esp)
	movl	$4, 8(%esp)
	call	L___powidf2$stub
	addl	$12, %esp
	ret

GCC produces:

__Z1fd:
	subl	$12, %esp
	movsd	16(%esp), %xmm0
	mulsd	%xmm0, %xmm0
	mulsd	%xmm0, %xmm0
	movsd	%xmm0, (%esp)
	fldl	(%esp)
	addl	$12, %esp
	ret

//===---------------------------------------------------------------------===//

We compile this program: (from GCC PR11680)
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

Into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%	0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%	0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

We miss some instcombines for stuff like this:
void bar (void);
void foo (unsigned int a) {
  /* This one is equivalent to a >= (3 << 2).  */
  if ((a >> 2) >= 3)
    bar ();
}

A few other related ones are in GCC PR14753.

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper).  Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

I think this basically amounts to a dag combine to simplify comparisons against
multiply hi's into a comparison against the mullo.
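
For reference, the mul-lo form of the n % 3 == 0 test (the standard
modular-inverse trick for odd divisors; constants shown are for 32-bit
unsigned):

#include <stdint.h>

int divisible_by_3(uint32_t n) {
  /* 0xAAAAAAAB is the multiplicative inverse of 3 mod 2^32 and
     0x55555555 == (2^32 - 1) / 3, so this is equivalent to n % 3 == 0. */
  return n * 0xAAAAAAABu <= 0x55555555u;
}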

//===---------------------------------------------------------------------===//

SROA is not promoting the union on the stack in this example; we should end
up with no allocas.

union vec2d {
    double e[2];
    double v __attribute__((vector_size(16)));
};
typedef union vec2d vec2d;

static vec2d a={{1,2}}, b={{3,4}};

vec2d foo () {
  return (vec2d){ .v = a.v + b.v * (vec2d){{5,5}}.v };
}

//===---------------------------------------------------------------------===//