Blame - lib/Target/README.txt - fp2-dev/platform/external/llvm

blob: faa351a51b486f20b023eb92c4fb7c03c98eff5a

Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	1	Target Independent Opportunities:
				2
				3	//===---------------------------------------------------------------------===//
				4
Chris Lattner	0124421	2008-01-07 07:46:23 +0000	[diff] [blame]	5	We should make the various targets' "IMPLICIT_DEF" instructions be a single
				6	target-independent opcode like TargetInstrInfo::INLINEASM. This would allow
				7	us to eliminate the TargetInstrDesc::isImplicitDef() method, and would allow
				8	us to avoid having to define this for every target for every register class.
				9
				10	//===---------------------------------------------------------------------===//
				11
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	12	With the recent changes to make the implicit def/use set explicit in
				13	machineinstrs, we should change the target descriptions for 'call' instructions
				14	so that the .td files don't list all the call-clobbered registers as implicit
				15	defs. Instead, these should be added by the code generator (e.g. on the dag).
				16
				17	This has a number of uses:
				18
				19	1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
				20	for their different impdef sets.
				21	2. Targets with multiple calling convs (e.g. x86) which have different clobber
				22	sets don't need copies of call instructions.
				23	3. 'Interprocedural register allocation' can be done to reduce the clobber sets
				24	of calls.
				25
				26	//===---------------------------------------------------------------------===//
				27
				28	Make the PPC branch selector target independent.
				29
				30	//===---------------------------------------------------------------------===//
				31
				32	Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
				33	precision don't matter (-ffast-math). Misc/mandel will like this. :)
				34
				35	//===---------------------------------------------------------------------===//
				36
				37	Solve this DAG isel folding deficiency:
				38
				39	int X, Y;
				40
				41	void fn1(void)
				42	{
				43	X = X | (Y << 3);
				44	}
				45
				46	compiles to
				47
				48	fn1:
				49	movl Y, %eax
				50	shll $3, %eax
				51	orl X, %eax
				52	movl %eax, X
				53	ret
				54
				55	The problem is the store's chain operand is not the load X but rather
				56	a TokenFactor of the load X and load Y, which prevents the folding.
				57
				58	There are two ways to fix this:
				59
				60	1. The dag combiner can start using alias analysis to realize that X and Y
				61	don't alias, making the store to X not dependent on the load from Y.
				62	2. The generated isel could be made smarter in the case it can't
				63	disambiguate the pointers.
				64
				65	Number 1 is the preferred solution.
				66
				67	This has been "fixed" by a TableGen hack. But that is a short term workaround
				68	which will be removed once the proper fix is made.
				69
				70	//===---------------------------------------------------------------------===//
				71
				72	On targets with expensive 64-bit multiply, we could LSR this:
				73
				74	for (i = ...; ++i) {
				75	x = 1ULL << i; }
				76
				77	into:
				78	long long tmp = 1;
				79	for (i = ...; ++i, tmp+=tmp)
				80	x = tmp;
				81
				82	This would be a win on ppc32, but not x86 or ppc64.
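A minimal C sketch of the two loop shapes, written against a hypothetical output array so the results can be compared (the array and function names are illustrative assumptions, not part of the note above). The second form replaces the variable 64-bit shift with one 64-bit add per iteration:

```c
#include <stdint.h>

/* Direct form: one variable 64-bit shift per iteration. */
void fill_shift(uint64_t *out, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        out[i] = 1ULL << i;
}

/* Strength-reduced form: keep a running value and double it each
   iteration with a single 64-bit add (tmp += tmp). */
void fill_doubling(uint64_t *out, unsigned n) {
    uint64_t tmp = 1;
    for (unsigned i = 0; i < n; ++i, tmp += tmp)
        out[i] = tmp;
}
```

Both functions assume n <= 64 so that the shift amount stays in range.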
				83
				84	//===---------------------------------------------------------------------===//
				85
				86	Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)
				87
				88	//===---------------------------------------------------------------------===//
				89
				90	Reassociate should turn: X*X*X*X -> t=(X*X) (t*t) to eliminate a multiply.
				91
				92	//===---------------------------------------------------------------------===//
				93
				94	Interesting? testcase for add/shift/mul reassoc:
				95
				96	int bar(int x, int y) {
				97	return x*x*x+y+x*x*x*x*x*y*y*y*y;
				98	}
				99	int foo(int z, int n) {
				100	return bar(z, n) + bar(2*z, 2*n);
				101	}
				102
				103	Reassociate should handle the example in GCC PR16157.
				104
				105	//===---------------------------------------------------------------------===//
				106
				107	These two functions should generate the same code on big-endian systems:
				108
				109	int g(int *j,int *l) { return memcmp(j,l,4); }
				110	int h(int *j, int *l) { return *j - *l; }
				111
				112	this could be done in SelectionDAGISel.cpp, along with other special cases,
				113	for 1,2,4,8 bytes.
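The idea can be sketched in portable C by building the value a big-endian 4-byte load would produce and comparing it as an unsigned word. This is only an illustration: load_be32, cmp4_memcmp and cmp4_word are hypothetical helpers, and the results are normalized to -1/0/1 because memcmp only guarantees the sign of its result (a plain subtraction of the words could also overflow).

```c
#include <stdint.h>
#include <string.h>

/* Assemble four bytes into the value a big-endian 32-bit load would yield. */
static uint32_t load_be32(const unsigned char *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Three-way result (-1, 0, 1) of a 4-byte memcmp. */
int cmp4_memcmp(const void *a, const void *b) {
    int r = memcmp(a, b, 4);
    return (r > 0) - (r < 0);
}

/* The same answer from a single unsigned compare of the two words. */
int cmp4_word(const void *a, const void *b) {
    uint32_t x = load_be32(a), y = load_be32(b);
    return (x > y) - (x < y);
}
```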
				114
				115	//===---------------------------------------------------------------------===//
				116
				117	It would be nice to revert this patch:
				118	http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html
				119
				120	And teach the dag combiner enough to simplify the code expanded before
				121	legalize. It seems plausible that this knowledge would let it simplify other
				122	stuff too.
				123
				124	//===---------------------------------------------------------------------===//
				125
				126	For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
				127	to the type size. It works but can be overly conservative as the alignment of
				128	specific vector types is target dependent.
				129
				130	//===---------------------------------------------------------------------===//
				131
				132	We should add 'unaligned load/store' nodes, and produce them from code like
				133	this:
				134
				135	v4sf example(float *P) {
				136	return (v4sf){P[0], P[1], P[2], P[3] };
				137	}
				138
				139	//===---------------------------------------------------------------------===//
				140
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	141	Add support for conditional increments, and other related patterns. Instead
				142	of:
				143
				144	movl 136(%esp), %eax
				145	cmpl $0, %eax
				146	je LBB16_2 #cond_next
				147	LBB16_1: #cond_true
				148	incl _foo
				149	LBB16_2: #cond_next
				150
				151	emit:
				152	movl _foo, %eax
				153	cmpl $1, %edi
				154	sbbl $-1, %eax
				155	movl %eax, _foo
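At the source level the pattern looks roughly like the following C sketch (function names are illustrative assumptions). The branch-free form adds the 0/1 result of the comparison, which is what the cmp/sbb pair computes through the carry flag:

```c
/* Branching form, matching the first assembly listing. */
int inc_if_branch(int foo, int cond) {
    if (cond != 0)
        foo++;
    return foo;
}

/* Branch-free form: add the 0/1 value of the comparison directly. */
int inc_if_branchless(int foo, int cond) {
    return foo + (cond != 0);
}
```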
				156
				157	//===---------------------------------------------------------------------===//
				158
				159	Combine: a = sin(x), b = cos(x) into a,b = sincos(x).
				160
				161	Expand these to calls of sin/cos and stores:
				162	void sincos(double x, double *sin, double *cos);
				163	void sincosf(float x, float *sin, float *cos);
				164	void sincosl(long double x, long double *sin, long double *cos);
				165
				166	Doing so could allow SROA of the destination pointers. See also:
				167	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687
				168
				169	//===---------------------------------------------------------------------===//
				170
				171	Scalar Repl cannot currently promote this testcase to 'ret long cst':
				172
				173	%struct.X = type { i32, i32 }
				174	%struct.Y = type { %struct.X }
				175
				176	define i64 @bar() {
				177	%retval = alloca %struct.Y, align 8
				178	%tmp12 = getelementptr %struct.Y* %retval, i32 0, i32 0, i32 0
				179	store i32 0, i32* %tmp12
				180	%tmp15 = getelementptr %struct.Y* %retval, i32 0, i32 0, i32 1
				181	store i32 1, i32* %tmp15
				182	%retval.upgrd.1 = bitcast %struct.Y* %retval to i64*
				183	%retval.upgrd.2 = load i64* %retval.upgrd.1
				184	ret i64 %retval.upgrd.2
				185	}
				186
				187	it should be extended to do so.
				188
				189	//===---------------------------------------------------------------------===//
				190
				191	-scalarrepl should promote this to be a vector scalar.
				192
				193	%struct..0anon = type { <4 x float> }
				194
				195	define void @test1(<4 x float> %V, float* %P) {
				196	%u = alloca %struct..0anon, align 16
				197	%tmp = getelementptr %struct..0anon* %u, i32 0, i32 0
				198	store <4 x float> %V, <4 x float>* %tmp
				199	%tmp1 = bitcast %struct..0anon* %u to [4 x float]*
				200	%tmp.upgrd.1 = getelementptr [4 x float]* %tmp1, i32 0, i32 1
				201	%tmp.upgrd.2 = load float* %tmp.upgrd.1
				202	%tmp3 = mul float %tmp.upgrd.2, 2.000000e+00
				203	store float %tmp3, float* %P
				204	ret void
				205	}
				206
				207	//===---------------------------------------------------------------------===//
				208
				209	Turn this into a single byte store with no load (the other 3 bytes are
				210	unmodified):
				211
				212	void %test(uint* %P) {
				213	%tmp = load uint* %P
				214	%tmp14 = or uint %tmp, 3305111552
				215	%tmp15 = and uint %tmp14, 3321888767
				216	store uint %tmp15, uint* %P
				217	ret void
				218	}
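The two constants are 0xC5000000 and 0xC5FFFFFF, so the or/and pair leaves the low three bytes alone and forces the top byte to 0xC5. A small C sketch of that equivalence (function names are illustrative):

```c
#include <stdint.h>

/* The load/or/and/store arithmetic from the IR above. */
uint32_t via_or_and(uint32_t x) {
    return (x | 3305111552u) & 3321888767u;  /* 0xC5000000, 0xC5FFFFFF */
}

/* Equivalent: keep the low three bytes and set the top byte to 0xC5. */
uint32_t via_byte_set(uint32_t x) {
    return (x & 0x00FFFFFFu) | 0xC5000000u;
}
```

Since the result never depends on the old top byte, a single byte store of 0xC5 to that byte is enough.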
				219
				220	//===---------------------------------------------------------------------===//
				221
				222	dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.
				223
				224	Compile:
				225
				226	int bar(int x)
				227	{
				228	int t = __builtin_clz(x);
				229	return -(t>>5);
				230	}
				231
				232	to:
				233
				234	_bar: addic r3,r3,-1
				235	subfe r3,r3,r3
				236	blr
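The identity can be checked with a portable C sketch. clz32 is a hypothetical helper that returns 32 for a zero input; __builtin_clz(0) is undefined in GCC, so it is deliberately not used here:

```c
#include <stdint.h>

/* Count leading zeros of a 32-bit value, with clz32(0) == 32. */
static int clz32(uint32_t x) {
    int n = 0;
    for (uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
        ++n;
    return n;
}

/* clz32 is in [0, 32], so shifting right by 5 yields 1 only for 32. */
int is_zero_via_clz(uint32_t x) { return clz32(x) >> 5; }

/* The simplified form. */
int is_zero_direct(uint32_t x)  { return x == 0; }
```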
				237
				238	//===---------------------------------------------------------------------===//
				239
				240	Legalize should lower cttz like this:
				241	cttz(x) = popcnt((x-1) & ~x)
				242
				243	on targets that have popcnt but not cttz. itanium, what else?
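Note that (x-1) & ~x sets exactly the bits below the lowest set bit of x, so its population count is the number of trailing zero bits. A portable C sketch that checks this (popcount32 and cttz32 are hypothetical reference helpers, not LLVM APIs):

```c
#include <stdint.h>

/* Reference population count: clear the lowest set bit until none remain. */
static int popcount32(uint32_t x) {
    int n = 0;
    while (x != 0) {
        x &= x - 1;
        ++n;
    }
    return n;
}

/* Reference trailing-zero count, defined as 32 for x == 0. */
static int cttz32(uint32_t x) {
    int n = 0;
    if (x == 0)
        return 32;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++n;
    }
    return n;
}

/* The lowering from the note: popcnt((x-1) & ~x). */
int cttz_via_popcount(uint32_t x) {
    return popcount32((x - 1u) & ~x);
}
```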
				244
				245	//===---------------------------------------------------------------------===//
				246
				247	quantum_sigma_x in 462.libquantum contains the following loop:
				248
				249	for(i=0; i<reg->size; i++)
				250	{
				251	/* Flip the target bit of each basis state */
				252	reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
				253	}
				254
				255	Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be just
				256	so cool to turn it into something like:
				257
				258	long long Res = ((MAX_UNSIGNED) 1 << target);
				259	if (target < 32) {
				260	for(i=0; i<reg->size; i++)
				261	reg->node[i].state ^= Res & 0xFFFFFFFFULL;
				262	} else {
				263	for(i=0; i<reg->size; i++)
				264	reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
				265	}
				266
				267	... which would only do one 32-bit XOR per loop iteration instead of two.
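A simplified C sketch of the split, using a plain array of 64-bit states in place of reg->node[i].state (that simplification and the function names are assumptions made for illustration). Each specialized loop touches only one 32-bit half of every element:

```c
#include <stddef.h>
#include <stdint.h>

/* Original shape: a full 64-bit XOR on every iteration. */
void flip_full(uint64_t *state, size_t n, unsigned target) {
    for (size_t i = 0; i < n; i++)
        state[i] ^= (uint64_t)1 << target;
}

/* Unswitched shape: the mask is loop invariant, and in each branch
   only one 32-bit half of it can be non-zero. */
void flip_split(uint64_t *state, size_t n, unsigned target) {
    uint64_t res = (uint64_t)1 << target;
    if (target < 32) {
        for (size_t i = 0; i < n; i++)
            state[i] ^= res & 0xFFFFFFFFULL;          /* low half only */
    } else {
        for (size_t i = 0; i < n; i++)
            state[i] ^= res & 0xFFFFFFFF00000000ULL;  /* high half only */
    }
}
```

Both functions assume target < 64.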
				268
				269	It would also be nice to recognize the reg->size doesn't alias reg->node[i], but
				270	alas...
				271
				272	//===---------------------------------------------------------------------===//
				273
Chris Lattner	909c72c	2008-10-05 02:16:12 +0000	[diff] [blame]	274	This isn't recognized as bswap by instcombine (yes, it really is bswap):
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	275
				276	unsigned long reverse(unsigned v) {
				277	unsigned t;
				278	t = v ^ ((v << 16) | (v >> 16));
				279	t &= ~0xff0000;
				280	v = (v << 24) | (v >> 8);
				281	return v ^ (t >> 8);
				282	}
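To see that the function really is a byte swap, it can be compared against a straightforward reference in C. This sketch uses uint32_t instead of unsigned/unsigned long to pin the width to 32 bits, and bswap32_ref is a hypothetical reference helper:

```c
#include <stdint.h>

/* The idiom from the note, with explicit 32-bit types. */
uint32_t reverse32(uint32_t v) {
    uint32_t t;
    t = v ^ ((v << 16) | (v >> 16));  /* v ^ rotate(v, 16) */
    t &= ~(uint32_t)0xff0000;
    v = (v << 24) | (v >> 8);         /* rotate right by 8 */
    return v ^ (t >> 8);
}

/* Reference byte swap, one byte at a time. */
uint32_t bswap32_ref(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}
```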
				283
				284	//===---------------------------------------------------------------------===//
				285
				286	These should turn into single 16-bit (unaligned?) loads on little/big endian
				287	processors.
				288
				289	unsigned short read_16_le(const unsigned char *adr) {
				290	return adr[0] | (adr[1] << 8);
				291	}
				292	unsigned short read_16_be(const unsigned char *adr) {
				293	return (adr[0] << 8) | adr[1];
				294	}
				295
				296	//===---------------------------------------------------------------------===//
				297
				298	-instcombine should handle this transform:
				299	icmp pred (sdiv X / C1 ), C2
				300	when X, C1, and C2 are unsigned. Similarly for udiv and signed operands.
				301
				302	Currently InstCombine avoids this transform but will do it when the signs of
				303	the operands and the sign of the divide match. See the FIXME in
				304	InstructionCombining.cpp in the visitSetCondInst method after the switch case
				305	for Instruction::UDiv (around line 4447) for more details.
				306
				307	The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
				308	this construct.
				309
				310	//===---------------------------------------------------------------------===//
				311
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	312	viterbi speeds up significantly if the various "history" related copy loops
				313	are turned into memcpy calls at the source level. We need a "loops to memcpy"
				314	pass.
				315
				316	//===---------------------------------------------------------------------===//
				317
				318	Consider:
				319
				320	typedef unsigned U32;
				321	typedef unsigned long long U64;
				322	int test (U32 *inst, U64 *regs) {
				323	U64 effective_addr2;
				324	U32 temp = *inst;
				325	int r1 = (temp >> 20) & 0xf;
				326	int b2 = (temp >> 16) & 0xf;
				327	effective_addr2 = temp & 0xfff;
				328	if (b2) effective_addr2 += regs[b2];
				329	b2 = (temp >> 12) & 0xf;
				330	if (b2) effective_addr2 += regs[b2];
				331	effective_addr2 &= regs[4];
				332	if ((effective_addr2 & 3) == 0)
				333	return 1;
				334	return 0;
				335	}
				336
				337	Note that only the low 2 bits of effective_addr2 are used. On 32-bit systems,
				338	we don't eliminate the computation of the top half of effective_addr2 because
				339	we don't have whole-function selection dags. On x86, this means we use one
				340	extra register for the function when effective_addr2 is declared as U64 than
				341	when it is declared U32.
				342
				343	//===---------------------------------------------------------------------===//
				344
				345	Promote for i32 bswap can use i64 bswap + shr. Useful on targets with 64-bit
				346	regs and bswap, like itanium.
				347
				348	//===---------------------------------------------------------------------===//
				349
				350	LSR should know what GPR types a target has. This code:
				351
				352	volatile short X, Y; // globals
				353
				354	void foo(int N) {
				355	int i;
				356	for (i = 0; i < N; i++) { X = i; Y = i*4; }
				357	}
				358
				359	produces two identical IV's (after promotion) on PPC/ARM:
				360
				361	LBB1_1: @bb.preheader
				362	mov r3, #0
				363	mov r2, r3
				364	mov r1, r3
				365	LBB1_2: @bb
				366	ldr r12, LCPI1_0
				367	ldr r12, [r12]
				368	strh r2, [r12]
				369	ldr r12, LCPI1_1
				370	ldr r12, [r12]
				371	strh r3, [r12]
				372	add r1, r1, #1 <- [0,+,1]
				373	add r3, r3, #4
				374	add r2, r2, #1 <- [0,+,1]
				375	cmp r1, r0
				376	bne LBB1_2 @bb
				377
				378
				379	//===---------------------------------------------------------------------===//
				380
				381	Tail call elim should be more aggressive, checking to see if the call is
				382	followed by an uncond branch to an exit block.
				383
				384	; This testcase is due to tail-duplication not wanting to copy the return
				385	; instruction into the terminating blocks because there was other code
				386	; optimized out of the function after the taildup happened.
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	387	; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	388
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	389	define i32 @t4(i32 %a) {
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	390	entry:
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	391	%tmp.1 = and i32 %a, 1 ; <i32> [#uses=1]
				392	%tmp.2 = icmp ne i32 %tmp.1, 0 ; <i1> [#uses=1]
				393	br i1 %tmp.2, label %then.0, label %else.0
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	394
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	395	then.0: ; preds = %entry
				396	%tmp.5 = add i32 %a, -1 ; <i32> [#uses=1]
				397	%tmp.3 = call i32 @t4( i32 %tmp.5 ) ; <i32> [#uses=1]
				398	br label %return
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	399
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	400	else.0: ; preds = %entry
				401	%tmp.7 = icmp ne i32 %a, 0 ; <i1> [#uses=1]
				402	br i1 %tmp.7, label %then.1, label %return
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	403
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	404	then.1: ; preds = %else.0
				405	%tmp.11 = add i32 %a, -2 ; <i32> [#uses=1]
				406	%tmp.9 = call i32 @t4( i32 %tmp.11 ) ; <i32> [#uses=1]
				407	br label %return
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	408
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	409	return: ; preds = %then.1, %else.0, %then.0
				410	%result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	411	[ %tmp.9, %then.1 ]
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	412	ret i32 %result.0
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	413	}
				414
				415	//===---------------------------------------------------------------------===//
				416
Chris Lattner	00159fc	2007-10-03 06:10:59 +0000	[diff] [blame]	417	Tail recursion elimination is not transforming this function, because it is
				418	returning n, which fails the isDynamicConstant check in the accumulator
				419	recursion checks.
				420
				421	long long fib(const long long n) {
				422	switch(n) {
				423	case 0:
				424	case 1:
				425	return n;
				426	default:
				427	return fib(n-1) + fib(n-2);
				428	}
				429	}
				430
				431	//===---------------------------------------------------------------------===//
				432
Chris Lattner	7be8ac2	2008-08-10 00:47:21 +0000	[diff] [blame]	433	Tail recursion elimination should handle:
				434
				435	int pow2m1(int n) {
				436	if (n == 0)
				437	return 0;
				438	return 2 * pow2m1 (n - 1) + 1;
				439	}
				440
				441	Also, multiplies can be turned into SHL's, so they should be handled as if
				442	they were associative. "return foo() << 1" can be tail recursion eliminated.
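One way to picture the transformed function is the C sketch below. The work pending after each recursive call is the affine map r -> 2*r + 1; composing these maps lets the recursion collapse into a loop with two accumulators. The accumulator names and the loop form are illustrative assumptions, not the output of any existing pass:

```c
/* Recursive form from the note (n is assumed non-negative). */
int pow2m1(int n) {
    if (n == 0)
        return 0;
    return 2 * pow2m1(n - 1) + 1;
}

/* Loop form: the pending work is tracked as r -> mul * r + add.
   Wrapping it around r -> 2*r + 1 gives (2*mul) * r + (mul + add). */
int pow2m1_loop(int n) {
    int mul = 1, add = 0;
    while (n != 0) {
        add = add + mul;
        mul = 2 * mul;
        n = n - 1;
    }
    return mul * 0 + add;  /* apply the pending map to the base case, 0 */
}
```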
				443
				444	//===---------------------------------------------------------------------===//
				445
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	446	Argument promotion should promote arguments for recursive functions, like
				447	this:
				448
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	449	; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	450
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	451	define internal i32 @foo(i32* %x) {
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	452	entry:
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	453	%tmp = load i32* %x ; <i32> [#uses=0]
				454	%tmp.foo = call i32 @foo( i32* %x ) ; <i32> [#uses=1]
				455	ret i32 %tmp.foo
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	456	}
				457
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	458	define i32 @bar(i32* %x) {
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	459	entry:
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	460	%tmp3 = call i32 @foo( i32* %x ) ; <i32> [#uses=1]
				461	ret i32 %tmp3
Dan Gohman	f17a25c	2007-07-18 16:29:46 +0000	[diff] [blame]	462	}
				463
Chris Lattner	421a733	2007-12-05 23:05:06 +0000	[diff] [blame]	464	//===---------------------------------------------------------------------===//
Chris Lattner	072ab75	2007-12-28 04:42:05 +0000	[diff] [blame]	465
				466	"basicaa" should know how to look through "or" instructions that act like add
				467	instructions. For example in this code, the x*4+1 is turned into x*4 | 1, and
				468	basicaa can't analyze the array subscript, leading to duplicated loads in the
				469	generated code:
				470
				471	void test(int X, int Y, int a[]) {
				472	int i;
				473	for (i=2; i<1000; i+=4) {
				474	a[i+0] = a[i-1+0]*a[i-2+0];
				475	a[i+1] = a[i-1+1]*a[i-2+1];
				476	a[i+2] = a[i-1+2]*a[i-2+2];
				477	a[i+3] = a[i-1+3]*a[i-2+3];
				478	}
				479	}
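The reason the "or" behaves like an add is that the low two bits of x*4 are always zero, so adding a constant below 4 can never carry. A tiny C sketch of that fact (function names are illustrative):

```c
/* Index computed with an add. */
unsigned index_add(unsigned x) { return x * 4 + 1; }

/* Same index computed with an or: bits 0 and 1 of x*4 are zero,
   so or-ing in 1 produces no carry and matches the add. */
unsigned index_or(unsigned x)  { return (x * 4) | 1; }
```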
				480
Chris Lattner	fe7fe91	2007-12-28 22:30:05 +0000	[diff] [blame]	481	//===---------------------------------------------------------------------===//
Chris Lattner	072ab75	2007-12-28 04:42:05 +0000	[diff] [blame]	482
Chris Lattner	fe7fe91	2007-12-28 22:30:05 +0000	[diff] [blame]	483	We should investigate an instruction sinking pass. Consider this silly
				484	example in pic mode:
				485
				486	#include <assert.h>
				487	void foo(int x) {
				488	assert(x);
				489	//...
				490	}
				491
				492	we compile this to:
				493	_foo:
				494	subl $28, %esp
				495	call "L1$pb"
				496	"L1$pb":
				497	popl %eax
				498	cmpl $0, 32(%esp)
				499	je LBB1_2 # cond_true
				500	LBB1_1: # return
				501	# ...
				502	addl $28, %esp
				503	ret
				504	LBB1_2: # cond_true
				505	...
				506
				507	The PIC base computation (call+popl) is only used on one path through the
				508	code, but is currently always computed in the entry block. It would be
				509	better to sink the picbase computation down into the block for the
				510	assertion, as it is the only one that uses it. This happens for a lot of
				511	code with early outs.
				512
Chris Lattner	be9fe9d	2007-12-29 01:05:01 +0000	[diff] [blame]	513	Another example is loads of arguments, which are usually emitted into the
				514	entry block on targets like x86. If not used in all paths through a
				515	function, they should be sunk into the ones that do.
				516
Chris Lattner	fe7fe91	2007-12-28 22:30:05 +0000	[diff] [blame]	517	In this case, whole-function-isel would also handle this.
Chris Lattner	072ab75	2007-12-28 04:42:05 +0000	[diff] [blame]	518
				519	//===---------------------------------------------------------------------===//
Chris Lattner	8551f19	2008-01-07 21:38:14 +0000	[diff] [blame]	520
				521	Investigate lowering of sparse switch statements into perfect hash tables:
				522	http://burtleburtle.net/bob/hash/perfect.html
				523
				524	//===---------------------------------------------------------------------===//
Chris Lattner	dc089f0	2008-01-09 00:17:57 +0000	[diff] [blame]	525
				526	We should turn things like "load+fabs+store" and "load+fneg+store" into the
				527	corresponding integer operations. On a yonah, this loop:
				528
				529	double a[256];
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	530	void foo() {
				531	int i, b;
				532	for (b = 0; b < 10000000; b++)
				533	for (i = 0; i < 256; i++)
				534	a[i] = -a[i];
				535	}
Chris Lattner	dc089f0	2008-01-09 00:17:57 +0000	[diff] [blame]	536
				537	is twice as slow as this loop:
				538
				539	long long a[256];
Chris Lattner	d78823e	2008-02-18 18:46:39 +0000	[diff] [blame]	540	void foo() {
				541	int i, b;
				542	for (b = 0; b < 10000000; b++)
				543	for (i = 0; i < 256; i++)
				544	a[i] ^= (1ULL << 63);
				545	}
Chris Lattner	dc089f0	2008-01-09 00:17:57 +0000	[diff] [blame]	546
				547	and I suspect other processors are similar. On X86 in particular this is a
				548	big win because doing this with integers allows the use of read/modify/write
				549	instructions.
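A C sketch of the fneg case on a single value, assuming IEEE-754 binary64 doubles (true on the targets discussed here, but an assumption nonetheless). memcpy is used for the bit reinterpretation to stay within defined behavior:

```c
#include <stdint.h>
#include <string.h>

/* Floating-point negation. */
double neg_fp(double d) { return -d; }

/* Integer negation: flip the sign bit (bit 63) of the representation. */
double neg_int(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    bits ^= 1ULL << 63;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

fabs would be the same sketch with `bits &= ~(1ULL << 63)` in place of the XOR.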
				550
				551	//===---------------------------------------------------------------------===//
Chris Lattner	a909d31	2008-01-10 18:25:41 +0000	[diff] [blame]	552
				553	DAG Combiner should try to combine small loads into larger loads when
				554	profitable. For example, we compile this C++ example:
				555
				556	struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
				557	extern THotKey m_HotKey;
				558	THotKey GetHotKey () { return m_HotKey; }
				559
				560	into (-O3 -fno-exceptions -static -fomit-frame-pointer):
				561
				562	__Z9GetHotKeyv:
				563	pushl %esi
				564	movl 8(%esp), %eax
				565	movb _m_HotKey+3, %cl
				566	movb _m_HotKey+4, %dl
				567	movb _m_HotKey+2, %ch
				568	movw _m_HotKey, %si
				569	movw %si, (%eax)
				570	movb %ch, 2(%eax)
				571	movb %cl, 3(%eax)
				572	movb %dl, 4(%eax)
				573	popl %esi
				574	ret $4
				575
				576	GCC produces:
				577
				578	__Z9GetHotKeyv:
				579	movl _m_HotKey, %edx
				580	movl 4(%esp), %eax
				581	movl %edx, (%eax)
				582	movzwl _m_HotKey+4, %edx
				583	movw %dx, 4(%eax)
				584	ret $4
				585
				586	The LLVM IR contains the needed alignment info, so we should be able to
				587	merge the loads and stores into 4-byte loads:
				588
				589	%struct.THotKey = type { i16, i8, i8, i8 }
				590	define void @_Z9GetHotKeyv(%struct.THotKey* sret %agg.result) nounwind {
				591	...
				592	%tmp2 = load i16* getelementptr (@m_HotKey, i32 0, i32 0), align 8
				593	%tmp5 = load i8* getelementptr (@m_HotKey, i32 0, i32 1), align 2
				594	%tmp8 = load i8* getelementptr (@m_HotKey, i32 0, i32 2), align 1
				595	%tmp11 = load i8* getelementptr (@m_HotKey, i32 0, i32 3), align 2
				596
				597	Alternatively, we should use a small amount of base-offset alias analysis
				598	to make it so the scheduler doesn't need to hold all the loads in regs at
				599	once.
				600
				601	//===---------------------------------------------------------------------===//
Chris Lattner	71538b5	2008-01-11 06:17:47 +0000	[diff] [blame]	602
				603	We should extend parameter attributes to capture more information about
				604	pointer parameters for alias analysis. Some ideas:
				605
				606	1. Add a "nocapture" attribute, which indicates that the callee does not store
				607	the address of the parameter into a global or any other memory location
				608	visible to the callee. This can be used to make basicaa and other analyses
				609	more powerful. It is true for things like memcpy, strcat, and many other
				610	things, including structs passed by value, most C++ references, etc.
				611	2. Generalize readonly to be set on parameters. This is important mod/ref
				612	info for the function, which is important for basicaa and others. It can
				613	also be used by the inliner to avoid inserting a memcpy for byval
				614	arguments when the function is inlined.
				615
				616	These attributes can be inferred by various analysis passes such as the
Chris Lattner	de99a23	2008-01-12 18:58:46 +0000	[diff] [blame]	617	globalsmodrefaa pass. Note that getting #2 right is actually really tricky.
				618	Consider this code:
				619
				620	struct S; S G;
				621	void caller(S byvalarg) { G.field = 1; ... }
				622	void callee() { caller(G); }
				623
				624	The fact that the caller does not modify byval arg is not enough, we need
				625	to know that it doesn't modify G either. This is very tricky.
Chris Lattner	71538b5	2008-01-11 06:17:47 +0000	[diff] [blame]	626
				627	//===---------------------------------------------------------------------===//
Nate Begeman	a3c7ba3	2008-02-18 18:39:23 +0000	[diff] [blame]	628
				629	We should add an FRINT node to the DAG to model targets that have legal
				630	implementations of ceil/floor/rint.
Chris Lattner	a672d3d	2008-02-28 05:34:27 +0000	[diff] [blame]	631
				632	//===---------------------------------------------------------------------===//
				633
Chris Lattner	61075f3	2008-02-28 17:21:27 +0000	[diff] [blame]	634	This GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34043
				635	contains a testcase that compiles down to:
				636
				637	%struct.XMM128 = type { <4 x float> }
				638	..
				639	%src = alloca %struct.XMM128
				640	..
				641	%tmp6263 = bitcast %struct.XMM128* %src to <2 x i64>*
				642	%tmp65 = getelementptr %struct.XMM128* %src, i32 0, i32 0
				643	store <2 x i64> %tmp5899, <2 x i64>* %tmp6263, align 16
				644	%tmp66 = load <4 x float>* %tmp65, align 16
				645	%tmp71 = add <4 x float> %tmp66, %tmp66
				646
				647	If the mid-level optimizer turned the bitcast of pointer + store of tmp5899
				648	into a bitcast of the vector value and a store to the pointer, then the
				649	store->load could be easily removed.
				650
				651	//===---------------------------------------------------------------------===//
				652
Chris Lattner	a672d3d	2008-02-28 05:34:27 +0000	[diff] [blame]	653	Consider:
				654
				655	int test() {
				656	long long input[8] = {1,1,1,1,1,1,1,1};
				657	foo(input);
				658	}
				659
				660	We currently compile this into a memcpy from a global array since the
				661	initializer is fairly large and not memset'able. This is good, but the memcpy
				662	gets lowered to load/stores in the code generator. This is also ok, except
				663	that the codegen lowering for memcpy doesn't handle the case when the source
				664	is a constant global. This gives us atrocious code like this:
				665
				666	call "L1$pb"
				667	"L1$pb":
				668	popl %eax
				669	movl _C.0.1444-"L1$pb"+32(%eax), %ecx
				670	movl %ecx, 40(%esp)
				671	movl _C.0.1444-"L1$pb"+20(%eax), %ecx
				672	movl %ecx, 28(%esp)
				673	movl _C.0.1444-"L1$pb"+36(%eax), %ecx
				674	movl %ecx, 44(%esp)
				675	movl _C.0.1444-"L1$pb"+44(%eax), %ecx
				676	movl %ecx, 52(%esp)
				677	movl _C.0.1444-"L1$pb"+40(%eax), %ecx
				678	movl %ecx, 48(%esp)
				679	movl _C.0.1444-"L1$pb"+12(%eax), %ecx
				680	movl %ecx, 20(%esp)
				681	movl _C.0.1444-"L1$pb"+4(%eax), %ecx
				682	...
				683
				684	instead of:
				685	movl $1, 16(%esp)
				686	movl $0, 20(%esp)
				687	movl $1, 24(%esp)
				688	movl $0, 28(%esp)
				689	movl $1, 32(%esp)
				690	movl $0, 36(%esp)
				691	...
				692
				693	//===---------------------------------------------------------------------===//
Chris Lattner	a709d33	2008-03-02 02:51:40 +0000	[diff] [blame]	694
				695	http://llvm.org/PR717:
				696
				697	The following code should compile into "ret int undef". Instead, LLVM
				698	produces "ret int 0":
				699
				700	int f() {
				701	int x = 4;
				702	int y;
				703	if (x == 3) y = 0;
				704	return y;
				705	}
				706
				707	//===---------------------------------------------------------------------===//
Chris Lattner	e9c6844	2008-03-02 19:29:42 +0000	[diff] [blame]	708
				709	The loop unroller should partially unroll loops (instead of peeling them)
				710	when code growth isn't too bad and when an unroll count allows simplification
				711	of some code within the loop. One trivial example is:
				712
				713	#include <stdio.h>
				714	int main() {
				715	int nRet = 17;
				716	int nLoop;
				717	for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
				718	if ( nLoop & 1 )
				719	nRet += 2;
				720	else
				721	nRet -= 1;
				722	}
				723	return nRet;
				724	}
				725
				726	Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
				727	reduction in code size. The resultant code would then also be suitable for
				728	exit value computation.
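Written out by hand in C, unrolling by 2 looks like the sketch below (function names are illustrative). Each copy of the body knows the parity of nLoop, so the test disappears and the two updates fold into a net +1 per pair of iterations:

```c
/* Original loop from the example. */
int loop_original(void) {
    int nRet = 17;
    int nLoop;
    for (nLoop = 0; nLoop < 1000; nLoop++) {
        if (nLoop & 1)
            nRet += 2;
        else
            nRet -= 1;
    }
    return nRet;
}

/* Unrolled by 2: the first copy always sees an even nLoop and the
   second an odd one, so the '&1' test is gone. */
int loop_unrolled(void) {
    int nRet = 17;
    int nLoop;
    for (nLoop = 0; nLoop < 1000; nLoop += 2) {
        nRet -= 1;  /* even iteration */
        nRet += 2;  /* odd iteration */
    }
    return nRet;
}
```

With 500 pairs each contributing +1, the exit value is 17 + 500 = 517.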
				729
				730	//===---------------------------------------------------------------------===//
Chris Lattner	962a7d5	2008-03-17 01:47:51 +0000	[diff] [blame]	731
				732	We miss a bunch of rotate opportunities on various targets, including ppc, x86,
				733	etc. On X86, we miss a bunch of 'rotate by variable' cases because the rotate
				734	matching code in dag combine doesn't look through truncates aggressively
				735	enough. Here are some testcases reduced from GCC PR17886:
				736
				737	unsigned long long f(unsigned long long x, int y) {
				738	return (x << y) | (x >> 64-y);
				739	}
				740	unsigned f2(unsigned x, int y){
				741	return (x << y) | (x >> 32-y);
				742	}
				743	unsigned long long f3(unsigned long long x){
				744	int y = 9;
				745	return (x << y) | (x >> 64-y);
				746	}
				747	unsigned f4(unsigned x){
				748	int y = 10;
				749	return (x << y) | (x >> 32-y);
				750	}
				751	unsigned long long f5(unsigned long long x, unsigned long long y) {
				752	return (x << 8) | ((y >> 48) & 0xffull);
				753	}
				754	unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
				755	switch(z) {
				756	case 1:
				757	return (x << 8) | ((y >> 48) & 0xffull);
				758	case 2:
				759	return (x << 16) | ((y >> 40) & 0xffffull);
				760	case 3:
				761	return (x << 24) | ((y >> 32) & 0xffffffull);
				762	case 4:
				763	return (x << 32) | ((y >> 24) & 0xffffffffull);
				764	default:
				765	return (x << 40) | ((y >> 16) & 0xffffffffffull);
				766	}
				767	}
				768
				769	On X86-64, we only handle f3/f4 right. On x86-32, several of these
				770	generate truly horrible code, instead of using shld and friends. On
				771	ARM, we end up with calls to L___lshrdi3/L___ashldi3 in f, which is
				772	badness. PPC64 misses f, f5 and f6. CellSPU aborts in isel.
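
For reference, here is a minimal sketch of the 32-bit rotate that f2 is meant
to become, next to a masked variant that stays well defined for every shift
count. The names rotl32_readme and rotl32_masked are made up for this
illustration, and whether a given compiler turns either form into a single
rotate instruction is not guaranteed.

```c
#include <stdint.h>

/* The pattern from f2. Only well defined for 0 < y < 32, because a shift by
   32 is undefined for a 32-bit operand. */
uint32_t rotl32_readme(uint32_t x, int y) {
  return (x << y) | (x >> (32 - y));
}

/* Masked form: both shift counts are reduced modulo 32, so y == 0 and
   y >= 32 are handled as well. */
uint32_t rotl32_masked(uint32_t x, unsigned y) {
  return (x << (y & 31)) | (x >> (-y & 31));
}
```

The two agree wherever the first one is defined.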
				773
				774	//===---------------------------------------------------------------------===//
Chris Lattner	38c4a15	2008-03-20 04:46:13 +0000	[diff] [blame]	775
				776	We do a number of simplifications in simplify libcalls to strength reduce
				777	standard library functions, but we don't currently merge them together. For
				778	example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy. This can only
				779	be done safely if "b" isn't modified between the strlen and memcpy of course.
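
A small sketch of the two forms, using hypothetical wrapper names. Note that
strcpy also writes the terminating NUL, so the byte-for-byte equivalent call
copies strlen(b)+1 bytes; the sketch uses that variant.

```c
#include <string.h>

/* Two-step form: measure b, then copy its bytes plus the terminator. */
char *copy_with_strlen(char *a, const char *b) {
  size_t n = strlen(b);
  memcpy(a, b, n + 1);
  return a;
}

/* Merged form: a single library call that walks b once. */
char *copy_with_strcpy(char *a, const char *b) {
  return strcpy(a, b);
}
```

Both leave the same contents in "a", provided "a" has room for the string and
"b" is not written to between the two calls of the first form.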
				780
				781	//===---------------------------------------------------------------------===//
				782
Chris Lattner	045fd22	2008-05-17 15:37:38 +0000	[diff] [blame]	783	We should be able to evaluate this loop:
				784
				785	int test(int x_offs) {
				786	while (x_offs > 4)
				787	x_offs -= 4;
				788	return x_offs;
				789	}
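
One possible loop-free form, as a sketch. The names test_loop and
test_closed_form are made up here, and this is only meant to show what
"evaluating the loop" could produce, not what any pass emits today.

```c
/* The loop as written above. */
int test_loop(int x_offs) {
  while (x_offs > 4)
    x_offs -= 4;
  return x_offs;
}

/* Loop-free equivalent: values <= 4 come back unchanged, larger values are
   reduced into the range 1..4. */
int test_closed_form(int x_offs) {
  if (x_offs <= 4)
    return x_offs;
  return (x_offs - 1) % 4 + 1;
}
```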
				790
				791	//===---------------------------------------------------------------------===//
Chris Lattner	40a6864	2008-07-14 00:19:59 +0000	[diff] [blame]	792
				793	Reassociate should turn things like:
				794
				795	int factorial(int X) {
				796	  return X*X*X*X*X*X*X*X;
				797	}
				798
				799	into llvm.powi calls, allowing the code generator to produce balanced
				800	multiplication trees.
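
To make "balanced multiplication tree" concrete, here is a sketch of an
eight-factor product such as X*X*X*X*X*X*X*X computed as a chain and as a
tree. The names pow8_chain and pow8_tree are hypothetical.

```c
/* Linear chain: seven multiplies, each depending on the previous one. */
int pow8_chain(int X) {
  return X * X * X * X * X * X * X * X;
}

/* Balanced tree: three multiplies (X^2, X^4, X^8). */
int pow8_tree(int X) {
  int X2 = X * X;
  int X4 = X2 * X2;
  return X4 * X4;
}
```

The two agree for any X whose eighth power fits in an int.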
				801
				802	//===---------------------------------------------------------------------===//
				803
Chris Lattner	16e192f	2008-08-10 01:14:08 +0000	[diff] [blame]	804	We generate a horrible libcall for llvm.powi. For example, we compile:
				805
				806	#include <cmath>
				807	double f(double a) { return std::pow(a, 4); }
				808
				809	into:
				810
				811	__Z1fd:
				812	subl $12, %esp
				813	movsd 16(%esp), %xmm0
				814	movsd %xmm0, (%esp)
				815	movl $4, 8(%esp)
				816	call L___powidf2$stub
				817	addl $12, %esp
				818	ret
				819
				820	GCC produces:
				821
				822	__Z1fd:
				823	subl $12, %esp
				824	movsd 16(%esp), %xmm0
				825	mulsd %xmm0, %xmm0
				826	mulsd %xmm0, %xmm0
				827	movsd %xmm0, (%esp)
				828	fldl (%esp)
				829	addl $12, %esp
				830	ret
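
The two mulsd instructions in the GCC output correspond to repeated squaring.
A sketch of that expansion in C, for a general non-negative exponent and for
the constant 4; powi_by_squaring and pow4 are hypothetical helpers, not the
actual lowering in either compiler. Results can differ in rounding from a
library pow call.

```c
/* a raised to the power n (n >= 0) by repeated squaring. */
double powi_by_squaring(double a, unsigned n) {
  double result = 1.0;
  while (n != 0) {
    if (n & 1)
      result *= a; /* this bit of the exponent is set */
    a *= a;        /* square for the next bit */
    n >>= 1;
  }
  return result;
}

/* With the exponent fixed at 4 only two multiplies remain. */
double pow4(double a) {
  double a2 = a * a;
  return a2 * a2;
}
```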
				831
				832	//===---------------------------------------------------------------------===//
				833
				834	We compile this program: (from GCC PR11680)
				835	http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487
				836
				837	Into code that runs at the same speed in fast/slow modes, but both modes run 2x
				838	slower than when compiled with GCC (either 4.0 or 4.2):
				839
				840	$ llvm-g++ perf.cpp -O3 -fno-exceptions
				841	$ time ./a.out fast
				842	1.821u 0.003s 0:01.82 100.0% 0+0k 0+0io 0pf+0w
				843
				844	$ g++ perf.cpp -O3 -fno-exceptions
				845	$ time ./a.out fast
				846	0.821u 0.001s 0:00.82 100.0% 0+0k 0+0io 0pf+0w
				847
				848	It looks like we are making the same inlining decisions, so this may be raw
				849	codegen badness or something else (haven't investigated).
				850
				851	//===---------------------------------------------------------------------===//
				852
				853	We miss some instcombines for stuff like this:
				854	void bar (void);
				855	void foo (unsigned int a) {
				856	/* This one is equivalent to a >= (3 << 2). */
				857	if ((a >> 2) >= 3)
				858	bar ();
				859	}
				860
				861	A few other related ones are in GCC PR14753.
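
The equivalence from the comment in foo, written out as two predicates. The
names cond_shifted and cond_direct are made up for this sketch.

```c
/* Condition as written in foo(): shift, then compare. */
int cond_shifted(unsigned int a) {
  return (a >> 2) >= 3;
}

/* Single compare against the pre-shifted constant: 3 << 2 == 12. */
int cond_direct(unsigned int a) {
  return a >= (3u << 2);
}
```

Since a >> 2 is floor(a / 4), it is at least 3 exactly when a is at least 12.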
				862
				863	//===---------------------------------------------------------------------===//
				864
				865	Divisibility by constant can be simplified (according to GCC PR12849) from
				866	being a mulhi to being a mul lo (cheaper). Testcase:
				867
				868	void bar(unsigned n) {
				869	if (n % 3 == 0)
				870	true();
				871	}
				872
				873	I think this basically amounts to a dag combine to simplify comparisons against
				874	multiply hi's into a comparison against the mullo.
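
A sketch of the low-multiply form for the n % 3 == 0 test, assuming a 32-bit
unsigned. The function names are hypothetical; the constants follow from 3
being odd: 0xAAAAAAAB is its inverse modulo 2^32, and 0x55555555 is the
largest quotient, (2^32 - 1) / 3.

```c
#include <stdint.h>

/* Straightforward form; a remainder by constant is typically expanded
   through a high multiply. */
int divisible_by_3_rem(uint32_t n) {
  return n % 3 == 0;
}

/* Low-multiply form: multiplying by the inverse of 3 maps each multiple
   3*k to k, which is at most 0x55555555; every other n lands above it. */
int divisible_by_3_mullo(uint32_t n) {
  return (uint32_t)(n * UINT32_C(0xAAAAAAAB)) <= UINT32_C(0x55555555);
}
```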
				875
				876	//===---------------------------------------------------------------------===//
Chris Lattner	c10e07a	2008-08-19 06:22:16 +0000	[diff] [blame]	877
				878	SROA is not promoting the union on the stack in this example; we should end
				879	up with no allocas.
				880
				881	union vec2d {
				882	double e[2];
				883	double v __attribute__((vector_size(16)));
				884	};
				885	typedef union vec2d vec2d;
				886
				887	static vec2d a={{1,2}}, b={{3,4}};
				888
				889	vec2d foo () {
				890	return (vec2d){ .v = a.v + b.v * (vec2d){{5,5}}.v };
				891	}
				892
				893	//===---------------------------------------------------------------------===//