Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent.

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)
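
A minimal sketch of the intended rewrite at the C level (function names are
hypothetical; the front-end would really emit llvm.sqrt directly):

#include <math.h>

double dist(double x, double y) {
  return hypot(x, y);          /* library call; may set errno */
}

/* What the front-end could emit instead under -ffast-math: */
double dist_fast(double x, double y) {
  return sqrt(x * x + y * y);  /* maps onto llvm.sqrt, no errno */
}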

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in cases where it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack, but that is a short-term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
   x = 1ULL << i;
}

into:
   long long tmp = 1;
   for (i = ...; ++i, tmp+=tmp)
     x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn: X*X*X*X -> t=(X*X) (t*t) to eliminate a multiply.
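
A minimal C sketch of the before/after (function names are hypothetical):

unsigned pow4(unsigned x) {
  return x * x * x * x;    /* three multiplies as written */
}

unsigned pow4_reassoc(unsigned x) {
  unsigned t = x * x;      /* what reassociate should produce */
  return t * t;            /* two multiplies */
}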

//===---------------------------------------------------------------------===//

An interesting testcase for add/shift/mul reassociation:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

Reassociate should handle the example in GCC PR16157.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) { return memcmp(j, l, 4); }
int h(int *j, int *l) { return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1, 2, 4, and 8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works but can be overly conservative, as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should add 'unaligned load/store' nodes, and produce them from code like
this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2      # cond_next
LBB16_1:        # cond_true
        incl _foo
LBB16_2:        # cond_next

emit:
        movl _foo, %eax
        cmpl $1, %edi
        sbbl $-1, %eax
        movl %eax, _foo
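
For reference, a C sketch of the kind of source pattern involved (foo and x
are hypothetical names):

extern int foo;

void maybe_inc(int x) {
  if (x)
    foo++;        /* equivalent to foo += (x != 0); lowers to the cmp/sbb above */
}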

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
  double sincos(double x, double *sin, double *cos);
  float sincosf(float x, float *sin, float *cos);
  long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687
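
A rough C sketch of the expansion direction, assuming a GNU-style sincos
(needs _GNU_SOURCE); use() is a placeholder consumer:

#define _GNU_SOURCE           /* for sincos, a GNU extension */
#include <math.h>

void use(double s, double c); /* placeholder */

/* As written: one sincos call, results returned through pointers. */
void before(double x) {
  double s, c;
  sincos(x, &s, &c);
  use(s, c);
}

/* After expansion into plain sin/cos calls and stores, the address-taken
   locals disappear and SROA/mem2reg can promote s and c to SSA values. */
void after(double x) {
  double s = sin(x);
  double c = cos(x);
  use(s, c);
}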

//===---------------------------------------------------------------------===//

Scalar Repl cannot currently promote this testcase to 'ret long cst':

        %struct.X = type { i32, i32 }
        %struct.Y = type { %struct.X }

define i64 @bar() {
        %retval = alloca %struct.Y, align 8
        %tmp12 = getelementptr %struct.Y* %retval, i32 0, i32 0, i32 0
        store i32 0, i32* %tmp12
        %tmp15 = getelementptr %struct.Y* %retval, i32 0, i32 0, i32 1
        store i32 1, i32* %tmp15
        %retval.upgrd.1 = bitcast %struct.Y* %retval to i64*
        %retval.upgrd.2 = load i64* %retval.upgrd.1
        ret i64 %retval.upgrd.2
}

It should be extended to do so.

//===---------------------------------------------------------------------===//

-scalarrepl should promote this to be a vector scalar.

        %struct..0anon = type { <4 x float> }

define void @test1(<4 x float> %V, float* %P) {
        %u = alloca %struct..0anon, align 16
        %tmp = getelementptr %struct..0anon* %u, i32 0, i32 0
        store <4 x float> %V, <4 x float>* %tmp
        %tmp1 = bitcast %struct..0anon* %u to [4 x float]*
        %tmp.upgrd.1 = getelementptr [4 x float]* %tmp1, i32 0, i32 1
        %tmp.upgrd.2 = load float* %tmp.upgrd.1
        %tmp3 = mul float %tmp.upgrd.2, 2.000000e+00
        store float %tmp3, float* %P
        ret void
}

//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

void %test(uint* %P) {
        %tmp = load uint* %P
        %tmp14 = or uint %tmp, 3305111552
        %tmp15 = and uint %tmp14, 3321888767
        store uint %tmp15, uint* %P
        ret void
}

//===---------------------------------------------------------------------===//

dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.

Compile:

int bar(int x)
{
  int t = __builtin_clz(x);
  return -(t>>5);
}

to:

_bar:   addic r3,r3,-1
        subfe r3,r3,r3
        blr

//===---------------------------------------------------------------------===//

Legalize should lower ctlz (and cttz) using popcnt on targets that have popcnt
but not ctlz/cttz.  Itanium, what else?

  ctlz(x) = popcnt(~(x | x>>1 | x>>2 | x>>4 | x>>8 | x>>16))   (for i32)
  cttz(x) = popcnt((x-1) & ~x)
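
A small C sketch of these i32 lowerings, using the GCC/Clang builtin to stand
in for the popcnt node:

#include <stdint.h>

unsigned ctlz32(uint32_t x) {
  /* Smear the highest set bit downward, then count the zero bits above it. */
  x |= x >> 1;  x |= x >> 2;  x |= x >> 4;  x |= x >> 8;  x |= x >> 16;
  return __builtin_popcount(~x);             /* 32 when x == 0 */
}

unsigned cttz32(uint32_t x) {
  /* (x-1) & ~x is a mask of the bits below the lowest set bit. */
  return __builtin_popcount((x - 1) & ~x);   /* 32 when x == 0 */
}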

//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
        {
          /* Flip the target bit of each basis state */
          reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }

Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be
just so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias reg->node[i],
but alas...

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine:

unsigned int swap_32(unsigned int v) {
  v = ((v & 0x00ff00ffU) << 8)  | ((v & 0xff00ff00U) >> 8);
  v = ((v & 0x0000ffffU) << 16) | ((v & 0xffff0000U) >> 16);
  return v;
}

Nor is this (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X, C1), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.
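
A hypothetical example of the kind of compare-of-division construct meant here
(not taken from hash/hash2 itself):

int in_bucket(int x) {      /* assumes x is known non-negative here */
  return (x / 10) == 5;     /* sdiv feeding an icmp; could become the
                               range check (unsigned)(x - 50) < 10 */
}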

//===---------------------------------------------------------------------===//

Instcombine misses several of these cases (see the testcase in the patch):
http://gcc.gnu.org/ml/gcc-patches/2006-10/msg01519.html

//===---------------------------------------------------------------------===//

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level.  We need a "loops to memcpy"
pass.
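
A hedged sketch of the kind of copy loop meant here; the names and sizes are
made up, not taken from viterbi itself:

#include <string.h>

#define NSTATES 64

void save_history(int dst[NSTATES], const int src[NSTATES]) {
  /* As written in the source: an element-by-element copy loop... */
  for (int i = 0; i < NSTATES; ++i)
    dst[i] = src[i];
  /* ...which a "loops to memcpy" pass should turn into:
       memcpy(dst, src, NSTATES * sizeof(int));                  */
}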

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means the function
uses one more register when effective_addr2 is declared U64 than when it is
declared U32.

//===---------------------------------------------------------------------===//

Promote for i32 bswap can use i64 bswap + shr.  Useful on targets with 64-bit
regs and bswap, like Itanium.
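
A minimal C sketch of the identity this relies on (using the GCC/Clang builtin
for the i64 bswap):

#include <stdint.h>

uint32_t bswap32_via_64(uint32_t x) {
  /* Zero-extend to 64 bits, byte-swap, and shift the result back down:
     bswap32(x) == bswap64((uint64_t)x) >> 32. */
  return (uint32_t)(__builtin_bswap64((uint64_t)x) >> 32);
}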

//===---------------------------------------------------------------------===//

LSR should know what GPR types a target has.  This code:

volatile short X, Y; // globals

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

produces two identical IV's (after promotion) on PPC/ARM:

LBB1_1: @bb.preheader
        mov r3, #0
        mov r2, r3
        mov r1, r3
LBB1_2: @bb
        ldr r12, LCPI1_0
        ldr r12, [r12]
        strh r2, [r12]
        ldr r12, LCPI1_1
        ldr r12, [r12]
        strh r3, [r12]
        add r1, r1, #1    <- [0,+,1]
        add r3, r3, #4
        add r2, r2, #1    <- [0,+,1]
        cmp r1, r0
        bne LBB1_2 @bb

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-upgrade < %s | llvm-as | opt -tailcallelim | llvm-dis | not grep call

int %t4(int %a) {
entry:
        %tmp.1 = and int %a, 1
        %tmp.2 = cast int %tmp.1 to bool
        br bool %tmp.2, label %then.0, label %else.0

then.0:
        %tmp.5 = add int %a, -1
        %tmp.3 = call int %t4( int %tmp.5 )
        br label %return

else.0:
        %tmp.7 = setne int %a, 0
        br bool %tmp.7, label %then.1, label %return

then.1:
        %tmp.11 = add int %a, -2
        %tmp.9 = call int %t4( int %tmp.11 )
        br label %return

return:
        %result.0 = phi int [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                            [ %tmp.9, %then.1 ]
        ret int %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination is not transforming this function, because it is
returning n, which fails the isDynamicConstant check in the accumulator
recursion checks.

long long fib(const long long n) {
  switch(n) {
    case 0:
    case 1:
      return n;
    default:
      return fib(n-1) + fib(n-2);
  }
}

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-upgrade < %s | llvm-as | opt -argpromotion | llvm-dis | grep x.val

implementation   ; Functions:

internal int %foo(int* %x) {
entry:
        %tmp = load int* %x
        %tmp.foo = call int %foo(int* %x)
        ret int %tmp.foo
}

int %bar(int* %x) {
entry:
        %tmp3 = call int %foo( int* %x )                ; <int> [#uses=1]
        ret int %tmp3
}

//===---------------------------------------------------------------------===//

"basicaa" should know how to look through "or" instructions that act like add
instructions.  For example in this code, the x*4+1 is turned into x*4 | 1, and
basicaa can't analyze the array subscript, leading to duplicated loads in the
generated code:

void test(int X, int Y, int a[]) {
  int i;
  for (i=2; i<1000; i+=4) {
    a[i+0] = a[i-1+0]*a[i-2+0];
    a[i+1] = a[i-1+1]*a[i-2+1];
    a[i+2] = a[i-1+2]*a[i-2+2];
    a[i+3] = a[i-1+3]*a[i-2+3];
  }
}

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

We compile this to:
_foo:
        subl $28, %esp
        call "L1$pb"
"L1$pb":
        popl %eax
        cmpl $0, 32(%esp)
        je LBB1_2       # cond_true
LBB1_1: # return
        # ...
        addl $28, %esp
        ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

Whole-function isel would also handle these cases.

//===---------------------------------------------------------------------===//