//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend.
//===---------------------------------------------------------------------===//

Reimplement 'select' in terms of 'SEL'.

* We would really like to support UXTAB16, but we need to prove that the
  add doesn't need to overflow between the two 16-bit chunks (see the sketch
  after this list).

* Implement predication support.
* Implement pre/post increment support. (e.g. PR935)
* Coalesce stack slots!
* Implement smarter constant generation for binops with large immediates.

* Consider materializing FP constants like 0.0f and 1.0f using integer
  immediate instructions, then copying to the FPU. Slower than a load into
  the FPU?
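
As a sketch of why that proof matters, here is a plain C model of what
UXTAB16 computes (a hypothetical helper, ignoring the optional rotation):

/* C model (sketch) of UXTAB16: two independent 16-bit adds whose second
   operands are the zero-extended low bytes of each half of 'b'. */
unsigned uxtab16_model(unsigned a, unsigned b) {
  unsigned lo = ((a & 0xffffu) + (b & 0xffu)) & 0xffffu;
  unsigned hi = ((a >> 16) + ((b >> 16) & 0xffu)) & 0xffffu;
  return (hi << 16) | lo;
}

To match a plain 32-bit add against this, we must prove that the low-halfword
sum cannot carry into bit 16; otherwise the single 32-bit add and the two
independent 16-bit adds give different results.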

//===---------------------------------------------------------------------===//

The constant island pass is in good shape. Some cleanups might be desirable,
but there is unlikely to be much improvement in the generated code.

1. There may be some advantage to trying to be smarter about the initial
placement, rather than putting everything at the end.

2. The handling of 2-byte padding for Thumb is overly conservative. There
would be a small gain to keeping accurate track of the padding (which would
require aligning functions containing constant pools to 4-byte boundaries).

3. There might be some compile-time efficiency to be had by representing
consecutive islands as a single block rather than multiple blocks.

//===---------------------------------------------------------------------===//

We need to start generating predicated instructions. The .td files have a way
to express this now (see the PPC conditional return instruction), but the
branch folding pass (or a new if-cvt pass) should start producing these, at
least in the trivial case.

Among the obvious wins, doing so can eliminate the need to custom expand
copysign (i.e. we won't need to custom expand it to get the conditional
negate).

This allows us to eliminate one instruction from:

define i32 @_Z6slow4bii(i32 %x, i32 %y) {
        %tmp = icmp sgt i32 %x, %y
        %retval = select i1 %tmp, i32 %x, i32 %y
        ret i32 %retval
}

__Z6slow4bii:
        cmp r0, r1
        movgt r1, r0
        mov r0, r1
        bx lr

//===---------------------------------------------------------------------===//

Implement long long "X-3" with instructions that fold the immediate in. These
were disabled due to badness with the ARM carry flag on subtracts.
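
For example (a sketch, assuming the usual ARM calling convention with the low
word of the i64 in r0 and the high word in r1), this:

long long sub3(long long x) { return x - 3; }

should fold the immediate directly into the subtract-with-borrow pair:

        subs r0, r0, #3
        sbc  r1, r1, #0
        bx   lr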

//===---------------------------------------------------------------------===//

We currently compile abs:
int foo(int p) { return p < 0 ? -p : p; }

into:

_foo:
        rsb r1, r0, #0
        cmn r0, #1
        movgt r1, r0
        mov r0, r1
        bx lr

This is very, uh, literal. This could be a 3 operation sequence:
  t = (p sra 31);
  res = (p xor t) - t

Which would be better. This occurs in png decode.
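
The same sequence in C, for reference (a sketch; it relies on >> of a negative
int being an arithmetic shift, which is what ARM targets provide even though C
leaves it implementation-defined):

int iabs(int p) {
  int t = p >> 31;       /* 0 if p >= 0, -1 if p < 0 */
  return (p ^ t) - t;    /* identity for t == 0, two's-complement negate for t == -1 */
}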

//===---------------------------------------------------------------------===//

More load / store optimizations:
1) Look past instructions without side-effects (not load, store, branch, etc.)
   when forming the list of loads / stores to optimize.

2) Smarter register allocation?
We are probably missing some opportunities to use ldm / stm. Consider:

ldr r5, [r0]
ldr r4, [r0, #4]

This cannot be merged into an ldm. Perhaps we will need to do the
transformation before register allocation. Then teach the register allocator
to allocate a chunk of consecutive registers.

3) Better representation for block transfer? This is from Olden/power:

        fldd d0, [r4]
        fstd d0, [r4, #+32]
        fldd d0, [r4, #+8]
        fstd d0, [r4, #+40]
        fldd d0, [r4, #+16]
        fstd d0, [r4, #+48]
        fldd d0, [r4, #+24]
        fstd d0, [r4, #+56]

If we can spare the registers, it would be better to use fldm and fstm here.
Need major register allocator enhancement though.

4) Can we recognize the relative position of constantpool entries? i.e. Treat

        ldr r0, LCPI17_3
        ldr r1, LCPI17_4
        ldr r2, LCPI17_5

   as
        ldr r0, LCPI17
        ldr r1, LCPI17+4
        ldr r2, LCPI17+8

   Then the ldr's can be combined into a single ldm. See Olden/power.

Note that for ARM v4, gcc uses ldmia to load a pair of 32-bit values to
represent a double 64-bit FP constant:

        adr r0, L6
        ldmia r0, {r0-r1}

        .align 2
L6:
        .long -858993459
        .long 1074318540

5) Can we make use of ldrd and strd? Instead of generating ldm / stm, use
ldrd/strd if there are only two destination registers that form an even/odd
pair. However, we would probably pay a penalty if the address is not aligned
on an 8-byte boundary. This requires more information on load / store nodes
(and MIs?) than we currently carry.

6) Struct copies appear to be done field by field
instead of by words, at least sometimes:

struct foo { int x; short s; char c1; char c2; };
void cpy(struct foo *a, struct foo *b) { *a = *b; }

llvm code (-O2)
        ldrb r3, [r1, #+6]
        ldr r2, [r1]
        ldrb r12, [r1, #+7]
        ldrh r1, [r1, #+4]
        str r2, [r0]
        strh r1, [r0, #+4]
        strb r3, [r0, #+6]
        strb r12, [r0, #+7]
gcc code (-O2)
        ldmia r1, {r1-r2}
        stmia r0, {r1-r2}

In this benchmark, poor handling of aggregate copies has shown up as
having a large effect on size, and possibly speed as well (we don't have
a good way to measure on ARM).

//===---------------------------------------------------------------------===//

* Consider this silly example:

double bar(double x) {
  double r = foo(3.1);
  return x+r;
}

_bar:
        sub sp, sp, #16
        str r4, [sp, #+12]
        str r5, [sp, #+8]
        str lr, [sp, #+4]
        mov r4, r0
        mov r5, r1
        ldr r0, LCPI2_0
        bl _foo
        fmsr f0, r0
        fcvtsd d0, f0
        fmdrr d1, r4, r5
        faddd d0, d0, d1
        fmrrd r0, r1, d0
        ldr lr, [sp, #+4]
        ldr r5, [sp, #+8]
        ldr r4, [sp, #+12]
        add sp, sp, #16
        bx lr

Ignore the prologue and epilogue stuff for a second. Note
        mov r4, r0
        mov r5, r1
the copies to callee-save registers and the fact that they are only used by
the fmdrr instruction. It would have been better had the fmdrr been scheduled
before the call, placing the result in a callee-save DPR register. The two
mov ops would not have been necessary.

//===---------------------------------------------------------------------===//

Calling convention related stuff:

* gcc's parameter passing implementation is terrible and we suffer as a result:

e.g.
struct s {
  double d1;
  int s1;
};

void foo(struct s S) {
  printf("%g, %d\n", S.d1, S.s1);
}

'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and
then reloads them to r1, r2, and r3 before issuing the call (r0 contains the
address of the format string):

        stmfd sp!, {r7, lr}
        add r7, sp, #0
        sub sp, sp, #12
        stmia sp, {r0, r1, r2}
        ldmia sp, {r1-r2}
        ldr r0, L5
        ldr r3, [sp, #8]
L2:
        add r0, pc, r0
        bl L_printf$stub

Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves?

* Returning an aggregate type is even worse:

e.g.
struct s foo(void) {
  struct s S = {1.1, 2};
  return S;
}

        mov ip, r0
        ldr r0, L5
        sub sp, sp, #12
L2:
        add r0, pc, r0
        @ lr needed for prologue
        ldmia r0, {r0, r1, r2}
        stmia sp, {r0, r1, r2}
        stmia ip, {r0, r1, r2}
        mov r0, ip
        add sp, sp, #12
        bx lr

r0 (and later ip) is the hidden parameter from the caller giving the address
in which to store the return value. The first ldmia loads the constants into
r0, r1, r2. The last stmia stores r0, r1, r2 into the address passed in.
However, there is one additional stmia that stores r0, r1, and r2 to some
stack location. That store is dead.

The llvm-gcc generated code looks like this:

csretcc void %foo(%struct.s* %agg.result) {
entry:
        %S = alloca %struct.s, align 4          ; <%struct.s*> [#uses=1]
        %memtmp = alloca %struct.s              ; <%struct.s*> [#uses=1]
        cast %struct.s* %S to sbyte*            ; <sbyte*>:0 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
        cast %struct.s* %agg.result to sbyte*   ; <sbyte*>:1 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
        cast %struct.s* %memtmp to sbyte*       ; <sbyte*>:2 [#uses=1]
        call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
        ret void
}

llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from the
constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated
into a number of loads and stores, or 2) custom lower memcpy (of small size)
to be ldmia / stmia. I think option 2 is better, but the current register
allocator cannot allocate a chunk of registers at a time.

A feasible temporary solution is to use specific physical registers at
lowering time for small (<= 4 words?) transfer sizes.

* The ARM CSRet calling convention requires the hidden argument to be returned
by the callee.

//===---------------------------------------------------------------------===//

We can definitely do a better job on BB placement to eliminate some branches.
It's very common to see llvm-generated assembly code that looks like this:

LBB3:
        ...
LBB4:
        ...
        beq LBB3
        b LBB2

If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can
then eliminate the beq and turn the unconditional branch to LBB2 into a bne.

See McCat/18-imp/ComputeBoundingBoxes for an example.

//===---------------------------------------------------------------------===//

Register scavenging is now implemented. The example in the previous version
of this document produces optimal code at -O2.

//===---------------------------------------------------------------------===//

Pre-/post-indexed loads / stores:

1) We should not make the pre-/post-indexed load/store transform if the base
ptr is guaranteed to be live beyond the load/store. This can happen if the
base ptr is live out of the block in which we are performing the optimization,
e.g.:

mov r1, r2
ldr r3, [r1], #4
...

vs.

ldr r3, [r2]
add r1, r2, #4
...

In most cases, this is just a wasted optimization. However, sometimes it can
negatively impact performance because two-address code is more restrictive
when it comes to scheduling.

Unfortunately, liveout information is currently unavailable during DAG combine
time.

2) Consider splitting an indexed load / store into a pair of add/sub +
   load/store to solve #1 (in TwoAddressInstructionPass.cpp).

3) Enhance LSR to generate more opportunities for indexed ops.

4) Once we have added support for multiple result patterns, write indexed load
   patterns instead of C++ instruction selection code.

5) Use FLDM / FSTM to emulate indexed FP load / store.
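
A sketch of what is meant by item 5 (pre-UAL VFP mnemonics assumed; the
writeback form bumps the base register by the transfer size, so a one-register
FLDMIAS behaves like a post-indexed FP load):

        flds s0, [r0]
        add  r0, r0, #4

could instead be the single instruction:

        fldmias r0!, {s0}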

//===---------------------------------------------------------------------===//

We should add i64 support to take advantage of the 64-bit loads / stores.
We can add a pseudo i64 register class containing pseudo registers that are
register pairs. All other ops (e.g. add, sub) would be expanded as usual.

We need to add pseudo instructions (e.g. gethi / getlo) to extract i32
registers from the i64 register. These are single moves which can be
eliminated if the destination register is a sub-register of the source. We
should implement proper subreg support in the register allocator to coalesce
these away.
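
As an illustration (a sketch only; the ldrd syntax, register assignments,
even/odd pair constraint, and 8-byte alignment are assumptions):

long long deref(long long *p) { return *p; }

could be selected as a single paired load plus the two extraction moves,

        ldrd r2, [r0]        @ loads r2 (low word) and r3 (high word)
        mov  r0, r2          @ getlo
        mov  r1, r3          @ gethi
        bx   lr

where the two movs are exactly the copies that subreg coalescing should remove
once the pair can be allocated directly to r0/r1.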
356
357There are other minor issues such as multiple instructions for a spill / restore
358/ move.
359
360//===---------------------------------------------------------------------===//
361
Implement support for some more tricky ways to materialize immediates. For
example, to get 0xffff8000, we can use:

mov r9, #&3f8000
sub r9, r9, #&400000

//===---------------------------------------------------------------------===//

We sometimes generate multiple add / sub instructions to update sp in prologue
and epilogue if the inc / dec value is too large to fit in a single immediate
operand. In some cases, perhaps it might be better to load the value from a
constantpool instead.
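
A sketch of the trade-off (the label is hypothetical; 4104 = 0x1008 does not
fit in a single rotated-immediate field, while 4096 and 8 each do):

        sub sp, sp, #4096
        sub sp, sp, #8

versus a constant-pool load into a scratch register:

        ldr r12, .LCPI_frame    @ .LCPI_frame: .long 4104 (hypothetical entry)
        sub sp, sp, r12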

//===---------------------------------------------------------------------===//

GCC generates significantly better code for this function:

int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
  int i = 0;

  if (StackPtr != 0) {
    while (StackPtr != 0 && i < (((LineLen) < (32768)) ? (LineLen) : (32768)))
      Line[i++] = Stack[--StackPtr];
    if (LineLen > 32768) {
      while (StackPtr != 0 && i < LineLen) {
        i++;
        --StackPtr;
      }
    }
  }
  return StackPtr;
}

//===---------------------------------------------------------------------===//

This should compile to the mlas instruction:
int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }
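
Something along these lines (a sketch; register choices are illustrative, and
mlas sets the N and Z flags from the multiply-accumulate result, so the
comparison against zero comes for free):

        mlas  r3, r0, r1, r2    @ r3 = r0 * r1 + r2, setting flags
        mov   r0, #13
        movmi r0, #7
        bx    lr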

//===---------------------------------------------------------------------===//

At some point, we should triage these to see if they still apply to us:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663

http://www.inf.u-szeged.hu/gcc-arm/
http://citeseer.ist.psu.edu/debus04linktime.html

//===---------------------------------------------------------------------===//

gcc generates smaller code for this function at -O2 or -Os:

void foo(signed char* p) {
  if (*p == 3)
    bar();
  else if (*p == 4)
    baz();
  else if (*p == 5)
    quux();
}

llvm decides it's a good idea to turn the repeated if...else into a
binary tree, as if it were a switch; the resulting code requires one fewer
compare-and-branch when *p<=2 or *p==5, the same number when *p==4
or *p>6, and one more when *p==3. So it should be a speed win
(on balance). However, the revised code is larger, with 4 conditional
branches instead of 3.

More seriously, there is a byte->word extend before
each comparison, where there should be only one, and the condition codes
are not remembered when the same two values are compared twice.

//===---------------------------------------------------------------------===//

More register scavenging work:

1. Use the register scavenger to track frame indices materialized into
   registers (those that do not fit in addressing modes) to allow reuse in
   the same BB.
2. Finish scavenging for Thumb.
3. We know some spills and restores are unnecessary. The issue is that once
   live intervals are merged, they are never split. So every def is spilled
   and every use requires a restore if the register allocator decides the
   resulting live interval is not assigned a physical register. It may be
   possible (with the help of the scavenger) to turn some spill / restore
   pairs into register copies.

//===---------------------------------------------------------------------===//

Teach LSR about ARM addressing modes.
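
For example (a sketch), in a loop like

void zero(int *a, int n) {
  for (int i = 0; i != n; ++i)
    a[i] = 0;
}

ARM can fold the address computation into the memory operation itself, either
as a scaled index (str r2, [r0, r3, lsl #2]) or as a post-increment
(str r2, [r0], #4), so LSR should prefer induction variables that map onto
these modes rather than ones that need a separate add to form each address.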