//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend.
//===---------------------------------------------------------------------===//

Reimplement 'select' in terms of 'SEL'.

* We would really like to support UXTAB16, but we need to prove that the
add doesn't need to overflow between the two 16-bit chunks.
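
For concreteness, a minimal C sketch of the pattern involved (assuming
UXTAB16's usual semantics: a per-16-bit-lane add of zero-extended bytes 0 and
2 of the second operand); the caveat in the comment is exactly the proof
obligation mentioned above:

/* Sketch only.  UXTAB16 r0, r1, r2 computes, per 16-bit lane,
       r1[15:0]  + zext(r2[7:0])   and   r1[31:16] + zext(r2[23:16]).
   The plain 32-bit add below matches that only when the low-lane add
   cannot carry into bit 16 -- the overflow proof the note asks for. */
unsigned uxtab16_candidate(unsigned a, unsigned b) {
  return a + (b & 0x00FF00FF);
}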

* Implement pre/post increment support.  (e.g. PR935)
* Implement smarter constant generation for binops with large immediates.

A few ARMv6T2 ops should be pattern matched: BFI, SBFX, and UBFX.
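
For reference, hedged C patterns these could match (function names and field
positions are illustrative, not from the backend):

/* UBFX: unsigned bitfield extract -- (x >> lsb) & ((1 << width) - 1) */
unsigned ubfx_like(unsigned x) { return (x >> 7) & 0x1F; }

/* SBFX: signed bitfield extract of bits [11:7], written with the usual
   shift idiom compilers pattern-match (relies on arithmetic right shift
   of signed values, as gcc/llvm implement it) */
int sbfx_like(int x) { return (x << 20) >> 27; }

/* BFI: bitfield insert -- replace bits [11:4] of x with the low byte of y */
unsigned bfi_like(unsigned x, unsigned y) {
  return (x & ~0xFF0u) | ((y & 0xFF) << 4);
}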

Interesting optimization for PIC codegen on arm-linux:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43129

//===---------------------------------------------------------------------===//

Crazy idea:  Consider code that uses lots of 8-bit or 16-bit values.  By the
time regalloc happens, these values are now in a 32-bit register, usually with
the top bits known to be sign or zero extended.  If spilled, we should be able
to spill these to an 8-bit or 16-bit stack slot, zero or sign extending as part
of the reload.

Doing this reduces the size of the stack frame (important for thumb etc), and
also increases the likelihood that we will be able to reload multiple values
from the stack with a single load.
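
A hedged sketch of the kind of function this targets; every local fits in 16
bits, so 2-byte spill slots (strh to spill, ldrsh as the sign-extending part
of the reload) would halve the frame:

/* Illustrative only: if register pressure forces some of these shorts to
   spill, each could live in a 2-byte stack slot instead of a 4-byte one. */
short spill_pressure(const short *a) {
  short s0 = a[0], s1 = a[1], s2 = a[2], s3 = a[3];
  short s4 = a[4], s5 = a[5], s6 = a[6], s7 = a[7];
  extern void keep_alive(void);   /* hypothetical call to force spills */
  keep_alive();
  return (short)(s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7);
}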

//===---------------------------------------------------------------------===//

The constant island pass is in good shape.  Some cleanups might be desirable,
but there is unlikely to be much improvement in the generated code.

1.  There may be some advantage to trying to be smarter about the initial
placement, rather than putting everything at the end.

2.  There might be some compile-time efficiency to be had by representing
consecutive islands as a single block rather than multiple blocks.

3.  Use a priority queue to sort constant pool users in inverse order of
position so we always process the one closest to the end of the function
first. This may simplify CreateNewWater.

//===---------------------------------------------------------------------===//

Eliminate copysign custom expansion. We are still generating crappy code with
default expansion + if-conversion.
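
For context, a hedged sketch of what copysign actually computes; it is pure
bit manipulation on the sign bit, so good code is just a couple of bitwise
ops once the values are in integer registers:

#include <stdint.h>
#include <string.h>

/* A sketch of copysign(x, y) for doubles: magnitude bits of x, sign bit
   of y.  memcpy is the portable way to move between a double and its
   64-bit representation. */
double copysign_sketch(double x, double y) {
  uint64_t xb, yb;
  memcpy(&xb, &x, 8);
  memcpy(&yb, &y, 8);
  xb = (xb & ~(1ull << 63)) | (yb & (1ull << 63));
  memcpy(&x, &xb, 8);
  return x;
}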

//===---------------------------------------------------------------------===//

Eliminate one instruction from:

define i32 @_Z6slow4bii(i32 %x, i32 %y) {
        %tmp = icmp sgt i32 %x, %y
        %retval = select i1 %tmp, i32 %x, i32 %y
        ret i32 %retval
}

__Z6slow4bii:
        cmp r0, r1
        movgt r1, r0
        mov r0, r1
        bx lr

=>

__Z6slow4bii:
        cmp r0, r1
        movle r0, r1
        bx lr

//===---------------------------------------------------------------------===//

Implement long long "X-3" with instructions that fold the immediate in.  These
were disabled due to badness with the ARM carry flag on subtracts.
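
The case in question, as a hedged example; the point is that the constant 3
should fold into the subs/sbc pair rather than being materialized in a
register first:

/* Ideally this lowers to roughly:
       subs r0, r0, #3
       sbc  r1, r1, #0
   (ARM's carry flag is a NOT-borrow on subtracts, which is the "badness"
   referred to above.) */
long long sub3(long long x) { return x - 3; }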

//===---------------------------------------------------------------------===//

More load / store optimizations:

1) Better representation for block transfer? This is from Olden/power:

        fldd d0, [r4]
        fstd d0, [r4, #+32]
        fldd d0, [r4, #+8]
        fstd d0, [r4, #+40]
        fldd d0, [r4, #+16]
        fstd d0, [r4, #+48]
        fldd d0, [r4, #+24]
        fstd d0, [r4, #+56]

If we can spare the registers, it would be better to use fldm and fstm here.
That needs a major register allocator enhancement, though.

2) Can we recognize the relative position of constantpool entries? i.e. treat

        ldr r0, LCPI17_3
        ldr r1, LCPI17_4
        ldr r2, LCPI17_5

as

        ldr r0, LCPI17
        ldr r1, LCPI17+4
        ldr r2, LCPI17+8

Then the ldr's can be combined into a single ldm. See Olden/power.

Note that for ARM v4, gcc uses ldmia to load a pair of 32-bit values
representing a 64-bit double FP constant:

        adr     r0, L6
        ldmia   r0, {r0-r1}

        .align 2
L6:
        .long   -858993459
        .long   1074318540

3) Struct copies appear to be done field by field
instead of by words, at least sometimes:

struct foo { int x; short s; char c1; char c2; };
void cpy(struct foo *a, struct foo *b) { *a = *b; }

llvm code (-O2):
        ldrb r3, [r1, #+6]
        ldr r2, [r1]
        ldrb r12, [r1, #+7]
        ldrh r1, [r1, #+4]
        str r2, [r0]
        strh r1, [r0, #+4]
        strb r3, [r0, #+6]
        strb r12, [r0, #+7]
gcc code (-O2):
        ldmia   r1, {r1-r2}
        stmia   r0, {r1-r2}

In this benchmark, poor handling of aggregate copies has shown up as having a
large effect on size, and possibly on speed as well (we don't have a good way
to measure on ARM).

//===---------------------------------------------------------------------===//

* Consider this silly example:

double bar(double x) {
  double r = foo(3.1);
  return x+r;
}

_bar:
        stmfd sp!, {r4, r5, r7, lr}
        add r7, sp, #8
        mov r4, r0
        mov r5, r1
        fldd d0, LCPI1_0
        fmrrd r0, r1, d0
        bl _foo
        fmdrr d0, r4, r5
        fmsr s2, r0
        fsitod d1, s2
        faddd d0, d1, d0
        fmrrd r0, r1, d0
        ldmfd sp!, {r4, r5, r7, pc}

Ignore the prologue and epilogue stuff for a second. Note
        mov r4, r0
        mov r5, r1
the copies to callee-save registers and the fact that they are only being used
by the fmdrr instruction. It would have been better had the fmdrr been
scheduled before the call and placed the result in a callee-save DPR register.
The two mov ops would not have been necessary.

//===---------------------------------------------------------------------===//

Calling convention related stuff:

* gcc's parameter passing implementation is terrible and we suffer as a result:

e.g.
struct s {
  double d1;
  int s1;
};

void foo(struct s S) {
  printf("%g, %d\n", S.d1, S.s1);
}

'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and
then reloads them to r1, r2, and r3 before issuing the call (r0 contains the
address of the format string):

        stmfd   sp!, {r7, lr}
        add     r7, sp, #0
        sub     sp, sp, #12
        stmia   sp, {r0, r1, r2}
        ldmia   sp, {r1-r2}
        ldr     r0, L5
        ldr     r3, [sp, #8]
L2:
        add     r0, pc, r0
        bl      L_printf$stub

Instead of a stmia, an ldmia, and an ldr, wouldn't it be better to do three
moves?

* Returning an aggregate type is even worse:

e.g.
struct s foo(void) {
  struct s S = {1.1, 2};
  return S;
}

        mov     ip, r0
        ldr     r0, L5
        sub     sp, sp, #12
L2:
        add     r0, pc, r0
        @ lr needed for prologue
        ldmia   r0, {r0, r1, r2}
        stmia   sp, {r0, r1, r2}
        stmia   ip, {r0, r1, r2}
        mov     r0, ip
        add     sp, sp, #12
        bx      lr

r0 (and later ip) is the hidden parameter from the caller to store the value
in. The first ldmia loads the constants into r0, r1, r2. The last stmia stores
r0, r1, r2 into the address passed in. However, there is one additional stmia
that stores r0, r1, and r2 to some stack location. That store is dead.

The llvm-gcc generated code looks like this:

csretcc void %foo(%struct.s* %agg.result) {
entry:
        %S = alloca %struct.s, align 4          ; <%struct.s*> [#uses=1]
        %memtmp = alloca %struct.s              ; <%struct.s*> [#uses=1]
        cast %struct.s* %S to sbyte*            ; <sbyte*>:0 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
        cast %struct.s* %agg.result to sbyte*           ; <sbyte*>:1 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
        cast %struct.s* %memtmp to sbyte*               ; <sbyte*>:2 [#uses=1]
        call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
        ret void
}

llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from the
constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated
into a number of loads and stores, or 2) custom lower small memcpy's to be
ldmia / stmia. I think option 2 is better, but the current register allocator
cannot allocate a chunk of registers at a time.

A feasible temporary solution is to use specific physical registers at
lowering time for small (<= 4 words?) transfer sizes.

* The ARM CSRet calling convention requires the hidden argument to be returned
by the callee.

//===---------------------------------------------------------------------===//

We can definitely do a better job on BB placement to eliminate some branches.
It's very common to see llvm-generated assembly code that looks like this:

LBB3:
        ...
LBB4:
        ...
        beq LBB3
        b LBB2

If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can
then eliminate the beq and turn the unconditional branch to LBB2 into a bne.

See McCat/18-imp/ComputeBoundingBoxes for an example.

//===---------------------------------------------------------------------===//

Pre-/post- indexed load / stores:

1) We should not make the pre/post- indexed load/store transform if the base
ptr is guaranteed to be live beyond the load/store. This can happen if the
base ptr is live out of the block we are performing the optimization in. e.g.

        mov r1, r2
        ldr r3, [r1], #4
        ...

vs.

        ldr r3, [r2]
        add r1, r2, #4
        ...

In most cases, this is just a wasted optimization. However, sometimes it can
negatively impact performance because two-address code is more restrictive
when it comes to scheduling.

Unfortunately, liveout information is currently unavailable during DAG combine
time.

2) Consider splitting an indexed load / store into a pair of add/sub +
load/store to solve #1 (in TwoAddressInstructionPass.cpp).

3) Enhance LSR to generate more opportunities for indexed ops (see the sketch
after this list).

4) Once we add support for multiple result patterns, write indexed load
patterns instead of C++ instruction selection code.

5) Use VLDM / VSTM to emulate indexed FP load / store.
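
The sketch promised for item 3: in a simple reduction loop, LSR should expose
the address update so the iv increment folds into a post-indexed load.

/* Illustrative only.  The desired codegen for the loop body is roughly
       ldr r3, [r0], #4    @ load and post-increment in one instruction
   rather than a plain ldr followed by a separate add r0, r0, #4. */
int sum(const int *p, int n) {
  int s = 0;
  for (int i = 0; i < n; i++)
    s += *p++;
  return s;
}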

//===---------------------------------------------------------------------===//

Implement support for some more tricky ways to materialize immediates.  For
example, to get 0xffff8000, we can use:

        mov r9, #&3f8000
        sub r9, r9, #&400000

//===---------------------------------------------------------------------===//

We sometimes generate multiple add / sub instructions to update sp in prologue
and epilogue if the inc / dec value is too large to fit in a single immediate
operand. In some cases, perhaps it might be better to load the value from a
constantpool instead.

//===---------------------------------------------------------------------===//

GCC generates significantly better code for this function.

int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
  int i = 0;

  if (StackPtr != 0) {
    while (StackPtr != 0 && i < (((LineLen) < (32768)) ? (LineLen) : (32768)))
      Line[i++] = Stack[--StackPtr];
    if (LineLen > 32768) {
      while (StackPtr != 0 && i < LineLen) {
        i++;
        --StackPtr;
      }
    }
  }
  return StackPtr;
}

//===---------------------------------------------------------------------===//

This should compile to the mlas instruction:
int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }

//===---------------------------------------------------------------------===//

At some point, we should triage these to see if they still apply to us:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663

http://www.inf.u-szeged.hu/gcc-arm/
http://citeseer.ist.psu.edu/debus04linktime.html

//===---------------------------------------------------------------------===//

gcc generates smaller code for this function at -O2 or -Os:

void foo(signed char* p) {
  if (*p == 3)
    bar();
  else if (*p == 4)
    baz();
  else if (*p == 5)
    quux();
}

llvm decides it's a good idea to turn the repeated if...else into a
binary tree, as if it were a switch; the resulting code requires one fewer
compare-and-branch when *p<=2 or *p==5, the same number if *p==4
or *p>6, and one more if *p==3.  So it should be a speed win
(on balance).  However, the revised code is larger, with 4 conditional
branches instead of 3.

More seriously, there is a byte->word extend before
each comparison, where there should be only one, and the condition codes
are not remembered when the same two values are compared twice.

//===---------------------------------------------------------------------===//

More LSR enhancements possible:

1. Teach LSR about pre- and post- indexed ops to allow the iv increment to be
merged into a load / store.
2. Allow iv reuse even when a type conversion is required. For example, i8
and i32 load / store addressing modes are identical (see the sketch below).
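
A hedged example for item 2; names and codegen details are illustrative:

/* The loop walks a char array and an int array in lockstep.  One induction
   variable could serve both accesses, since ldrb/strb and ldr support the
   same register-offset addressing modes:
       strb  [ra, ri]           @ i8 element
       ldr   [rb, ri, lsl #2]   @ i32 element
   rather than maintaining two separately incremented pointers. */
void narrow_and_wide(char *a, const int *b, int n) {
  for (int i = 0; i < n; i++)
    a[i] = (char)b[i];
}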

//===---------------------------------------------------------------------===//

This:

int foo(int a, int b, int c, int d) {
  long long acc = (long long)a * (long long)b;
  acc += (long long)c * (long long)d;
  return (int)(acc >> 32);
}

Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies
two signed 32-bit values to produce a 64-bit value, and accumulates this with
a 64-bit value.

We currently get this with both v4 and v6:

_foo:
        smull r1, r0, r1, r0
        smull r3, r2, r3, r2
        adds r3, r3, r1
        adc r0, r2, r0
        bx lr

//===---------------------------------------------------------------------===//

This:
#include <utility>
std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
{ return std::make_pair(a + b, a + b < a); }
bool no_overflow(unsigned a, unsigned b)
{ return !full_add(a, b).second; }

Should compile to:

_Z8full_addjj:
        adds    r2, r1, r2
        movcc   r1, #0
        movcs   r1, #1
        str     r2, [r0, #0]
        strb    r1, [r0, #4]
        mov     pc, lr

_Z11no_overflowjj:
        cmn     r0, r1
        movcs   r0, #0
        movcc   r0, #1
        mov     pc, lr

not:

__Z8full_addjj:
        add r3, r2, r1
        str r3, [r0]
        mov r2, #1
        mov r12, #0
        cmp r3, r1
        movlo r12, r2
        str r12, [r0, #+4]
        bx lr
__Z11no_overflowjj:
        add r3, r1, r0
        mov r2, #1
        mov r1, #0
        cmp r3, r0
        movhs r1, r2
        mov r0, r1
        bx lr

//===---------------------------------------------------------------------===//

Some of the NEON intrinsics may be appropriate for more general use, either
as target-independent intrinsics or perhaps elsewhere in the ARM backend.
Some of them may also be lowered to target-independent SDNodes, and perhaps
some new SDNodes could be added.

For example, maximum, minimum, and absolute value operations are well-defined
and standard operations, both for vector and scalar types.

The current NEON-specific intrinsics for count leading zeros and count one
bits could perhaps be replaced by the target-independent ctlz and ctpop
intrinsics.  It may also make sense to add a target-independent "ctls"
intrinsic for "count leading sign bits".  Likewise, the backend could use
the target-independent SDNodes for these operations.

ARMv6 has scalar saturating and halving adds and subtracts.  The same
intrinsics could possibly be used for both NEON's vector implementations of
those operations and the ARMv6 scalar versions.
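
For reference, a hedged C rendering of the scalar saturating add that ARMv6's
QADD implements (Q-flag side effects not modeled); a shared intrinsic would
let this scalar form and NEON's vector form use one representation:

#include <limits.h>

/* Signed saturating 32-bit add: on overflow, clamp to INT_MAX / INT_MIN. */
int qadd_sketch(int a, int b) {
  long long s = (long long)a + (long long)b;
  if (s > INT_MAX) return INT_MAX;
  if (s < INT_MIN) return INT_MIN;
  return (int)s;
}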

//===---------------------------------------------------------------------===//

ARM::MOVCCr is commutable (by flipping the condition). But we need to implement
ARMInstrInfo::commuteInstruction() to support it.

//===---------------------------------------------------------------------===//

Split out LDR (literal) from the normal ARM LDR instruction. Also consider
splitting LDR into imm12 and so_reg forms. This allows us to clean up some
code. e.g. ARMLoadStoreOptimizer does not need to look at LDR (literal) and
LDR (so_reg), while ARMConstantIslandPass only needs to worry about LDR
(literal).

//===---------------------------------------------------------------------===//

Constant island pass should make use of full range SoImm values for LEApcrel.
Be careful though as the last attempt caused infinite looping on lencod.

//===---------------------------------------------------------------------===//

Predication issue. This function:

extern unsigned array[128];
int foo(int x) {
  int y;
  y = array[x & 127];
  if (x & 128)
    y = 123456789 & (y >> 2);
  else
    y = 123456789 & y;
  return y;
}

compiles to:

_foo:
        and r1, r0, #127
        ldr r2, LCPI1_0
        ldr r2, [r2]
        ldr r1, [r2, +r1, lsl #2]
        mov r2, r1, lsr #2
        tst r0, #128
        moveq r2, r1
        ldr r0, LCPI1_1
        and r0, r2, r0
        bx lr

It would be better to do something like this, to fold the shift into the
conditional move:

        and r1, r0, #127
        ldr r2, LCPI1_0
        ldr r2, [r2]
        ldr r1, [r2, +r1, lsl #2]
        tst r0, #128
        movne r1, r1, lsr #2
        ldr r0, LCPI1_1
        and r0, r1, r0
        bx lr

It saves an instruction and a register.

//===---------------------------------------------------------------------===//

It might be profitable to cse MOVi16 if there are lots of 32-bit immediates
with the same bottom half.
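
A hedged illustration (constants chosen arbitrarily): both values below share
the bottom half 0x2222, so a single movw could be CSE'd and only the movt
would differ.

/* With MOVW/MOVT materialization, the two constants become
       movw rX, #0x2222 ; movt rX, #0x1111
       movw rY, #0x2222 ; movt rY, #0x3333
   and the two identical movw's are a CSE candidate. */
unsigned pick(int c) {
  return c ? 0x11112222u : 0x33332222u;
}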

//===---------------------------------------------------------------------===//

Robert Muth started working on an alternate jump table implementation that
does not put the tables in-line in the text.  This is more like the llvm
default jump table implementation.  This might be useful sometime.  Several
revisions of patches are on the mailing list, beginning at:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2009-June/022763.html

//===---------------------------------------------------------------------===//

Make use of the "rbit" instruction.
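
One source of candidates, as a hedged sketch: recognize the classic
bit-reversal loop and lower it to a single rbit.

#include <stdint.h>

/* A straightforward 32-bit bit reversal; rbit computes this in one
   instruction, so recognizing idioms like this one is the win. */
uint32_t reverse_bits(uint32_t x) {
  uint32_t r = 0;
  for (int i = 0; i < 32; i++) {
    r = (r << 1) | (x & 1);
    x >>= 1;
  }
  return r;
}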

//===---------------------------------------------------------------------===//

Take a look at test/CodeGen/Thumb2/machine-licm.ll. ARM should be taught how
to licm and cse the unnecessary load from cp#1.

//===---------------------------------------------------------------------===//

The CMN instruction sets the flags like an ADD instruction, while CMP sets
them like a subtract. Therefore, to be able to use CMN for comparisons other
than equality (the Z bit), we'll need additional logic to reverse the
conditionals associated with the comparison. Perhaps a pseudo-instruction for
the comparison, with a post-codegen pass to clean up and handle the condition
codes? See PR5694 for a testcase.
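
A hedged pair of examples; the equality form can already use CMN safely,
while the ordered form is where the condition-reversal logic would be needed:

/* a == -b  iff  a + b == 0, so "cmn r0, r1" plus an eq/ne conditional is
   valid: only the Z bit is consulted.

   a < -b is the tricky case: after CMN the N/V/C flags describe the
   addition a + b rather than a subtraction, so signed conditions like
   lt/ge cannot be reused as-is and would have to be remapped (or the
   transform avoided). */
int eq_neg(int a, int b) { return a == -b; }
int lt_neg(int a, int b) { return a < -b; }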