//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend.
//===---------------------------------------------------------------------===//

Reimplement 'select' in terms of 'SEL'.

* We would really like to support UXTAB16, but we need to prove that the
  add doesn't need to overflow between the two 16-bit chunks (see the sketch
  below).

* Implement pre/post increment support.  (e.g. PR935)
* Implement smarter constant generation for binops with large immediates.
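
A sketch of the UXTAB16 pattern mentioned above (hypothetical function; the
masks are the key part): each 16-bit half of 'a' gets one zero-extended byte
of 'b' added to it, which matches UXTAB16 only if the low 16-bit add provably
cannot carry into bit 16:

unsigned f(unsigned a, unsigned b) {
  /* a[15:0] += b[7:0]; a[31:16] += b[23:16], assuming no inter-half carry */
  return a + (b & 0x00ff00ff);
}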

A few ARMv6T2 ops should be pattern matched: BFI, SBFX, and UBFX.

Interesting optimization for PIC codegen on arm-linux:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43129

//===---------------------------------------------------------------------===//

Crazy idea:  Consider code that uses lots of 8-bit or 16-bit values.  By the
time regalloc happens, these values are now in a 32-bit register, usually with
the top-bits known to be sign or zero extended.  If spilled, we should be able
to spill these to an 8-bit or 16-bit stack slot, zero or sign extending as part
of the reload.

Doing this reduces the size of the stack frame (important for thumb etc), and
also increases the likelihood that we will be able to reload multiple values
from the stack with a single load.
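
A rough sketch of the idea (stack offset invented for illustration), for a
value known to be sign-extended from 8 bits:

        strb r0, [sp, #3]       @ spill only the 8 live bits
        ...
        ldrsb r0, [sp, #3]      @ reload; the sign extension is folded in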

//===---------------------------------------------------------------------===//

The constant island pass is in good shape.  Some cleanups might be desirable,
but there is unlikely to be much improvement in the generated code.

1.  There may be some advantage to trying to be smarter about the initial
placement, rather than putting everything at the end.

2.  There might be some compile-time efficiency to be had by representing
consecutive islands as a single block rather than multiple blocks.
3.  Use a priority queue to sort constant pool users in inverse order of
position so we always process the one closest to the end of the function
first. This may simplify CreateNewWater.

//===---------------------------------------------------------------------===//

Eliminate copysign custom expansion. We are still generating crappy code with
default expansion + if-conversion.

//===---------------------------------------------------------------------===//

Eliminate one instruction from:

define i32 @_Z6slow4bii(i32 %x, i32 %y) {
        %tmp = icmp sgt i32 %x, %y
        %retval = select i1 %tmp, i32 %x, i32 %y
        ret i32 %retval
}

__Z6slow4bii:
        cmp r0, r1
        movgt r1, r0
        mov r0, r1
        bx lr
=>

__Z6slow4bii:
        cmp r0, r1
        movle r0, r1
        bx lr

//===---------------------------------------------------------------------===//

Implement long long "X-3" with instructions that fold the immediate in. These
were disabled due to badness with the ARM carry flag on subtracts.
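
A sketch of the desired code (the standard ARM subtract-with-borrow idiom;
remember that on ARM the carry flag after a subtract means "no borrow", which
is the badness referred to above):

long long f(long long x) { return x - 3; }

        subs r0, r0, #3         @ low word, sets carry = no borrow
        sbc r1, r1, #0          @ high word, consumes the borrow
        bx lr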

//===---------------------------------------------------------------------===//

More load / store optimizations:
1) Better representation for block transfer? This is from Olden/power:

        fldd d0, [r4]
        fstd d0, [r4, #+32]
        fldd d0, [r4, #+8]
        fstd d0, [r4, #+40]
        fldd d0, [r4, #+16]
        fstd d0, [r4, #+48]
        fldd d0, [r4, #+24]
        fstd d0, [r4, #+56]

If we can spare the registers, it would be better to use fldm and fstm here.
Need major register allocator enhancement though.

2) Can we recognize the relative position of constantpool entries? i.e. Treat

        ldr r0, LCPI17_3
        ldr r1, LCPI17_4
        ldr r2, LCPI17_5

   as
        ldr r0, LCPI17
        ldr r1, LCPI17+4
        ldr r2, LCPI17+8

Then the ldr's can be combined into a single ldm. See Olden/power.

Note that for ARMv4, gcc uses ldmia to load a pair of 32-bit values to
represent a 64-bit double FP constant:

        adr r0, L6
        ldmia r0, {r0-r1}

        .align 2
L6:
        .long -858993459
        .long 1074318540

3) Struct copies appear to be done field by field instead of word by word, at
least sometimes:

struct foo { int x; short s; char c1; char c2; };
void cpy(struct foo*a, struct foo*b) { *a = *b; }

llvm code (-O2)
        ldrb r3, [r1, #+6]
        ldr r2, [r1]
        ldrb r12, [r1, #+7]
        ldrh r1, [r1, #+4]
        str r2, [r0]
        strh r1, [r0, #+4]
        strb r3, [r0, #+6]
        strb r12, [r0, #+7]
gcc code (-O2)
        ldmia r1, {r1-r2}
        stmia r0, {r1-r2}

In this benchmark, poor handling of aggregate copies has shown up as having a
large effect on size, and possibly speed as well (we don't have a good way to
measure on ARM).

//===---------------------------------------------------------------------===//

* Consider this silly example:

double bar(double x) {
  double r = foo(3.1);
  return x+r;
}

_bar:
        stmfd sp!, {r4, r5, r7, lr}
        add r7, sp, #8
        mov r4, r0
        mov r5, r1
        fldd d0, LCPI1_0
        fmrrd r0, r1, d0
        bl _foo
        fmdrr d0, r4, r5
        fmsr s2, r0
        fsitod d1, s2
        faddd d0, d1, d0
        fmrrd r0, r1, d0
        ldmfd sp!, {r4, r5, r7, pc}

Ignore the prologue and epilogue stuff for a second. Note
        mov r4, r0
        mov r5, r1
the copies to callee-save registers and the fact that they are only being used
by the fmdrr instruction. It would have been better had the fmdrr been
scheduled before the call and placed the result in a callee-save DPR register.
The two mov ops would not have been necessary.

//===---------------------------------------------------------------------===//

Calling convention related stuff:

* gcc's parameter passing implementation is terrible and we suffer as a result:

e.g.
struct s {
  double d1;
  int s1;
};

void foo(struct s S) {
  printf("%g, %d\n", S.d1, S.s1);
}

'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and
then reloads them to r1, r2, and r3 before issuing the call (r0 contains the
address of the format string):

        stmfd sp!, {r7, lr}
        add r7, sp, #0
        sub sp, sp, #12
        stmia sp, {r0, r1, r2}
        ldmia sp, {r1-r2}
        ldr r0, L5
        ldr r3, [sp, #8]
L2:
        add r0, pc, r0
        bl L_printf$stub

Instead of the stmia, ldmia, and ldr, wouldn't it be better to do three moves?

* Returning an aggregate type is even worse:

e.g.
struct s foo(void) {
  struct s S = {1.1, 2};
  return S;
}

        mov ip, r0
        ldr r0, L5
        sub sp, sp, #12
L2:
        add r0, pc, r0
        @ lr needed for prologue
        ldmia r0, {r0, r1, r2}
        stmia sp, {r0, r1, r2}
        stmia ip, {r0, r1, r2}
        mov r0, ip
        add sp, sp, #12
        bx lr

r0 (and later ip) is the hidden parameter from the caller: the address to
store the return value in. The first ldmia loads the constants into r0, r1,
r2. The last stmia stores r0, r1, r2 into the address passed in. However,
there is one additional stmia that stores r0, r1, and r2 to some stack
location. That store is dead.

The llvm-gcc generated code looks like this:

csretcc void %foo(%struct.s* %agg.result) {
entry:
        %S = alloca %struct.s, align 4          ; <%struct.s*> [#uses=1]
        %memtmp = alloca %struct.s              ; <%struct.s*> [#uses=1]
        cast %struct.s* %S to sbyte*            ; <sbyte*>:0 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
        cast %struct.s* %agg.result to sbyte*   ; <sbyte*>:1 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
        cast %struct.s* %memtmp to sbyte*       ; <sbyte*>:2 [#uses=1]
        call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
        ret void
}

llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from the
constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated
into a number of loads and stores, or 2) custom lower memcpy (of small size)
to be ldmia / stmia. I think option 2 is better but the current register
allocator cannot allocate a chunk of registers at a time.

A feasible temporary solution is to use specific physical registers at
lowering time for small (<= 4 words?) transfer sizes.

* The ARM CSRet calling convention requires the hidden argument to be returned
  by the callee.

//===---------------------------------------------------------------------===//

We can definitely do a better job on BB placement to eliminate some branches.
It's very common to see llvm-generated assembly code that looks like this:

LBB3:
        ...
LBB4:
        ...
        beq LBB3
        b LBB2

If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can
then eliminate the beq and turn the unconditional branch to LBB2 into a bne.
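
A sketch of the resulting layout (same blocks reordered; the bne is the
inverted form of the original beq, falling through into LBB3):

LBB4:
        ...
        bne LBB2
LBB3:
        ...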

See McCat/18-imp/ComputeBoundingBoxes for an example.

//===---------------------------------------------------------------------===//

Pre-/post- indexed load / stores:

1) We should not make the pre/post- indexed load/store transform if the base
ptr is guaranteed to be live beyond the load/store. This can happen if the
base ptr is live out of the block in which we are performing the optimization.
e.g.

        mov r1, r2
        ldr r3, [r1], #4
        ...

vs.

        ldr r3, [r2]
        add r1, r2, #4
        ...

In most cases, this is just a wasted optimization. However, sometimes it can
negatively impact performance because two-address code is more restrictive
when it comes to scheduling.

Unfortunately, liveout information is currently unavailable during DAG combine
time.

2) Consider splitting an indexed load / store into a pair of add/sub +
load/store to solve #1 (in TwoAddressInstructionPass.cpp).

3) Enhance LSR to generate more opportunities for indexed ops.

4) Once we add support for multiple result patterns, write indexed load
patterns instead of C++ instruction selection code.

5) Use VLDM / VSTM to emulate indexed FP load / store.

//===---------------------------------------------------------------------===//

Implement support for some more tricky ways to materialize immediates. For
example, to get 0xffff8000, we can use:

        mov r9, #&3f8000
        sub r9, r9, #&400000

//===---------------------------------------------------------------------===//

We sometimes generate multiple add / sub instructions to update sp in prologue
and epilogue if the inc / dec value is too large to fit in a single immediate
operand. In some cases, perhaps it might be better to load the value from a
constantpool instead.
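
For example, a 4104-byte adjustment (a made-up size that does not fit ARM's
8-bit rotated immediate encoding) currently takes two instructions:

        sub sp, sp, #4096
        sub sp, sp, #8

versus a hypothetical constantpool-based sequence, at the cost of a load and
a scratch register:

        ldr r12, .LCPI_frame    @ .long 4104
        sub sp, sp, r12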

//===---------------------------------------------------------------------===//

GCC generates significantly better code for this function.

int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
    int i = 0;

    if (StackPtr != 0) {
        while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768)))
            Line[i++] = Stack[--StackPtr];
        if (LineLen > 32768)
        {
            while (StackPtr != 0 && i < LineLen)
            {
                i++;
                --StackPtr;
            }
        }
    }
    return StackPtr;
}

//===---------------------------------------------------------------------===//

This should compile to the mlas instruction:
int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }

//===---------------------------------------------------------------------===//

At some point, we should triage these to see if they still apply to us:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663

http://www.inf.u-szeged.hu/gcc-arm/
http://citeseer.ist.psu.edu/debus04linktime.html

//===---------------------------------------------------------------------===//

gcc generates smaller code for this function at -O2 or -Os:

void foo(signed char* p) {
  if (*p == 3)
    bar();
  else if (*p == 4)
    baz();
  else if (*p == 5)
    quux();
}

llvm decides it's a good idea to turn the repeated if...else into a binary
tree, as if it were a switch; the resulting code requires one fewer
compare-and-branch when *p<=2 or *p==5, the same number if *p==4 or *p>6, and
one more if *p==3. So it should be a speed win (on balance). However, the
revised code is larger, with 4 conditional branches instead of 3.

More seriously, there is a byte->word extend before
each comparison, where there should be only one, and the condition codes
are not remembered when the same two values are compared twice.

//===---------------------------------------------------------------------===//

More LSR enhancements possible:

| 408 | 1. Teach LSR about pre- and post- indexed ops to allow iv increment be merged |
| 409 | in a load / store. |
| 410 | 2. Allow iv reuse even when a type conversion is required. For example, i8 |
| 411 | and i32 load / store addressing modes are identical. |
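
A hypothetical example for item 2 (function name invented): one induction
variable could serve both accesses, since ldrb and ldr accept the same
addressing-mode forms:

void acc(unsigned char *p, int *q, int n) {
  int i;
  for (i = 0; i < n; i++)
    q[i] += p[i];       /* the i8 and i32 accesses could share one iv */
}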

//===---------------------------------------------------------------------===//

This:

int foo(int a, int b, int c, int d) {
  long long acc = (long long)a * (long long)b;
  acc += (long long)c * (long long)d;
  return (int)(acc >> 32);
}

Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies
two signed 32-bit values to produce a 64-bit value, and accumulates this with
a 64-bit value.

We currently get this with both v4 and v6:

_foo:
        smull r1, r0, r1, r0
        smull r3, r2, r3, r2
        adds r3, r3, r1
        adc r0, r2, r0
        bx lr

//===---------------------------------------------------------------------===//

This:
#include <algorithm>
std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
{ return std::make_pair(a + b, a + b < a); }
bool no_overflow(unsigned a, unsigned b)
{ return !full_add(a, b).second; }

Should compile to:

_Z8full_addjj:
        adds r2, r1, r2
        movcc r1, #0
        movcs r1, #1
        str r2, [r0, #0]
        strb r1, [r0, #4]
        mov pc, lr

_Z11no_overflowjj:
        cmn r0, r1
        movcs r0, #0
        movcc r0, #1
        mov pc, lr

not:

__Z8full_addjj:
        add r3, r2, r1
        str r3, [r0]
        mov r2, #1
        mov r12, #0
        cmp r3, r1
        movlo r12, r2
        str r12, [r0, #+4]
        bx lr
__Z11no_overflowjj:
        add r3, r1, r0
        mov r2, #1
        mov r1, #0
        cmp r3, r0
        movhs r1, r2
        mov r0, r1
        bx lr

//===---------------------------------------------------------------------===//

Some of the NEON intrinsics may be appropriate for more general use, either
as target-independent intrinsics or perhaps elsewhere in the ARM backend.
Some of them may also be lowered to target-independent SDNodes, and perhaps
some new SDNodes could be added.

For example, maximum, minimum, and absolute value operations are well-defined
and standard operations, both for vector and scalar types.

The current NEON-specific intrinsics for count leading zeros and count one
bits could perhaps be replaced by the target-independent ctlz and ctpop
intrinsics. It may also make sense to add a target-independent "ctls"
intrinsic for "count leading sign bits". Likewise, the backend could use
the target-independent SDNodes for these operations.
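
For reference, one plausible definition of "ctls" (this matches GCC's
__builtin_clrsb: leading bits equal to the sign bit, not counting the sign
bit itself; the loop below is a semantics sketch, not a proposed
implementation):

int ctls(int x) {
  int sign = x < 0, n = 0, i;
  for (i = 30; i >= 0 && ((x >> i) & 1) == sign; i--)
    n++;                /* ctls(0) == ctls(-1) == 31, ctls(1) == 30 */
  return n;
}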

ARMv6 has scalar saturating and halving adds and subtracts. The same
intrinsics could possibly be used for both NEON's vector implementations of
those operations and the ARMv6 scalar versions.

//===---------------------------------------------------------------------===//

ARM::MOVCCr is commutable (by flipping the condition). But we need to implement
ARMInstrInfo::commuteInstruction() to support it.
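
The underlying identity, written at the IR level (a generic illustration, not
the exact MOVCCr operand order):

        %not.c = xor i1 %c, true
        %x = select i1 %not.c, i32 %b, i32 %a   ; same as select i1 %c, i32 %a, i32 %b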

//===---------------------------------------------------------------------===//

Split out LDR (literal) from the normal ARM LDR instruction. Also consider
splitting LDR into imm12 and so_reg forms. This allows us to clean up some
code. e.g. ARMLoadStoreOptimizer does not need to look at LDR (literal) and
LDR (so_reg) while ARMConstantIslandPass only needs to worry about LDR
(literal).

//===---------------------------------------------------------------------===//

Constant island pass should make use of full range SoImm values for LEApcrel.
Be careful though as the last attempt caused infinite looping on lencod.

//===---------------------------------------------------------------------===//

Predication issue. This function:

extern unsigned array[ 128 ];
int foo( int x ) {
  int y;
  y = array[ x & 127 ];
  if ( x & 128 )
    y = 123456789 & ( y >> 2 );
  else
    y = 123456789 & y;
  return y;
}

compiles to:

_foo:
        and r1, r0, #127
        ldr r2, LCPI1_0
        ldr r2, [r2]
        ldr r1, [r2, +r1, lsl #2]
        mov r2, r1, lsr #2
        tst r0, #128
        moveq r2, r1
        ldr r0, LCPI1_1
        and r0, r2, r0
        bx lr

It would be better to do something like this, to fold the shift into the
conditional move:

        and r1, r0, #127
        ldr r2, LCPI1_0
        ldr r2, [r2]
        ldr r1, [r2, +r1, lsl #2]
        tst r0, #128
        movne r1, r1, lsr #2
        ldr r0, LCPI1_1
        and r0, r1, r0
        bx lr

This saves an instruction and a register.

//===---------------------------------------------------------------------===//

It might be profitable to cse MOVi16 if there are lots of 32-bit immediates
with the same bottom half.
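
For example (made-up values sharing a bottom half):

        movw r0, #0x5678
        movw r1, #0x5678        @ redundant; could be mov r1, r0
        movt r0, #0x1234        @ r0 = 0x12345678
        movt r1, #0xabcd        @ r1 = 0xabcd5678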

//===---------------------------------------------------------------------===//

Robert Muth started working on an alternate jump table implementation that
does not put the tables in-line in the text. This is more like the llvm
default jump table implementation. This might be useful sometime. Several
revisions of patches are on the mailing list, beginning at:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2009-June/022763.html

//===---------------------------------------------------------------------===//

Make use of the "rbit" instruction.
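
For example, a portable bit reversal like the following (a sketch; the masking
steps reverse the bits within each byte, and the final bswap reverses the byte
order) should ideally become a single rbit on targets that have it:

unsigned reverse_bits(unsigned x) {
  x = ((x & 0x55555555) << 1) | ((x >> 1) & 0x55555555);
  x = ((x & 0x33333333) << 2) | ((x >> 2) & 0x33333333);
  x = ((x & 0x0f0f0f0f) << 4) | ((x >> 4) & 0x0f0f0f0f);
  return __builtin_bswap32(x);
}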

//===---------------------------------------------------------------------===//

Take a look at test/CodeGen/Thumb2/machine-licm.ll. ARM should be taught how
to licm and cse the unnecessary load from cp#1.

//===---------------------------------------------------------------------===//

The CMN instruction sets the flags like an ADD instruction, while CMP sets
them like a subtract. Therefore to be able to use CMN for comparisons other
than the Z bit, we'll need additional logic to reverse the conditionals
associated with the comparison. Perhaps a pseudo-instruction for the
comparison, with a post-codegen pass to clean up and handle the condition
codes? See PR5694 for testcase.
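
A small motivating case (hypothetical function): equality against a negated
value only needs the Z bit, so CMN is safe there; ordered comparisons are
where the condition-reversal logic above comes in:

int is_neg_of(int a, int b) {
  return a == -b;       /* cmn r0, r1; only the Z bit, no reversal needed */
}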

//===---------------------------------------------------------------------===//

Given the following on armv5:
int test1(int A, int B) {
  return (A&-8388481)|(B&8388480);
}

We currently generate:
        ldr r2, .LCPI0_0
        and r0, r0, r2
        ldr r2, .LCPI0_1
        and r1, r1, r2
        orr r0, r1, r0
        bx lr

We should be able to replace the second ldr+and with a bic (i.e. reuse the
constant which was already loaded). Not sure what's necessary to do that.
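
Something like this, given that .LCPI0_0 holds -8388481 (whose complement is
8388480):

        ldr r2, .LCPI0_0
        and r0, r0, r2
        bic r1, r1, r2
        orr r0, r1, r0
        bx lr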

//===---------------------------------------------------------------------===//

The code generated for bswap on armv4/5 (CPUs without rev) is less than ideal:

int a(int x) { return __builtin_bswap32(x); }

a:
        mov r1, #255, 24
        mov r2, #255, 16
        and r1, r1, r0, lsr #8
        and r2, r2, r0, lsl #8
        orr r1, r1, r0, lsr #24
        orr r0, r2, r0, lsl #24
        orr r0, r0, r1
        bx lr

Something like the following would be better (fewer instructions/registers):
        eor r1, r0, r0, ror #16
        bic r1, r1, #0xff0000
        mov r1, r1, lsr #8
        eor r0, r1, r0, ror #8
        bx lr

A custom Thumb version would also be a slight improvement over the generic
version.

//===---------------------------------------------------------------------===//

Consider the following simple C code:

void foo(unsigned char *a, unsigned char *b, int *c) {
  if ((*a | *b) == 0) *c = 0;
}

currently llvm-gcc generates something like this (nice branchless code I'd say):

        ldrb r0, [r0]
        ldrb r1, [r1]
        orr r0, r1, r0
        tst r0, #255
        moveq r0, #0
        streq r0, [r2]
        bx lr

Note that both "tst" and "moveq" are redundant.

//===---------------------------------------------------------------------===//

When loading immediate constants with movt/movw, if there are multiple
constants needed with the same low 16 bits, and those values are not live at
the same time, it would be possible to use a single movw instruction, followed
by multiple movt instructions to rewrite the high bits to different values.
For example:

  volatile store i32 -1, i32* inttoptr (i32 1342210076 to i32*), align 4, !tbaa !0
  volatile store i32 -1, i32* inttoptr (i32 1342341148 to i32*), align 4, !tbaa !0

is compiled and optimized to:

        movw r0, #32796
        mov.w r1, #-1
        movt r0, #20480
        str r1, [r0]
        movw r0, #32796 @ <= this MOVW is not needed, value is there already
        movt r0, #20482
        str r1, [r0]

//===---------------------------------------------------------------------===//

Improve codegen for selects:
 if (x != 0) x = 1
 if (x == 1) x = 1

ARM codegen used to look like this:
        mov r1, r0
        cmp r1, #1
        mov r0, #0
        moveq r0, #1

The naive lowering selects between two different values. It should recognize
that the test is an equality test, making this a conditional move rather than
a general select:
        cmp r0, #1
        movne r0, #0

Currently this is an ARM-specific dag combine. We probably should make it into
a target-neutral one.

//===---------------------------------------------------------------------===//

Optimize unnecessary checks for zero with __builtin_clz/ctz. Those builtins
are specified to be undefined at zero, so portable code must check for zero
and handle it as a special case. That is unnecessary on ARM where those
operations are implemented in a way that is well-defined for zero. For
example:

int f(int x) { return x ? __builtin_clz(x) : sizeof(int)*8; }

should just be implemented with a CLZ instruction. Since there are other
targets, e.g., PPC, that share this behavior, it would be best to implement
this in a target-independent way: we should probably fold that (when using
"undefined at zero" semantics) to set the "defined at zero" bit and have
the code generator expand out the right code.