//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend.
//===---------------------------------------------------------------------===//

Reimplement 'select' in terms of 'SEL'.
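
SEL picks each byte of the result from one of its two operands based on the
APSR.GE bits set by the SIMD compare/arithmetic ops. A rough C model of the
semantics (the function name and the explicit `ge` mask parameter are
illustrative, not an actual ARM API):

```c
#include <stdint.h>

/* Sketch of SEL: result byte i comes from a when GE[i] is set, else from b.
   In hardware the GE bits live in the APSR; here they are an argument. */
uint32_t sel(uint32_t a, uint32_t b, unsigned ge /* GE[3:0] mask */) {
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t byte = ((ge >> i) & 1u) ? (a >> (8 * i)) : (b >> (8 * i));
        r |= (byte & 0xFFu) << (8 * i);
    }
    return r;
}
```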

* We would really like to support UXTAB16, but we need to prove that the
  add cannot overflow between the two 16-bit chunks.
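
For reference, a C model of what UXTAB16 computes (a sketch; the name and
per-lane formulation are ours): each 16-bit lane of the first operand gets one
zero-extended byte of the second added to it, with no carry between lanes —
which is exactly why the matcher must prove the IR add cannot carry out of a
16-bit chunk.

```c
#include <stdint.h>

/* Sketch of UXTAB16 (rotation amount 0): two independent 16-bit adds,
   each taking a zero-extended byte of b, wrapping within its own lane. */
uint32_t uxtab16(uint32_t a, uint32_t b) {
    uint16_t lo = (uint16_t)((a & 0xFFFFu) + (b & 0xFFu));
    uint16_t hi = (uint16_t)((a >> 16) + ((b >> 16) & 0xFFu));
    return ((uint32_t)hi << 16) | lo;
}
```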

* Implement pre/post increment support. (e.g. PR935)
* Implement smarter constant generation for binops with large immediates.

A few ARMv6T2 ops should be pattern matched: BFI, SBFX, and UBFX.
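
For instance, these shift-based C idioms are the kind of thing SBFX/UBFX should
match (the field positions are arbitrary; the signed version relies on
arithmetic right shift and on left-shifting into the sign bit, which is not
strictly portable C but is what the backend sees after legalization):

```c
/* Extract the 8-bit field at bits 11:4, sign-extended: an SBFX candidate. */
int extract_signed(int x) {
    return (x << 20) >> 24;
}

/* The same field zero-extended: a UBFX candidate. */
unsigned extract_unsigned(unsigned x) {
    return (x >> 4) & 0xFFu;
}
```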

Interesting optimization for PIC codegen on arm-linux:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43129

//===---------------------------------------------------------------------===//

Crazy idea: Consider code that uses lots of 8-bit or 16-bit values. By the
time regalloc happens, these values are now in a 32-bit register, usually with
the top bits known to be sign or zero extended. If spilled, we should be able
to spill these to an 8-bit or 16-bit stack slot, zero or sign extending as part
of the reload.

Doing this reduces the size of the stack frame (important for thumb etc), and
also increases the likelihood that we will be able to reload multiple values
from the stack with a single load.

//===---------------------------------------------------------------------===//

The constant island pass is in good shape. Some cleanups might be desirable,
but there is unlikely to be much improvement in the generated code.

1. There may be some advantage to trying to be smarter about the initial
placement, rather than putting everything at the end.

2. There might be some compile-time efficiency to be had by representing
consecutive islands as a single block rather than multiple blocks.

3. Use a priority queue to sort constant pool users in inverse order of
   position so we always process the one closest to the end of the function
   first. This may simplify CreateNewWater.

//===---------------------------------------------------------------------===//

Eliminate copysign custom expansion. We are still generating crappy code with
default expansion + if-conversion.

//===---------------------------------------------------------------------===//

Eliminate one instruction from:

define i32 @_Z6slow4bii(i32 %x, i32 %y) {
    %tmp = icmp sgt i32 %x, %y
    %retval = select i1 %tmp, i32 %x, i32 %y
    ret i32 %retval
}

__Z6slow4bii:
    cmp r0, r1
    movgt r1, r0
    mov r0, r1
    bx lr
=>

__Z6slow4bii:
    cmp r0, r1
    movle r0, r1
    bx lr

//===---------------------------------------------------------------------===//
Rafael Espindola4dfab982006-12-11 23:56:10 +000073
Evan Chenga8e29892007-01-19 07:51:42 +000074Implement long long "X-3" with instructions that fold the immediate in. These
75were disabled due to badness with the ARM carry flag on subtracts.

//===---------------------------------------------------------------------===//

More load / store optimizations:
1) Better representation for block transfer? This is from Olden/power:

    fldd d0, [r4]
    fstd d0, [r4, #+32]
    fldd d0, [r4, #+8]
    fstd d0, [r4, #+40]
    fldd d0, [r4, #+16]
    fstd d0, [r4, #+48]
    fldd d0, [r4, #+24]
    fstd d0, [r4, #+56]

If we can spare the registers, it would be better to use fldm and fstm here.
Need major register allocator enhancement though.

2) Can we recognize the relative position of constantpool entries? i.e. Treat

    ldr r0, LCPI17_3
    ldr r1, LCPI17_4
    ldr r2, LCPI17_5

   as

    ldr r0, LCPI17
    ldr r1, LCPI17+4
    ldr r2, LCPI17+8

   Then the ldr's can be combined into a single ldm. See Olden/power.

Note for ARM v4 gcc uses ldmia to load a pair of 32-bit values to represent a
double 64-bit FP constant:

    adr r0, L6
    ldmia r0, {r0-r1}

    .align 2
L6:
    .long -858993459
    .long 1074318540

3) struct copies appear to be done field by field
instead of by words, at least sometimes:

struct foo { int x; short s; char c1; char c2; };
void cpy(struct foo*a, struct foo*b) { *a = *b; }

llvm code (-O2)
    ldrb r3, [r1, #+6]
    ldr r2, [r1]
    ldrb r12, [r1, #+7]
    ldrh r1, [r1, #+4]
    str r2, [r0]
    strh r1, [r0, #+4]
    strb r3, [r0, #+6]
    strb r12, [r0, #+7]
gcc code (-O2)
    ldmia r1, {r1-r2}
    stmia r0, {r1-r2}

In this benchmark poor handling of aggregate copies has shown up as
having a large effect on size, and possibly speed as well (we don't have
a good way to measure on ARM).

//===---------------------------------------------------------------------===//

* Consider this silly example:

double bar(double x) {
  double r = foo(3.1);
  return x+r;
}

_bar:
    stmfd sp!, {r4, r5, r7, lr}
    add r7, sp, #8
    mov r4, r0
    mov r5, r1
    fldd d0, LCPI1_0
    fmrrd r0, r1, d0
    bl _foo
    fmdrr d0, r4, r5
    fmsr s2, r0
    fsitod d1, s2
    faddd d0, d1, d0
    fmrrd r0, r1, d0
    ldmfd sp!, {r4, r5, r7, pc}

Ignore the prologue and epilogue stuff for a second. Note
    mov r4, r0
    mov r5, r1
the copies to callee-save registers and the fact they are only being used by
the fmdrr instruction. It would have been better had the fmdrr been scheduled
before the call and placed the result in a callee-save DPR register. The two
mov ops would not have been necessary.

//===---------------------------------------------------------------------===//

Calling convention related stuff:

* gcc's parameter passing implementation is terrible and we suffer as a result:

e.g.
struct s {
    double d1;
    int s1;
};

void foo(struct s S) {
    printf("%g, %d\n", S.d1, S.s1);
}

'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and
then reloads them to r1, r2, and r3 before issuing the call (r0 contains the
address of the format string):

    stmfd sp!, {r7, lr}
    add r7, sp, #0
    sub sp, sp, #12
    stmia sp, {r0, r1, r2}
    ldmia sp, {r1-r2}
    ldr r0, L5
    ldr r3, [sp, #8]
L2:
    add r0, pc, r0
    bl L_printf$stub

Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves?

* Returning an aggregate type is even worse:

e.g.
struct s foo(void) {
    struct s S = {1.1, 2};
    return S;
}

    mov ip, r0
    ldr r0, L5
    sub sp, sp, #12
L2:
    add r0, pc, r0
    @ lr needed for prologue
    ldmia r0, {r0, r1, r2}
    stmia sp, {r0, r1, r2}
    stmia ip, {r0, r1, r2}
    mov r0, ip
    add sp, sp, #12
    bx lr

r0 (and later ip) is the hidden parameter from the caller to store the value
in. The first ldmia loads the constants into r0, r1, r2. The last stmia stores
r0, r1, r2 into the address passed in. However, there is one additional stmia
that stores r0, r1, and r2 to some stack location. The store is dead.

The llvm-gcc generated code looks like this:

csretcc void %foo(%struct.s* %agg.result) {
entry:
    %S = alloca %struct.s, align 4  ; <%struct.s*> [#uses=1]
    %memtmp = alloca %struct.s  ; <%struct.s*> [#uses=1]
    cast %struct.s* %S to sbyte*  ; <sbyte*>:0 [#uses=2]
    call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
    cast %struct.s* %agg.result to sbyte*  ; <sbyte*>:1 [#uses=2]
    call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
    cast %struct.s* %memtmp to sbyte*  ; <sbyte*>:2 [#uses=1]
    call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
    ret void
}

llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from
constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated
into a number of loads and stores, or 2) custom lower memcpy (of small size)
to be ldmia / stmia. I think option 2 is better but the current register
allocator cannot allocate a chunk of registers at a time.

A feasible temporary solution is to use specific physical registers at
lowering time for small (<= 4 words?) transfer size.

* ARM CSRet calling convention requires the hidden argument to be returned by
the callee.

//===---------------------------------------------------------------------===//

We can definitely do a better job on BB placements to eliminate some branches.
It's very common to see llvm generated assembly code that looks like this:

LBB3:
  ...
LBB4:
  ...
  beq LBB3
  b LBB2

If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can
then eliminate the beq and turn the unconditional branch to LBB2 into a bne.

See McCat/18-imp/ComputeBoundingBoxes for an example.

//===---------------------------------------------------------------------===//

Pre-/post- indexed load / stores:

1) We should not make the pre/post- indexed load/store transform if the base
ptr is guaranteed to be live beyond the load/store. This can happen if the
base ptr is live out of the block we are performing the optimization in. e.g.

mov r1, r2
ldr r3, [r1], #4
...

vs.

ldr r3, [r2]
add r1, r2, #4
...

In most cases, this is just a wasted optimization. However, sometimes it can
negatively impact the performance because two-address code is more restrictive
when it comes to scheduling.

Unfortunately, liveout information is currently unavailable during DAG combine
time.

2) Consider splitting an indexed load / store into a pair of add/sub +
   load/store to solve #1 (in TwoAddressInstructionPass.cpp).

3) Enhance LSR to generate more opportunities for indexed ops.

4) Once we add support for multiple result patterns, write indexed load
   patterns instead of C++ instruction selection code.

5) Use VLDM / VSTM to emulate indexed FP load / store.

//===---------------------------------------------------------------------===//

Implement support for some more tricky ways to materialize immediates. For
example, to get 0xffff8000, we can use:

mov r9, #&3f8000
sub r9, r9, #&400000

//===---------------------------------------------------------------------===//

We sometimes generate multiple add / sub instructions to update sp in prologue
and epilogue if the inc / dec value is too large to fit in a single immediate
operand. In some cases, perhaps it might be better to load the value from a
constantpool instead.

//===---------------------------------------------------------------------===//

GCC generates significantly better code for this function.

int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
  int i = 0;

  if (StackPtr != 0) {
    while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768)))
      Line[i++] = Stack[--StackPtr];
    if (LineLen > 32768)
    {
      while (StackPtr != 0 && i < LineLen)
      {
        i++;
        --StackPtr;
      }
    }
  }
  return StackPtr;
}

//===---------------------------------------------------------------------===//

This should compile to the mlas instruction:
int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }

//===---------------------------------------------------------------------===//

At some point, we should triage these to see if they still apply to us:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663

http://www.inf.u-szeged.hu/gcc-arm/
http://citeseer.ist.psu.edu/debus04linktime.html

//===---------------------------------------------------------------------===//

gcc generates smaller code for this function at -O2 or -Os:

void foo(signed char* p) {
  if (*p == 3)
    bar();
  else if (*p == 4)
    baz();
  else if (*p == 5)
    quux();
}

llvm decides it's a good idea to turn the repeated if...else into a
binary tree, as if it were a switch; the resulting code requires one fewer
compare-and-branch when *p<=2 or *p==5, the same number if *p==4
or *p>6, and one more if *p==3. So it should be a speed win
(on balance). However, the revised code is larger, with 4 conditional
branches instead of 3.

More seriously, there is a byte->word extend before
each comparison, where there should be only one, and the condition codes
are not remembered when the same two values are compared twice.

//===---------------------------------------------------------------------===//

More LSR enhancements possible:

1. Teach LSR about pre- and post- indexed ops to allow the iv increment to be
   merged into a load / store.
2. Allow iv reuse even when a type conversion is required. For example, i8
   and i32 load / store addressing modes are identical.

//===---------------------------------------------------------------------===//

This:

int foo(int a, int b, int c, int d) {
  long long acc = (long long)a * (long long)b;
  acc += (long long)c * (long long)d;
  return (int)(acc >> 32);
}

Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies
two signed 32-bit values to produce a 64-bit value, and accumulates this with
a 64-bit value.

We currently get this with both v4 and v6:

_foo:
    smull r1, r0, r1, r0
    smull r3, r2, r3, r2
    adds r3, r3, r1
    adc r0, r2, r0
    bx lr

//===---------------------------------------------------------------------===//

This:
  #include <algorithm>
  std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
  { return std::make_pair(a + b, a + b < a); }
  bool no_overflow(unsigned a, unsigned b)
  { return !full_add(a, b).second; }

Should compile to:

_Z8full_addjj:
    adds r2, r1, r2
    movcc r1, #0
    movcs r1, #1
    str r2, [r0, #0]
    strb r1, [r0, #4]
    mov pc, lr

_Z11no_overflowjj:
    cmn r0, r1
    movcs r0, #0
    movcc r0, #1
    mov pc, lr

not:

__Z8full_addjj:
    add r3, r2, r1
    str r3, [r0]
    mov r2, #1
    mov r12, #0
    cmp r3, r1
    movlo r12, r2
    str r12, [r0, #+4]
    bx lr
__Z11no_overflowjj:
    add r3, r1, r0
    mov r2, #1
    mov r1, #0
    cmp r3, r0
    movhs r1, r2
    mov r0, r1
    bx lr

//===---------------------------------------------------------------------===//

Some of the NEON intrinsics may be appropriate for more general use, either
as target-independent intrinsics or perhaps elsewhere in the ARM backend.
Some of them may also be lowered to target-independent SDNodes, and perhaps
some new SDNodes could be added.

For example, maximum, minimum, and absolute value operations are well-defined
and standard operations, both for vector and scalar types.

The current NEON-specific intrinsics for count leading zeros and count one
bits could perhaps be replaced by the target-independent ctlz and ctpop
intrinsics. It may also make sense to add a target-independent "ctls"
intrinsic for "count leading sign bits". Likewise, the backend could use
the target-independent SDNodes for these operations.

ARMv6 has scalar saturating and halving adds and subtracts. The same
intrinsics could possibly be used for both NEON's vector implementations of
those operations and the ARMv6 scalar versions.
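
As a sketch of the shared semantics, signed saturating 32-bit add — what the
ARMv6 QADD scalar instruction and NEON's per-lane vqadd both compute — can be
modeled in C like this (the function name is ours, not an existing intrinsic):

```c
#include <stdint.h>

/* C model of signed saturating 32-bit add: the operation one shared
   intrinsic could describe for both the scalar and vector forms. */
int32_t qadd32(int32_t a, int32_t b) {
    int64_t sum = (int64_t)a + (int64_t)b;  /* widen to avoid overflow */
    if (sum > INT32_MAX) return INT32_MAX;  /* clamp on positive overflow */
    if (sum < INT32_MIN) return INT32_MIN;  /* clamp on negative overflow */
    return (int32_t)sum;
}
```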

//===---------------------------------------------------------------------===//

Split out LDR (literal) from the normal ARM LDR instruction. Also consider
splitting LDR into imm12 and so_reg forms. This allows us to clean up some
code. e.g. ARMLoadStoreOptimizer does not need to look at LDR (literal) and
LDR (so_reg) while ARMConstantIslandPass only needs to worry about LDR
(literal).

//===---------------------------------------------------------------------===//

Constant island pass should make use of full range SoImm values for LEApcrel.
Be careful though as the last attempt caused infinite looping on lencod.

//===---------------------------------------------------------------------===//

Predication issue. This function:

extern unsigned array[ 128 ];
int foo( int x ) {
  int y;
  y = array[ x & 127 ];
  if ( x & 128 )
    y = 123456789 & ( y >> 2 );
  else
    y = 123456789 & y;
  return y;
}

compiles to:

_foo:
    and r1, r0, #127
    ldr r2, LCPI1_0
    ldr r2, [r2]
    ldr r1, [r2, +r1, lsl #2]
    mov r2, r1, lsr #2
    tst r0, #128
    moveq r2, r1
    ldr r0, LCPI1_1
    and r0, r2, r0
    bx lr

It would be better to do something like this, to fold the shift into the
conditional move:

    and r1, r0, #127
    ldr r2, LCPI1_0
    ldr r2, [r2]
    ldr r1, [r2, +r1, lsl #2]
    tst r0, #128
    movne r1, r1, lsr #2
    ldr r0, LCPI1_1
    and r0, r1, r0
    bx lr

It saves an instruction and a register.

//===---------------------------------------------------------------------===//

It might be profitable to cse MOVi16 if there are lots of 32-bit immediates
with the same bottom half.
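
An illustrative case (constants chosen by us): both values below share the
bottom half 0x5678, so when their live ranges do not overlap, a single
movw #0x5678 could feed two different movt's.

```c
/* 0x00015678 and 0xABCD5678 differ only in their top 16 bits; today each
   gets its own movw+movt pair, but the movw could be cse'd. */
unsigned pick_constant(int want_high) {
    return want_high ? 0xABCD5678u : 0x00015678u;
}
```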

//===---------------------------------------------------------------------===//

Robert Muth started working on an alternate jump table implementation that
does not put the tables in-line in the text. This is more like the llvm
default jump table implementation. This might be useful sometime. Several
revisions of patches are on the mailing list, beginning at:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2009-June/022763.html

//===---------------------------------------------------------------------===//

Make use of the "rbit" instruction.
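
RBIT reverses the 32 bits of a register in one instruction. The standard
portable swap-based bit reversal below is the kind of code that could collapse
to a single rbit (our sketch, not a pattern the backend currently matches):

```c
#include <stdint.h>

/* Reverse the bits of a 32-bit word by swapping progressively larger
   groups: single bits, pairs, nibbles, bytes, then halfwords. */
uint32_t reverse_bits(uint32_t x) {
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
    x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
    return (x << 16) | (x >> 16);
}
```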

//===---------------------------------------------------------------------===//

Take a look at test/CodeGen/Thumb2/machine-licm.ll. ARM should be taught how
to licm and cse the unnecessary load from cp#1.

//===---------------------------------------------------------------------===//

The CMN instruction sets the flags like an ADD instruction, while CMP sets
them like a subtract. Therefore to be able to use CMN for comparisons other
than the Z bit, we'll need additional logic to reverse the conditionals
associated with the comparison. Perhaps a pseudo-instruction for the
comparison, with a post-codegen pass to clean up and handle the condition
codes? See PR5694 for testcase.
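
A small example of where CMN already applies (equality only; our own
illustration): comparing against a negative immediate can use cmn with the
negated value instead of materializing the constant.

```c
/* x == -1 can be tested with "cmn r0, #1" (flags of x + 1) and the Z bit.
   Orderings like x < -1 would need the condition-code remapping described
   above, since CMN's flags are those of an add, not a subtract. */
int is_minus_one(int x) {
    return x == -1;
}
```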

//===---------------------------------------------------------------------===//

Given the following on armv5:
int test1(int A, int B) {
  return (A&-8388481)|(B&8388480);
}

We currently generate:
    ldr r2, .LCPI0_0
    and r0, r0, r2
    ldr r2, .LCPI0_1
    and r1, r1, r2
    orr r0, r1, r0
    bx lr

We should be able to replace the second ldr+and with a bic (i.e. reuse the
constant which was already loaded). Not sure what's necessary to do that.

//===---------------------------------------------------------------------===//

The code generated for bswap on armv4/5 (CPUs without rev) is less than ideal:

int a(int x) { return __builtin_bswap32(x); }

a:
    mov r1, #255, 24
    mov r2, #255, 16
    and r1, r1, r0, lsr #8
    and r2, r2, r0, lsl #8
    orr r1, r1, r0, lsr #24
    orr r0, r2, r0, lsl #24
    orr r0, r0, r1
    bx lr

Something like the following would be better (fewer instructions/registers):
    eor r1, r0, r0, ror #16
    bic r1, r1, #0xff0000
    mov r1, r1, lsr #8
    eor r0, r1, r0, ror #8
    bx lr

A custom Thumb version would also be a slight improvement over the generic
version.

//===---------------------------------------------------------------------===//

Consider the following simple C code:

void foo(unsigned char *a, unsigned char *b, int *c) {
  if ((*a | *b) == 0) *c = 0;
}

currently llvm-gcc generates something like this (nice branchless code I'd
say):

    ldrb r0, [r0]
    ldrb r1, [r1]
    orr r0, r1, r0
    tst r0, #255
    moveq r0, #0
    streq r0, [r2]
    bx lr

Note that both "tst" and "moveq" are redundant.

//===---------------------------------------------------------------------===//

When loading immediate constants with movt/movw, if there are multiple
constants needed with the same low 16 bits, and those values are not live at
the same time, it would be possible to use a single movw instruction, followed
by multiple movt instructions to rewrite the high bits to different values.
For example:

  volatile store i32 -1, i32* inttoptr (i32 1342210076 to i32*), align 4, !tbaa !0
  volatile store i32 -1, i32* inttoptr (i32 1342341148 to i32*), align 4, !tbaa !0

is compiled and optimized to:

    movw r0, #32796
    mov.w r1, #-1
    movt r0, #20480
    str r1, [r0]
    movw r0, #32796  @ <= this MOVW is not needed, value is there already
    movt r0, #20482
    str r1, [r0]

//===---------------------------------------------------------------------===//

Improve codegen for selects:
if (x != 0) x = 1
if (x == 1) x = 1

ARM codegen used to look like this:
    mov r1, r0
    cmp r1, #1
    mov r0, #0
    moveq r0, #1

The naive lowering selects between two different values. It should recognize
that the test is an equality test, so this is really a conditional move rather
than a select:
    cmp r0, #1
    movne r0, #0

Currently this is an ARM-specific dag combine. We probably should make it into
a target-neutral one.

//===---------------------------------------------------------------------===//

Optimize unnecessary checks for zero with __builtin_clz/ctz. Those builtins
are specified to be undefined at zero, so portable code must check for zero
and handle it as a special case. That is unnecessary on ARM where those
operations are implemented in a way that is well-defined for zero. For
example:

int f(int x) { return x ? __builtin_clz(x) : sizeof(int)*8; }

should just be implemented with a CLZ instruction. Since there are other
targets, e.g., PPC, that share this behavior, it would be best to implement
this in a target-independent way: we should probably fold that (when using
"undefined at zero" semantics) to set the "defined at zero" bit and have
the code generator expand out the right code.

//===---------------------------------------------------------------------===//

Clean up the test/MC/ARM files to have more robust register choices.

R0 should not be used as a register operand in the assembler tests as it's
then not possible to distinguish between a correct encoding and a missing
operand encoding, as zero is the default value for the binary encoder.
e.g.,
    add r0, r0  // bad
    add r3, r5  // good

Register operands should be distinct. That is, when the encoding does not
require two syntactical operands to refer to the same register, two different
registers should be used in the test so as to catch errors where the
operands are swapped in the encoding.
e.g.,
    subs.w r1, r1, r1  // bad
    subs.w r1, r2, r3  // good