Blame - lib/Target/ARM/README.txt - fp2-dev/platform/external/llvm

blob: a6f26a5dfe1954d7f3fc66158391aa10959d0fd3 [file] [log] [blame]

Rafael Espindola	f4d4005	2006-08-22 12:22:46 +0000	[diff] [blame]	1	//===---------------------------------------------------------------------===//
				2	// Random ideas for the ARM backend.
				3	//===---------------------------------------------------------------------===//
				4
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	5	Reimplement 'select' in terms of 'SEL'.
Rafael Espindola	f4d4005	2006-08-22 12:22:46 +0000	[diff] [blame]	6
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	7	* We would really like to support UXTAB16, but we need to prove that the
				8	add doesn't need to overflow between the two 16-bit chunks.
Rafael Espindola	7564549	2006-09-22 11:36:17 +0000	[diff] [blame]	9
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	10	* Implement pre/post increment support. (e.g. PR935)
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	11	* Implement smarter constant generation for binops with large immediates.
Rafael Espindola	7564549	2006-09-22 11:36:17 +0000	[diff] [blame]	12
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	13	//===---------------------------------------------------------------------===//
Rafael Espindola	7564549	2006-09-22 11:36:17 +0000	[diff] [blame]	14
Chris Lattner	93305bc	2007-04-20 20:18:43 +0000	[diff] [blame]	15	Crazy idea: Consider code that uses lots of 8-bit or 16-bit values. By the
				16	time regalloc happens, these values are now in a 32-bit register, usually with
				17	the top-bits known to be sign or zero extended. If spilled, we should be able
				18	to spill these to a 8-bit or 16-bit stack slot, zero or sign extending as part
				19	of the reload.
				20
				21	Doing this reduces the size of the stack frame (important for thumb etc), and
				22	also increases the likelihood that we will be able to reload multiple values
				23	from the stack with a single load.
				24
				25	//===---------------------------------------------------------------------===//
				26
Dale Johannesen	f1b214d	2007-02-28 18:41:23 +0000	[diff] [blame]	27	The constant island pass is in good shape. Some cleanups might be desirable,
				28	but there is unlikely to be much improvement in the generated code.
Rafael Espindola	7564549	2006-09-22 11:36:17 +0000	[diff] [blame]	29
Dale Johannesen	f1b214d	2007-02-28 18:41:23 +0000	[diff] [blame]	30	1. There may be some advantage to trying to be smarter about the initial
Dale Johannesen	88e37ae	2007-02-23 05:02:36 +0000	[diff] [blame]	31	placement, rather than putting everything at the end.
Rafael Espindola	7564549	2006-09-22 11:36:17 +0000	[diff] [blame]	32
Dale Johannesen	9118dbc	2007-04-30 00:32:06 +0000	[diff] [blame]	33	2. There might be some compile-time efficiency to be had by representing
Dale Johannesen	f1b214d	2007-02-28 18:41:23 +0000	[diff] [blame]	34	consecutive islands as a single block rather than multiple blocks.
				35
Dale Johannesen	9118dbc	2007-04-30 00:32:06 +0000	[diff] [blame]	36	3. Use a priority queue to sort constant pool users in inverse order of
Evan Cheng	1a9da0d	2007-03-09 19:46:06 +0000	[diff] [blame]	37	position so we always process the one closed to the end of functions
				38	first. This may simply CreateNewWater.
				39
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	40	//===---------------------------------------------------------------------===//
Rafael Espindola	7564549	2006-09-22 11:36:17 +0000	[diff] [blame]	41
Evan Cheng	c608ff2	2007-07-10 21:49:47 +0000	[diff] [blame]	42	Eliminate copysign custom expansion. We are still generating crappy code with
				43	default expansion + if-conversion.
Rafael Espindola	7564549	2006-09-22 11:36:17 +0000	[diff] [blame]	44
Evan Cheng	c608ff2	2007-07-10 21:49:47 +0000	[diff] [blame]	45	//===---------------------------------------------------------------------===//
Rafael Espindola	4dfab98	2006-12-11 23:56:10 +0000	[diff] [blame]	46
Evan Cheng	c608ff2	2007-07-10 21:49:47 +0000	[diff] [blame]	47	Eliminate one instruction from:
Chris Lattner	2d1222c	2007-02-02 04:36:46 +0000	[diff] [blame]	48
				49	define i32 @_Z6slow4bii(i32 %x, i32 %y) {
				50	%tmp = icmp sgt i32 %x, %y
				51	%retval = select i1 %tmp, i32 %x, i32 %y
				52	ret i32 %retval
				53	}
				54
				55	__Z6slow4bii:
				56	cmp r0, r1
				57	movgt r1, r0
				58	mov r0, r1
				59	bx lr
Evan Cheng	c608ff2	2007-07-10 21:49:47 +0000	[diff] [blame]	60	=>
				61
				62	__Z6slow4bii:
				63	cmp r0, r1
				64	movle r0, r1
				65	bx lr
Chris Lattner	2d1222c	2007-02-02 04:36:46 +0000	[diff] [blame]	66
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	67	//===---------------------------------------------------------------------===//
Rafael Espindola	4dfab98	2006-12-11 23:56:10 +0000	[diff] [blame]	68
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	69	Implement long long "X-3" with instructions that fold the immediate in. These
				70	were disabled due to badness with the ARM carry flag on subtracts.
Rafael Espindola	4dfab98	2006-12-11 23:56:10 +0000	[diff] [blame]	71
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	72	//===---------------------------------------------------------------------===//
Rafael Espindola	4dfab98	2006-12-11 23:56:10 +0000	[diff] [blame]	73
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	74	We currently compile abs:
				75	int foo(int p) { return p < 0 ? -p : p; }
Rafael Espindola	4dfab98	2006-12-11 23:56:10 +0000	[diff] [blame]	76
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	77	into:
Rafael Espindola	cd71da5	2006-10-03 17:27:58 +0000	[diff] [blame]	78
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	79	_foo:
				80	rsb r1, r0, #0
				81	cmn r0, #1
				82	movgt r1, r0
				83	mov r0, r1
				84	bx lr
Rafael Espindola	cd71da5	2006-10-03 17:27:58 +0000	[diff] [blame]	85
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	86	This is very, uh, literal. This could be a 3 operation sequence:
				87	t = (p sra 31);
				88	res = (p xor t)-t
Rafael Espindola	5af3a68	2006-10-09 14:18:33 +0000	[diff] [blame]	89
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	90	Which would be better. This occurs in png decode.
Rafael Espindola	5af3a68	2006-10-09 14:18:33 +0000	[diff] [blame]	91
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	92	//===---------------------------------------------------------------------===//
				93
				94	More load / store optimizations:
Evan Cheng	b604b2c	2009-06-26 00:25:27 +0000	[diff] [blame]	95	1) Better representation for block transfer? This is from Olden/power:
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	96
				97	fldd d0, [r4]
				98	fstd d0, [r4, #+32]
				99	fldd d0, [r4, #+8]
				100	fstd d0, [r4, #+40]
				101	fldd d0, [r4, #+16]
				102	fstd d0, [r4, #+48]
				103	fldd d0, [r4, #+24]
				104	fstd d0, [r4, #+56]
				105
				106	If we can spare the registers, it would be better to use fldm and fstm here.
				107	Need major register allocator enhancement though.
				108
Evan Cheng	b604b2c	2009-06-26 00:25:27 +0000	[diff] [blame]	109	2) Can we recognize the relative position of constantpool entries? i.e. Treat
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	110
				111	ldr r0, LCPI17_3
				112	ldr r1, LCPI17_4
				113	ldr r2, LCPI17_5
				114
				115	as
				116	ldr r0, LCPI17
				117	ldr r1, LCPI17+4
				118	ldr r2, LCPI17+8
				119
				120	Then the ldr's can be combined into a single ldm. See Olden/power.
				121
				122	Note for ARM v4 gcc uses ldmia to load a pair of 32-bit values to represent a
				123	double 64-bit FP constant:
				124
				125	adr r0, L6
				126	ldmia r0, {r0-r1}
				127
				128	.align 2
				129	L6:
				130	.long -858993459
				131	.long 1074318540
				132
Evan Cheng	b604b2c	2009-06-26 00:25:27 +0000	[diff] [blame]	133	3) struct copies appear to be done field by field
Dale Johannesen	a6bc6fc	2007-03-09 17:58:17 +0000	[diff] [blame]	134	instead of by words, at least sometimes:
				135
				136	struct foo { int x; short s; char c1; char c2; };
				137	void cpy(struct fooa, struct foob) { a = b; }
				138
				139	llvm code (-O2)
				140	ldrb r3, [r1, #+6]
				141	ldr r2, [r1]
				142	ldrb r12, [r1, #+7]
				143	ldrh r1, [r1, #+4]
				144	str r2, [r0]
				145	strh r1, [r0, #+4]
				146	strb r3, [r0, #+6]
				147	strb r12, [r0, #+7]
				148	gcc code (-O2)
				149	ldmia r1, {r1-r2}
				150	stmia r0, {r1-r2}
				151
				152	In this benchmark poor handling of aggregate copies has shown up as
				153	having a large effect on size, and possibly speed as well (we don't have
				154	a good way to measure on ARM).
				155
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	156	//===---------------------------------------------------------------------===//
				157
				158	* Consider this silly example:
				159
				160	double bar(double x) {
				161	double r = foo(3.1);
				162	return x+r;
				163	}
				164
				165	_bar:
Chris Lattner	a3f61df	2007-11-27 22:41:52 +0000	[diff] [blame]	166	stmfd sp!, {r4, r5, r7, lr}
				167	add r7, sp, #8
				168	mov r4, r0
				169	mov r5, r1
				170	fldd d0, LCPI1_0
				171	fmrrd r0, r1, d0
				172	bl _foo
				173	fmdrr d0, r4, r5
				174	fmsr s2, r0
				175	fsitod d1, s2
				176	faddd d0, d1, d0
				177	fmrrd r0, r1, d0
				178	ldmfd sp!, {r4, r5, r7, pc}
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	179
				180	Ignore the prologue and epilogue stuff for a second. Note
				181	mov r4, r0
				182	mov r5, r1
				183	the copys to callee-save registers and the fact they are only being used by the
				184	fmdrr instruction. It would have been better had the fmdrr been scheduled
				185	before the call and place the result in a callee-save DPR register. The two
				186	mov ops would not have been necessary.
				187
				188	//===---------------------------------------------------------------------===//
				189
				190	Calling convention related stuff:
				191
				192	* gcc's parameter passing implementation is terrible and we suffer as a result:
				193
				194	e.g.
				195	struct s {
				196	double d1;
				197	int s1;
				198	};
				199
				200	void foo(struct s S) {
				201	printf("%g, %d\n", S.d1, S.s1);
				202	}
				203
				204	'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and
				205	then reload them to r1, r2, and r3 before issuing the call (r0 contains the
				206	address of the format string):
				207
				208	stmfd sp!, {r7, lr}
				209	add r7, sp, #0
				210	sub sp, sp, #12
				211	stmia sp, {r0, r1, r2}
				212	ldmia sp, {r1-r2}
				213	ldr r0, L5
				214	ldr r3, [sp, #8]
				215	L2:
				216	add r0, pc, r0
				217	bl L_printf$stub
				218
				219	Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves?
				220
				221	* Return an aggregate type is even worse:
				222
				223	e.g.
				224	struct s foo(void) {
				225	struct s S = {1.1, 2};
				226	return S;
				227	}
				228
				229	mov ip, r0
				230	ldr r0, L5
				231	sub sp, sp, #12
				232	L2:
				233	add r0, pc, r0
				234	@ lr needed for prologue
				235	ldmia r0, {r0, r1, r2}
				236	stmia sp, {r0, r1, r2}
				237	stmia ip, {r0, r1, r2}
				238	mov r0, ip
				239	add sp, sp, #12
				240	bx lr
				241
				242	r0 (and later ip) is the hidden parameter from caller to store the value in. The
				243	first ldmia loads the constants into r0, r1, r2. The last stmia stores r0, r1,
				244	r2 into the address passed in. However, there is one additional stmia that
				245	stores r0, r1, and r2 to some stack location. The store is dead.
				246
				247	The llvm-gcc generated code looks like this:
				248
				249	csretcc void %foo(%struct.s* %agg.result) {
Rafael Espindola	5af3a68	2006-10-09 14:18:33 +0000	[diff] [blame]	250	entry:
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	251	%S = alloca %struct.s, align 4 ; <%struct.s*> [#uses=1]
				252	%memtmp = alloca %struct.s ; <%struct.s*> [#uses=1]
				253	cast %struct.s* %S to sbyte* ; <sbyte*>:0 [#uses=2]
				254	call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
				255	cast %struct.s* %agg.result to sbyte* ; <sbyte*>:1 [#uses=2]
				256	call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
				257	cast %struct.s* %memtmp to sbyte* ; <sbyte*>:2 [#uses=1]
				258	call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
Rafael Espindola	5af3a68	2006-10-09 14:18:33 +0000	[diff] [blame]	259	ret void
				260	}
				261
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	262	llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from
				263	constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated
				264	into a number of load and stores, or 2) custom lower memcpy (of small size) to
				265	be ldmia / stmia. I think option 2 is better but the current register
				266	allocator cannot allocate a chunk of registers at a time.
Rafael Espindola	5af3a68	2006-10-09 14:18:33 +0000	[diff] [blame]	267
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	268	A feasible temporary solution is to use specific physical registers at the
				269	lowering time for small (<= 4 words?) transfer size.
Rafael Espindola	5af3a68	2006-10-09 14:18:33 +0000	[diff] [blame]	270
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	271	* ARM CSRet calling convention requires the hidden argument to be returned by
				272	the callee.
Rafael Espindola	bec2e38	2006-10-16 16:33:29 +0000	[diff] [blame]	273
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	274	//===---------------------------------------------------------------------===//
Rafael Espindola	bec2e38	2006-10-16 16:33:29 +0000	[diff] [blame]	275
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	276	We can definitely do a better job on BB placements to eliminate some branches.
				277	It's very common to see llvm generated assembly code that looks like this:
Rafael Espindola	82c678b	2006-10-16 17:17:22 +0000	[diff] [blame]	278
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	279	LBB3:
				280	...
				281	LBB4:
				282	...
				283	beq LBB3
				284	b LBB2
Rafael Espindola	82c678b	2006-10-16 17:17:22 +0000	[diff] [blame]	285
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	286	If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can
				287	then eliminate beq and and turn the unconditional branch to LBB2 to a bne.
				288
				289	See McCat/18-imp/ComputeBoundingBoxes for an example.
				290
				291	//===---------------------------------------------------------------------===//
				292
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	293	Pre-/post- indexed load / stores:
				294
				295	1) We should not make the pre/post- indexed load/store transform if the base ptr
				296	is guaranteed to be live beyond the load/store. This can happen if the base
				297	ptr is live out of the block we are performing the optimization. e.g.
				298
				299	mov r1, r2
				300	ldr r3, [r1], #4
				301	...
				302
				303	vs.
				304
				305	ldr r3, [r2]
				306	add r1, r2, #4
				307	...
				308
				309	In most cases, this is just a wasted optimization. However, sometimes it can
				310	negatively impact the performance because two-address code is more restrictive
				311	when it comes to scheduling.
				312
				313	Unfortunately, liveout information is currently unavailable during DAG combine
				314	time.
				315
				316	2) Consider spliting a indexed load / store into a pair of add/sub + load/store
				317	to solve #1 (in TwoAddressInstructionPass.cpp).
				318
				319	3) Enhance LSR to generate more opportunities for indexed ops.
				320
				321	4) Once we added support for multiple result patterns, write indexed loads
				322	patterns instead of C++ instruction selection code.
				323
Jim Grosbach	e516549	2009-11-09 00:11:35 +0000	[diff] [blame]	324	5) Use VLDM / VSTM to emulate indexed FP load / store.
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	325
				326	//===---------------------------------------------------------------------===//
				327
Evan Cheng	a8e2989	2007-01-19 07:51:42 +0000	[diff] [blame]	328	Implement support for some more tricky ways to materialize immediates. For
				329	example, to get 0xffff8000, we can use:
				330
				331	mov r9, #&3f8000
				332	sub r9, r9, #&400000
				333
				334	//===---------------------------------------------------------------------===//
				335
				336	We sometimes generate multiple add / sub instructions to update sp in prologue
				337	and epilogue if the inc / dec value is too large to fit in a single immediate
				338	operand. In some cases, perhaps it might be better to load the value from a
				339	constantpool instead.
				340
				341	//===---------------------------------------------------------------------===//
				342
				343	GCC generates significantly better code for this function.
				344
				345	int foo(int StackPtr, unsigned char Line, unsigned char Stack, int LineLen) {
				346	int i = 0;
				347
				348	if (StackPtr != 0) {
				349	while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768)))
				350	Line[i++] = Stack[--StackPtr];
				351	if (LineLen > 32768)
				352	{
				353	while (StackPtr != 0 && i < LineLen)
				354	{
				355	i++;
				356	--StackPtr;
				357	}
				358	}
				359	}
				360	return StackPtr;
				361	}
				362
				363	//===---------------------------------------------------------------------===//
				364
				365	This should compile to the mlas instruction:
				366	int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }
				367
				368	//===---------------------------------------------------------------------===//
				369
				370	At some point, we should triage these to see if they still apply to us:
				371
				372	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
				373	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
				374	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016
				375
				376	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
				377	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
				378	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
				379	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
				380	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
				381	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
				382	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982
				383
				384	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
				385	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
				386	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
				387	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
				388	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
				389	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
				390	http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663
				391
				392	http://www.inf.u-szeged.hu/gcc-arm/
				393	http://citeseer.ist.psu.edu/debus04linktime.html
				394
				395	//===---------------------------------------------------------------------===//
Evan Cheng	2265b49	2007-03-09 19:34:51 +0000	[diff] [blame]	396
Dale Johannesen	818c085	2007-03-09 19:18:59 +0000	[diff] [blame]	397	gcc generates smaller code for this function at -O2 or -Os:
Dale Johannesen	a6bc6fc	2007-03-09 17:58:17 +0000	[diff] [blame]	398
				399	void foo(signed char* p) {
				400	if (*p == 3)
				401	bar();
				402	else if (*p == 4)
				403	baz();
				404	else if (*p == 5)
				405	quux();
				406	}
				407
				408	llvm decides it's a good idea to turn the repeated if...else into a
				409	binary tree, as if it were a switch; the resulting code requires -1
				410	compare-and-branches when p<=2 or p==5, the same number if *p==4
				411	or p>6, and +1 if p==3. So it should be a speed win
				412	(on balance). However, the revised code is larger, with 4 conditional
				413	branches instead of 3.
				414
				415	More seriously, there is a byte->word extend before
				416	each comparison, where there should be only one, and the condition codes
				417	are not remembered when the same two values are compared twice.
				418
Evan Cheng	2265b49	2007-03-09 19:34:51 +0000	[diff] [blame]	419	//===---------------------------------------------------------------------===//
				420
Evan Cheng	a125cbe	2007-03-20 22:32:39 +0000	[diff] [blame]	421	More LSR enhancements possible:
				422
				423	1. Teach LSR about pre- and post- indexed ops to allow iv increment be merged
				424	in a load / store.
				425	2. Allow iv reuse even when a type conversion is required. For example, i8
				426	and i32 load / store addressing modes are identical.
Chris Lattner	3c30d10	2007-04-17 18:03:00 +0000	[diff] [blame]	427
				428
				429	//===---------------------------------------------------------------------===//
				430
				431	This:
				432
				433	int foo(int a, int b, int c, int d) {
				434	long long acc = (long long)a * (long long)b;
				435	acc += (long long)c * (long long)d;
				436	return (int)(acc >> 32);
				437	}
				438
				439	Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies
				440	two signed 32-bit values to produce a 64-bit value, and accumulates this with
				441	a 64-bit value.
				442
Chris Lattner	a3f61df	2007-11-27 22:41:52 +0000	[diff] [blame]	443	We currently get this with both v4 and v6:
Chris Lattner	3c30d10	2007-04-17 18:03:00 +0000	[diff] [blame]	444
				445	_foo:
Chris Lattner	a3f61df	2007-11-27 22:41:52 +0000	[diff] [blame]	446	smull r1, r0, r1, r0
				447	smull r3, r2, r3, r2
				448	adds r3, r3, r1
				449	adc r0, r2, r0
Chris Lattner	3c30d10	2007-04-17 18:03:00 +0000	[diff] [blame]	450	bx lr
				451
Chris Lattner	3c30d10	2007-04-17 18:03:00 +0000	[diff] [blame]	452	//===---------------------------------------------------------------------===//
Chris Lattner	bf8ae84	2007-09-10 21:43:18 +0000	[diff] [blame]	453
				454	This:
				455	#include <algorithm>
				456	std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
				457	{ return std::make_pair(a + b, a + b < a); }
				458	bool no_overflow(unsigned a, unsigned b)
				459	{ return !full_add(a, b).second; }
				460
				461	Should compile to:
				462
				463	_Z8full_addjj:
				464	adds r2, r1, r2
				465	movcc r1, #0
				466	movcs r1, #1
				467	str r2, [r0, #0]
				468	strb r1, [r0, #4]
				469	mov pc, lr
				470
				471	_Z11no_overflowjj:
				472	cmn r0, r1
				473	movcs r0, #0
				474	movcc r0, #1
				475	mov pc, lr
				476
				477	not:
				478
				479	__Z8full_addjj:
				480	add r3, r2, r1
				481	str r3, [r0]
				482	mov r2, #1
				483	mov r12, #0
				484	cmp r3, r1
				485	movlo r12, r2
				486	str r12, [r0, #+4]
				487	bx lr
				488	__Z11no_overflowjj:
				489	add r3, r1, r0
				490	mov r2, #1
				491	mov r1, #0
				492	cmp r3, r0
				493	movhs r1, r2
				494	mov r0, r1
				495	bx lr
				496
				497	//===---------------------------------------------------------------------===//
Chris Lattner	3a7c33a	2007-10-19 03:29:26 +0000	[diff] [blame]	498
Bob Wilson	5bafff3	2009-06-22 23:27:02 +0000	[diff] [blame]	499	Some of the NEON intrinsics may be appropriate for more general use, either
				500	as target-independent intrinsics or perhaps elsewhere in the ARM backend.
				501	Some of them may also be lowered to target-independent SDNodes, and perhaps
				502	some new SDNodes could be added.
				503
				504	For example, maximum, minimum, and absolute value operations are well-defined
				505	and standard operations, both for vector and scalar types.
				506
				507	The current NEON-specific intrinsics for count leading zeros and count one
				508	bits could perhaps be replaced by the target-independent ctlz and ctpop
				509	intrinsics. It may also make sense to add a target-independent "ctls"
				510	intrinsic for "count leading sign bits". Likewise, the backend could use
				511	the target-independent SDNodes for these operations.
				512
				513	ARMv6 has scalar saturating and halving adds and subtracts. The same
				514	intrinsics could possibly be used for both NEON's vector implementations of
				515	those operations and the ARMv6 scalar versions.
				516
				517	//===---------------------------------------------------------------------===//
				518
Evan Cheng	151b9af	2009-06-26 00:28:48 +0000	[diff] [blame]	519	ARM::MOVCCr is commutable (by flipping the condition). But we need to implement
				520	ARMInstrInfo::commuteInstruction() to support it.
Evan Cheng	055b031	2009-06-29 07:51:04 +0000	[diff] [blame]	521
				522	//===---------------------------------------------------------------------===//
				523
				524	Split out LDR (literal) from normal ARM LDR instruction. Also consider spliting
				525	LDR into imm12 and so_reg forms. This allows us to clean up some code. e.g.
				526	ARMLoadStoreOptimizer does not need to look at LDR (literal) and LDR (so_reg)
				527	while ARMConstantIslandPass only need to worry about LDR (literal).
Evan Cheng	064a6ea	2009-07-22 00:58:27 +0000	[diff] [blame]	528
				529	//===---------------------------------------------------------------------===//
				530
Evan Cheng	40efc25	2009-07-24 19:31:03 +0000	[diff] [blame]	531	Constant island pass should make use of full range SoImm values for LEApcrel.
				532	Be careful though as the last attempt caused infinite looping on lencod.
Chris Lattner	5135039	2009-07-30 16:08:58 +0000	[diff] [blame]	533
				534	//===---------------------------------------------------------------------===//
				535
				536	Predication issue. This function:
				537
				538	extern unsigned array[ 128 ];
				539	int foo( int x ) {
				540	int y;
				541	y = array[ x & 127 ];
				542	if ( x & 128 )
				543	y = 123456789 & ( y >> 2 );
				544	else
				545	y = 123456789 & y;
				546	return y;
				547	}
				548
				549	compiles to:
				550
				551	_foo:
				552	and r1, r0, #127
				553	ldr r2, LCPI1_0
				554	ldr r2, [r2]
				555	ldr r1, [r2, +r1, lsl #2]
				556	mov r2, r1, lsr #2
				557	tst r0, #128
				558	moveq r2, r1
				559	ldr r0, LCPI1_1
				560	and r0, r2, r0
				561	bx lr
				562
				563	It would be better to do something like this, to fold the shift into the
				564	conditional move:
				565
				566	and r1, r0, #127
				567	ldr r2, LCPI1_0
				568	ldr r2, [r2]
				569	ldr r1, [r2, +r1, lsl #2]
				570	tst r0, #128
				571	movne r1, r1, lsr #2
				572	ldr r0, LCPI1_1
				573	and r0, r1, r0
				574	bx lr
				575
				576	it saves an instruction and a register.
				577
				578	//===---------------------------------------------------------------------===//
Evan Cheng	5adb66a	2009-09-28 09:14:39 +0000	[diff] [blame]	579
Evan Cheng	5adb66a	2009-09-28 09:14:39 +0000	[diff] [blame]	580	It might be profitable to cse MOVi16 if there are lots of 32-bit immediates
				581	with the same bottom half.
Bob Wilson	57f224a	2009-10-30 22:22:46 +0000	[diff] [blame]	582
				583	//===---------------------------------------------------------------------===//
				584
				585	Robert Muth started working on an alternate jump table implementation that
				586	does not put the tables in-line in the text. This is more like the llvm
				587	default jump table implementation. This might be useful sometime. Several
				588	revisions of patches are on the mailing list, beginning at:
				589	http://lists.cs.uiuc.edu/pipermail/llvmdev/2009-June/022763.html
				590
Evan Cheng	e3b88fc	2009-11-02 07:51:19 +0000	[diff] [blame]	591	//===---------------------------------------------------------------------===//
				592
				593	Make use of the "rbit" instruction.
Evan Cheng	d457e6e	2009-11-07 04:04:34 +0000	[diff] [blame]	594
				595	//===---------------------------------------------------------------------===//
				596
				597	Take a look at test/CodeGen/Thumb2/machine-licm.ll. ARM should be taught how
				598	to licm and cse the unnecessary load from cp#1.
Jim Grosbach	d5d2bae	2010-01-22 00:08:13 +0000	[diff] [blame^]	599
				600	//===---------------------------------------------------------------------===//
				601
				602	The CMN instruction sets the flags like an ADD instruction, while CMP sets
				603	them like a subtract. Therefore to be able to use CMN for comparisons other
				604	than the Z bit, we'll need additional logic to reverse the conditionals
				605	associated with the comparison. Perhaps a pseudo-instruction for the comparison,
				606	with a post-codegen pass to clean up and handle the condition codes?
				607	See PR5694 for testcase.