Blame - lib/Target/ARM/README.txt - fp2-dev/platform/external/llvm

//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend.
//===---------------------------------------------------------------------===//

Reimplement 'select' in terms of 'SEL'.

* We would really like to support UXTAB16, but we need to prove that the
  add does not overflow between the two 16-bit chunks.

* Implement pre/post increment support. (e.g. PR935)
* Coalesce stack slots!
* Implement smarter constant generation for binops with large immediates.

* Consider materializing FP constants like 0.0f and 1.0f using integer
  immediate instructions and then copying them to the FPU. Is that slower
  than a load into the FPU?

//===---------------------------------------------------------------------===//

Crazy idea: Consider code that uses lots of 8-bit or 16-bit values. By the
time regalloc happens, these values are now in a 32-bit register, usually with
the top bits known to be sign or zero extended. If spilled, we should be able
to spill these to an 8-bit or 16-bit stack slot, zero or sign extending as part
of the reload.

Doing this reduces the size of the stack frame (important for thumb etc), and
also increases the likelihood that we will be able to reload multiple values
from the stack with a single load.
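
The reason a narrow slot is safe can be modeled in plain C. This is only a
sketch of the invariant, not backend code: the function names are hypothetical,
and the strb/ldrsb/strh/ldrh pairing in the comments is an assumption about
which instructions a spiller would pick.

```c
#include <stdint.h>

/* Hypothetical model: the register is known to hold a sign-extended 8-bit
   value, so only the low byte needs to reach the stack. */
static int32_t spill_reload_s8(int32_t reg) {
    uint8_t slot = (uint8_t)((uint32_t)reg & 0xFFu); /* spill: like strb  */
    return (int32_t)(slot ^ 0x80u) - 0x80;           /* reload: like ldrsb */
}

/* Hypothetical model: the register is known to hold a zero-extended 16-bit
   value, so a halfword slot is enough. */
static uint32_t spill_reload_u16(uint32_t reg) {
    uint16_t slot = (uint16_t)(reg & 0xFFFFu);       /* spill: like strh  */
    return (uint32_t)slot;                           /* reload: like ldrh */
}
```

Both round trips are lossless only because the top bits were already known to
be an extension of the narrow value; a register with arbitrary top bits would
still need the full 32-bit slot.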

//===---------------------------------------------------------------------===//

The constant island pass is in good shape. Some cleanups might be desirable,
but there is unlikely to be much improvement in the generated code.

1. There may be some advantage to trying to be smarter about the initial
   placement, rather than putting everything at the end.

2. There might be some compile-time efficiency to be had by representing
   consecutive islands as a single block rather than multiple blocks.

3. Use a priority queue to sort constant pool users in inverse order of
   position so we always process the one closest to the end of the function
   first. This may simplify CreateNewWater.

//===---------------------------------------------------------------------===//

Eliminate copysign custom expansion. We are still generating crappy code with
default expansion + if-conversion.

//===---------------------------------------------------------------------===//

Eliminate one instruction from:

define i32 @_Z6slow4bii(i32 %x, i32 %y) {
        %tmp = icmp sgt i32 %x, %y
        %retval = select i1 %tmp, i32 %x, i32 %y
        ret i32 %retval
}

__Z6slow4bii:
        cmp r0, r1
        movgt r1, r0
        mov r0, r1
        bx lr
=>

__Z6slow4bii:
        cmp r0, r1
        movle r0, r1
        bx lr

//===---------------------------------------------------------------------===//

Implement long long "X-3" with instructions that fold the immediate in. These
were disabled due to badness with the ARM carry flag on subtracts.

//===---------------------------------------------------------------------===//

We currently compile abs:
int foo(int p) { return p < 0 ? -p : p; }

into:

_foo:
        rsb r1, r0, #0
        cmn r0, #1
        movgt r1, r0
        mov r0, r1
        bx lr

This is very, uh, literal. This could be a 3 operation sequence:
  t = (p sra 31);
  res = (p xor t)-t

Which would be better. This occurs in png decode.
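
The three-operation sequence can be checked with a small C model. This is a
sketch under stated assumptions: the function name is hypothetical, the sign
mask is built with a comparison because right-shifting a negative int is
implementation-defined in C, and the result is returned as unsigned so that the
most negative input stays well-defined.

```c
#include <stdint.h>

/* Hypothetical model of: t = (p sra 31); res = (p xor t) - t */
static uint32_t abs_via_sign_mask(int32_t p) {
    uint32_t u = (uint32_t)p;
    /* The value (p sra 31) would produce: all ones if p < 0, else zero. */
    uint32_t t = (p < 0) ? 0xFFFFFFFFu : 0u;
    /* For p < 0 this is (~p) + 1, the two's complement negation.
       For p >= 0 it is (p ^ 0) - 0, i.e. p unchanged. */
    return (u ^ t) - t;
}
```

On ARM the xor and subtract take the shifted operand directly, so no
conditional move or extra copy is needed.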

//===---------------------------------------------------------------------===//

More load / store optimizations:
1) Better representation for block transfer? This is from Olden/power:

        fldd d0, [r4]
        fstd d0, [r4, #+32]
        fldd d0, [r4, #+8]
        fstd d0, [r4, #+40]
        fldd d0, [r4, #+16]
        fstd d0, [r4, #+48]
        fldd d0, [r4, #+24]
        fstd d0, [r4, #+56]

If we can spare the registers, it would be better to use fldm and fstm here.
Need major register allocator enhancement though.

2) Can we recognize the relative position of constantpool entries? i.e. Treat

        ldr r0, LCPI17_3
        ldr r1, LCPI17_4
        ldr r2, LCPI17_5

   as
        ldr r0, LCPI17
        ldr r1, LCPI17+4
        ldr r2, LCPI17+8

   Then the ldr's can be combined into a single ldm. See Olden/power.

Note for ARM v4 gcc uses ldmia to load a pair of 32-bit values to represent a
double 64-bit FP constant:

        adr r0, L6
        ldmia r0, {r0-r1}

        .align 2
L6:
        .long -858993459
        .long 1074318540

3) struct copies appear to be done field by field
instead of by words, at least sometimes:

struct foo { int x; short s; char c1; char c2; };
void cpy(struct foo*a, struct foo*b) { *a = *b; }

llvm code (-O2)
        ldrb r3, [r1, #+6]
        ldr r2, [r1]
        ldrb r12, [r1, #+7]
        ldrh r1, [r1, #+4]
        str r2, [r0]
        strh r1, [r0, #+4]
        strb r3, [r0, #+6]
        strb r12, [r0, #+7]
gcc code (-O2)
        ldmia r1, {r1-r2}
        stmia r0, {r1-r2}

In this benchmark poor handling of aggregate copies has shown up as
having a large effect on size, and possibly speed as well (we don't have
a good way to measure on ARM).

//===---------------------------------------------------------------------===//

* Consider this silly example:

double bar(double x) {
  double r = foo(3.1);
  return x+r;
}

_bar:
        stmfd sp!, {r4, r5, r7, lr}
        add r7, sp, #8
        mov r4, r0
        mov r5, r1
        fldd d0, LCPI1_0
        fmrrd r0, r1, d0
        bl _foo
        fmdrr d0, r4, r5
        fmsr s2, r0
        fsitod d1, s2
        faddd d0, d1, d0
        fmrrd r0, r1, d0
        ldmfd sp!, {r4, r5, r7, pc}

Ignore the prologue and epilogue stuff for a second. Note
        mov r4, r0
        mov r5, r1
the copies to callee-save registers and the fact they are only being used by
the fmdrr instruction. It would have been better had the fmdrr been scheduled
before the call and placed the result in a callee-save DPR register. The two
mov ops would then not have been necessary.

//===---------------------------------------------------------------------===//

Calling convention related stuff:

* gcc's parameter passing implementation is terrible and we suffer as a result:

e.g.
struct s {
  double d1;
  int s1;
};

void foo(struct s S) {
  printf("%g, %d\n", S.d1, S.s1);
}

'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and
then reloads them to r1, r2, and r3 before issuing the call (r0 contains the
address of the format string):

        stmfd sp!, {r7, lr}
        add r7, sp, #0
        sub sp, sp, #12
        stmia sp, {r0, r1, r2}
        ldmia sp, {r1-r2}
        ldr r0, L5
        ldr r3, [sp, #8]
L2:
        add r0, pc, r0
        bl L_printf$stub

Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves?

* Returning an aggregate type is even worse:

e.g.
struct s foo(void) {
  struct s S = {1.1, 2};
  return S;
}

        mov ip, r0
        ldr r0, L5
        sub sp, sp, #12
L2:
        add r0, pc, r0
        @ lr needed for prologue
        ldmia r0, {r0, r1, r2}
        stmia sp, {r0, r1, r2}
        stmia ip, {r0, r1, r2}
        mov r0, ip
        add sp, sp, #12
        bx lr

r0 (and later ip) is the hidden parameter from caller to store the value in. The
first ldmia loads the constants into r0, r1, r2. The last stmia stores r0, r1,
r2 into the address passed in. However, there is one additional stmia that
stores r0, r1, and r2 to some stack location. The store is dead.

The llvm-gcc generated code looks like this:

csretcc void %foo(%struct.s* %agg.result) {
entry:
        %S = alloca %struct.s, align 4 ; <%struct.s*> [#uses=1]
        %memtmp = alloca %struct.s ; <%struct.s*> [#uses=1]
        cast %struct.s* %S to sbyte* ; <sbyte*>:0 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
        cast %struct.s* %agg.result to sbyte* ; <sbyte*>:1 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
        cast %struct.s* %memtmp to sbyte* ; <sbyte*>:2 [#uses=1]
        call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
        ret void
}

llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from
constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated
into a number of load and stores, or 2) custom lower memcpy (of small size) to
be ldmia / stmia. I think option 2 is better but the current register
allocator cannot allocate a chunk of registers at a time.

A feasible temporary solution is to use specific physical registers at
lowering time for small (<= 4 words?) transfer size.

* ARM CSRet calling convention requires the hidden argument to be returned by
  the callee.

//===---------------------------------------------------------------------===//

We can definitely do a better job on BB placements to eliminate some branches.
It's very common to see llvm generated assembly code that looks like this:

LBB3:
 ...
LBB4:
 ...
        beq LBB3
        b LBB2

If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can
then eliminate the beq and turn the unconditional branch to LBB2 into a bne.

See McCat/18-imp/ComputeBoundingBoxes for an example.

//===---------------------------------------------------------------------===//

Pre-/post- indexed load / stores:

1) We should not make the pre/post- indexed load/store transform if the base ptr
is guaranteed to be live beyond the load/store. This can happen if the base
ptr is live out of the block in which we are performing the optimization. e.g.

        mov r1, r2
        ldr r3, [r1], #4
        ...

vs.

        ldr r3, [r2]
        add r1, r2, #4
        ...

In most cases, this is just a wasted optimization. However, sometimes it can
negatively impact the performance because two-address code is more restrictive
when it comes to scheduling.

Unfortunately, liveout information is currently unavailable during DAG combine
time.

2) Consider splitting an indexed load / store into a pair of add/sub + load/store
   to solve #1 (in TwoAddressInstructionPass.cpp).

3) Enhance LSR to generate more opportunities for indexed ops.

4) Once we have added support for multiple result patterns, write indexed load
   patterns instead of C++ instruction selection code.

5) Use FLDM / FSTM to emulate indexed FP load / store.

//===---------------------------------------------------------------------===//

Implement support for some more tricky ways to materialize immediates. For
example, to get 0xffff8000, we can use:

        mov r9, #&3f8000
        sub r9, r9, #&400000
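
Why this needs two instructions can be checked with a short C sketch. It
assumes the classic ARM data-processing immediate form (an 8-bit value rotated
right by an even amount); the function names are hypothetical and this is not
the backend's own helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate left by n bits; n == 0 is handled separately to avoid a shift by 32. */
static uint32_t rotl32(uint32_t v, unsigned n) {
    n &= 31u;
    return n ? (v << n) | (v >> (32u - n)) : v;
}

/* Hypothetical check: true if v is an 8-bit value rotated right by an even
   amount, i.e. some even left rotation brings it back into 8 bits. */
static bool is_rotated_imm8(uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2)
        if (rotl32(v, rot) <= 0xFFu)
            return true;
    return false;
}
```

Under that assumption neither 0xffff8000 nor its complement 0x7fff is
encodable, so a single mov or mvn cannot produce the value, while 0x3f8000 and
0x400000 both are and their difference is the wanted constant.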

//===---------------------------------------------------------------------===//

We sometimes generate multiple add / sub instructions to update sp in prologue
and epilogue if the inc / dec value is too large to fit in a single immediate
operand. In some cases, perhaps it might be better to load the value from a
constantpool instead.

//===---------------------------------------------------------------------===//

GCC generates significantly better code for this function.

int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
    int i = 0;

    if (StackPtr != 0) {
       while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768)))
          Line[i++] = Stack[--StackPtr];
        if (LineLen > 32768)
        {
            while (StackPtr != 0 && i < LineLen)
            {
                i++;
                --StackPtr;
            }
        }
    }
    return StackPtr;
}

//===---------------------------------------------------------------------===//

This should compile to the mlas instruction:
int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }

//===---------------------------------------------------------------------===//

At some point, we should triage these to see if they still apply to us:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663

http://www.inf.u-szeged.hu/gcc-arm/
http://citeseer.ist.psu.edu/debus04linktime.html

//===---------------------------------------------------------------------===//

gcc generates smaller code for this function at -O2 or -Os:

void foo(signed char* p) {
  if (*p == 3)
     bar();
   else if (*p == 4)
    baz();
  else if (*p == 5)
    quux();
}

llvm decides it's a good idea to turn the repeated if...else into a
binary tree, as if it were a switch; the resulting code requires -1
compare-and-branches when *p<=2 or *p==5, the same number if *p==4
or *p>6, and +1 if *p==3. So it should be a speed win
(on balance). However, the revised code is larger, with 4 conditional
branches instead of 3.

More seriously, there is a byte->word extend before
each comparison, where there should be only one, and the condition codes
are not remembered when the same two values are compared twice.

//===---------------------------------------------------------------------===//

More register scavenging work:

1. Use the register scavenger to track frame index materialized into registers
   (those that do not fit in addressing modes) to allow reuse in the same BB.
2. Finish scavenging for Thumb.

//===---------------------------------------------------------------------===//

More LSR enhancements possible:

1. Teach LSR about pre- and post- indexed ops to allow the iv increment to be
   merged into a load / store.
2. Allow iv reuse even when a type conversion is required. For example, i8
   and i32 load / store addressing modes are identical.


//===---------------------------------------------------------------------===//

This:

int foo(int a, int b, int c, int d) {
  long long acc = (long long)a * (long long)b;
  acc += (long long)c * (long long)d;
  return (int)(acc >> 32);
}

Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies
two signed 32-bit values to produce a 64-bit value, and accumulates this with
a 64-bit value.

We currently get this with both v4 and v6:

_foo:
        smull r1, r0, r1, r0
        smull r3, r2, r3, r2
        adds r3, r3, r1
        adc r0, r2, r0
        bx lr

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

_Z8full_addjj:
        adds r2, r1, r2
        movcc r1, #0
        movcs r1, #1
        str r2, [r0, #0]
        strb r1, [r0, #4]
        mov pc, lr

_Z11no_overflowjj:
        cmn r0, r1
        movcs r0, #0
        movcc r0, #1
        mov pc, lr

not:

__Z8full_addjj:
        add r3, r2, r1
        str r3, [r0]
        mov r2, #1
        mov r12, #0
        cmp r3, r1
        movlo r12, r2
        str r12, [r0, #+4]
        bx lr
__Z11no_overflowjj:
        add r3, r1, r0
        mov r2, #1
        mov r1, #0
        cmp r3, r0
        movhs r1, r2
        mov r0, r1
        bx lr
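
The desired code relies on `a + b < a` being exactly the carry out of the
32-bit add, which is what adds and cmn leave in the C flag. A minimal C sketch
of that equivalence follows; both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* The source-level idiom: the wrapped sum is smaller than an operand
   exactly when the addition overflowed 32 bits. */
static bool carry_by_compare(uint32_t a, uint32_t b) {
    return (uint32_t)(a + b) < a;
}

/* Reference: do the add in 64 bits and look at bit 32, which is what the
   carry flag holds after adds or cmn. */
static bool carry_by_wide_add(uint32_t a, uint32_t b) {
    return (((uint64_t)a + (uint64_t)b) >> 32) != 0;
}
```

Since the two agree for all inputs, the separate cmp after the add in the
current output is redundant once the add itself sets the flags.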

//===---------------------------------------------------------------------===//

Some of the NEON intrinsics may be appropriate for more general use, either
as target-independent intrinsics or perhaps elsewhere in the ARM backend.
Some of them may also be lowered to target-independent SDNodes, and perhaps
some new SDNodes could be added.

For example, maximum, minimum, and absolute value operations are well-defined
and standard operations, both for vector and scalar types.

The current NEON-specific intrinsics for count leading zeros and count one
bits could perhaps be replaced by the target-independent ctlz and ctpop
intrinsics. It may also make sense to add a target-independent "ctls"
intrinsic for "count leading sign bits". Likewise, the backend could use
the target-independent SDNodes for these operations.
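
For reference, here is a scalar C sketch of what a "count leading sign bits"
operation would compute. It assumes the NEON VCLS meaning (the number of
consecutive bits below the top bit that equal the top bit); the function name
is hypothetical and no such target-independent intrinsic is implied to exist.

```c
#include <stdint.h>

/* Hypothetical scalar model of "count leading sign bits" on 32 bits:
   counts bits 30, 29, ... that match the sign bit, stopping at the first
   mismatch. The sign bit itself is not counted, so the maximum is 31. */
static unsigned count_leading_sign_bits(int32_t x) {
    uint32_t u = (uint32_t)x;
    uint32_t sign = u >> 31;
    unsigned n = 0;
    for (int i = 30; i >= 0 && ((u >> i) & 1u) == sign; --i)
        ++n;
    return n;
}
```

Unlike ctlz, the result is the same for a value and its bitwise complement,
which is why it cannot be expressed as a single ctlz without an extra xor.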

ARMv6 has scalar saturating and halving adds and subtracts. The same
intrinsics could possibly be used for both NEON's vector implementations of
those operations and the ARMv6 scalar versions.

//===---------------------------------------------------------------------===//

ARM::MOVCCr is commutable (by flipping the condition). But we need to implement
ARMInstrInfo::commuteInstruction() to support it.
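
The commutation rule can be stated as a small C model. This is a sketch of the
semantics only, not of ARMInstrInfo; the function names are hypothetical. It
relies on the ARM condition encoding, where each code and its inverse differ
only in bit 0 (EQ=0/NE=1, ..., GT=12/LE=13).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a conditional move: the result is src when the
   condition holds, otherwise the tied input (the old destination value). */
static int32_t movcc(bool cond_holds, int32_t tied_in, int32_t src) {
    return cond_holds ? src : tied_in;
}

/* Commuted form: swap the two operands and test the opposite condition. */
static int32_t movcc_commuted(bool cond_holds, int32_t tied_in, int32_t src) {
    return movcc(!cond_holds, src, tied_in);
}

/* Inverse of an ARM condition code. Not meaningful for AL (14). */
static unsigned invert_cond(unsigned cc) {
    return cc ^ 1u;
}
```

Both forms yield the same value for either flag state, so the register
allocator would be free to pick whichever operand is cheaper to tie.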

//===---------------------------------------------------------------------===//

Split out LDR (literal) from normal ARM LDR instruction. Also consider splitting
LDR into imm12 and so_reg forms. This allows us to clean up some code. e.g.
ARMLoadStoreOptimizer does not need to look at LDR (literal) and LDR (so_reg)
while ARMConstantIslandPass only needs to worry about LDR (literal).

//===---------------------------------------------------------------------===//

We need to fix constant isel for ARMv6t2 to use MOVT.
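
As a reminder of what that selection has to produce, here is a C sketch of
building a 32-bit constant from two 16-bit immediates: a 16-bit move that
zeroes the top half, followed by MOVT writing only the top half. The function
names are hypothetical and this is not isel code.

```c
#include <stdint.h>

/* Immediate for the 16-bit move: the bottom half of the constant. */
static uint16_t low_half_imm(uint32_t c) {
    return (uint16_t)(c & 0xFFFFu);
}

/* Immediate for MOVT: the top half of the constant. */
static uint16_t high_half_imm(uint32_t c) {
    return (uint16_t)(c >> 16);
}

/* Hypothetical model of the two-instruction sequence. */
static uint32_t mov16_then_movt(uint16_t lo, uint16_t hi) {
    uint32_t reg = lo;                              /* 16-bit move: top half becomes 0 */
    reg = (reg & 0xFFFFu) | ((uint32_t)hi << 16);   /* MOVT: bottom half is preserved */
    return reg;
}
```

Any 32-bit value round-trips through the pair, so no constant pool load is
needed, and MOVT can be skipped when the top half is zero.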

//===---------------------------------------------------------------------===//

Constant island pass should make use of full range SoImm values for LEApcrel.
Be careful though as the last attempt caused infinite looping on lencod.

//===---------------------------------------------------------------------===//

Predication issue. This function:

extern unsigned array[ 128 ];
int foo( int x ) {
  int y;
  y = array[ x & 127 ];
  if ( x & 128 )
     y = 123456789 & ( y >> 2 );
  else
     y = 123456789 & y;
  return y;
}

compiles to:

_foo:
        and r1, r0, #127
        ldr r2, LCPI1_0
        ldr r2, [r2]
        ldr r1, [r2, +r1, lsl #2]
        mov r2, r1, lsr #2
        tst r0, #128
        moveq r2, r1
        ldr r0, LCPI1_1
        and r0, r2, r0
        bx lr

It would be better to do something like this, to fold the shift into the
conditional move:

        and r1, r0, #127
        ldr r2, LCPI1_0
        ldr r2, [r2]
        ldr r1, [r2, +r1, lsl #2]
        tst r0, #128
        movne r1, r1, lsr #2
        ldr r0, LCPI1_1
        and r0, r1, r0
        bx lr

This saves an instruction and a register.

//===---------------------------------------------------------------------===//

add/sub/and/or + i32 imm can be simplified by folding part of the immediate
into the operation.
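
One way to picture the folding is to split a wide immediate into pieces that
each fit an ARM rotated 8-bit immediate, so that an add or or becomes a short
chain of the same operation. The C sketch below uses a simple greedy split; it
is an assumption for illustration, with hypothetical names, and it does not try
to exploit immediates that wrap around bit 31.

```c
#include <stdint.h>

/* Hypothetical greedy split: each part is an 8-bit field starting at an even
   bit position, which is always a valid rotated immediate. At most 4 parts. */
static unsigned split_imm(uint32_t imm, uint32_t parts[4]) {
    unsigned n = 0;
    while (imm != 0) {
        unsigned pos = 0;
        while (((imm >> pos) & 1u) == 0)
            ++pos;                      /* position of the lowest set bit */
        pos &= ~1u;                     /* rotations must be even */
        uint32_t chunk = imm & (0xFFu << pos);
        parts[n++] = chunk;
        imm &= ~chunk;                  /* clears at least 8 bit positions */
    }
    return n;
}

/* Number of instructions the chain would need for this immediate. */
static unsigned chunk_count(uint32_t imm) {
    uint32_t parts[4];
    return split_imm(imm, parts);
}

/* Recombine the parts; they are disjoint, so OR and ADD give the same sum. */
static uint32_t rejoin(uint32_t imm) {
    uint32_t parts[4];
    unsigned n = split_imm(imm, parts);
    uint32_t v = 0;
    for (unsigned i = 0; i < n; ++i)
        v |= parts[i];
    return v;
}
```

For example, adding 0x12345 splits into 0x45, 0x2300 and 0x10000, i.e. three
adds with no constant pool load. Whether that beats a load depends on the
chunk count, which is the decision the note is asking the backend to make.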

//===---------------------------------------------------------------------===//

It might be profitable to cse MOVi16 if there are lots of 32-bit immediates
with the same bottom half.

//===---------------------------------------------------------------------===//

Robert Muth started working on an alternate jump table implementation that
does not put the tables in-line in the text. This is more like the llvm
default jump table implementation. This might be useful sometime. Several
revisions of patches are on the mailing list, beginning at:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2009-June/022763.html

//===---------------------------------------------------------------------===//

Make use of the "rbit" instruction.
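
For context, rbit reverses the bit order of a 32-bit register. A C sketch of
the kind of source loop it could replace is shown below; the function name is
hypothetical, and recognizing such a loop is an assumption about how the
instruction might be used, not a description of existing support.

```c
#include <stdint.h>

/* Bit reversal written the long way: bit i of the input ends up in
   bit 31 - i of the result. */
static uint32_t reverse_bits32(uint32_t v) {
    uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        r = (r << 1) | (v & 1u);   /* append the current lowest bit of v */
        v >>= 1;
    }
    return r;
}
```

A second common use is count-trailing-zeros, which can be computed as a bit
reversal followed by a count-leading-zeros.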