//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend.
//===---------------------------------------------------------------------===//

Reimplement 'select' in terms of 'SEL'.

* We would really like to support UXTAB16, but we need to prove that the
  add doesn't need to overflow between the two 16-bit chunks (see the C
  sketch after this list).

* Implement pre/post increment support.  (e.g. PR935)
* Implement smarter constant generation for binops with large immediates.

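For the UXTAB16 item above, a minimal C sketch of the kind of pattern we would
want to match (the function and argument names are made up).  UXTAB16 performs
two independent halfword adds, each adding a zero-extended byte from the second
operand, so the transform is only safe when the low halfword add provably
cannot carry into the high halfword:

/* Illustration only: a 32-bit add whose second operand holds a zero-extended
   byte in each halfword.  UXTAB16 would do this as two independent 16-bit
   adds, so it is only equivalent if the low halfword add cannot overflow
   into the high halfword. */
unsigned uxtab16_candidate(unsigned acc, unsigned packed) {
  return acc + (packed & 0x00ff00ffu);
}
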
//===---------------------------------------------------------------------===//

Crazy idea:  Consider code that uses lots of 8-bit or 16-bit values.  By the
time regalloc happens, these values are now in a 32-bit register, usually with
the top bits known to be sign or zero extended.  If spilled, we should be able
to spill these to an 8-bit or 16-bit stack slot, zero or sign extending as part
of the reload.

Doing this reduces the size of the stack frame (important for thumb etc), and
also increases the likelihood that we will be able to reload multiple values
from the stack with a single load.

//===---------------------------------------------------------------------===//

The constant island pass is in good shape.  Some cleanups might be desirable,
but there is unlikely to be much improvement in the generated code.

1.  There may be some advantage to trying to be smarter about the initial
    placement, rather than putting everything at the end.

2.  There might be some compile-time efficiency to be had by representing
    consecutive islands as a single block rather than multiple blocks.

3.  Use a priority queue to sort constant pool users in inverse order of
    position, so we always process the one closest to the end of the function
    first.  This may simplify CreateNewWater.

//===---------------------------------------------------------------------===//

Eliminate copysign custom expansion.  We are still generating crappy code with
default expansion + if-conversion.
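A trivial reproducer (the function name is made up; this assumes the libm call
is what reaches the copysign lowering):

#include <math.h>

/* Exercises the copysign lowering path, which currently goes through the
   custom expansion mentioned above. */
double copysign_test(double mag, double sgn) {
  return copysign(mag, sgn);
}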

//===---------------------------------------------------------------------===//

Eliminate one instruction from:

define i32 @_Z6slow4bii(i32 %x, i32 %y) {
        %tmp = icmp sgt i32 %x, %y
        %retval = select i1 %tmp, i32 %x, i32 %y
        ret i32 %retval
}

__Z6slow4bii:
        cmp r0, r1
        movgt r1, r0
        mov r0, r1
        bx lr
=>

__Z6slow4bii:
        cmp r0, r1
        movle r0, r1
        bx lr

//===---------------------------------------------------------------------===//

Implement long long "X-3" with instructions that fold the immediate in.  These
were disabled due to badness with the ARM carry flag on subtracts.
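
A minimal reproducer (function name made up); ideally the constant would fold
straight into the subtract sequence:

/* Would like this to become a subs/sbc pair with the immediate folded in,
   rather than materializing the constant separately. */
long long sub3(long long x) {
  return x - 3;
}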

//===---------------------------------------------------------------------===//

More load / store optimizations:
1) Better representation for block transfer?  This is from Olden/power:

        fldd d0, [r4]
        fstd d0, [r4, #+32]
        fldd d0, [r4, #+8]
        fstd d0, [r4, #+40]
        fldd d0, [r4, #+16]
        fstd d0, [r4, #+48]
        fldd d0, [r4, #+24]
        fstd d0, [r4, #+56]

If we can spare the registers, it would be better to use fldm and fstm here.
Need major register allocator enhancement though.

2) Can we recognize the relative position of constantpool entries?  i.e. treat

        ldr r0, LCPI17_3
        ldr r1, LCPI17_4
        ldr r2, LCPI17_5

   as

        ldr r0, LCPI17
        ldr r1, LCPI17+4
        ldr r2, LCPI17+8

   Then the ldr's can be combined into a single ldm.  See Olden/power.

Note that for ARM v4, gcc uses ldmia to load a pair of 32-bit values to
represent a double 64-bit FP constant:

        adr r0, L6
        ldmia r0, {r0-r1}

        .align 2
L6:
        .long -858993459
        .long 1074318540

3) struct copies appear to be done field by field instead of by words, at
   least sometimes:

struct foo { int x; short s; char c1; char c2; };
void cpy(struct foo*a, struct foo*b) { *a = *b; }

llvm code (-O2):
        ldrb r3, [r1, #+6]
        ldr r2, [r1]
        ldrb r12, [r1, #+7]
        ldrh r1, [r1, #+4]
        str r2, [r0]
        strh r1, [r0, #+4]
        strb r3, [r0, #+6]
        strb r12, [r0, #+7]

gcc code (-O2):
        ldmia r1, {r1-r2}
        stmia r0, {r1-r2}

In this benchmark, poor handling of aggregate copies has shown up as having a
large effect on size, and possibly speed as well (we don't have a good way to
measure on ARM).

//===---------------------------------------------------------------------===//

* Consider this silly example:

double bar(double x) {
  double r = foo(3.1);
  return x+r;
}

_bar:
        stmfd sp!, {r4, r5, r7, lr}
        add r7, sp, #8
        mov r4, r0
        mov r5, r1
        fldd d0, LCPI1_0
        fmrrd r0, r1, d0
        bl _foo
        fmdrr d0, r4, r5
        fmsr s2, r0
        fsitod d1, s2
        faddd d0, d1, d0
        fmrrd r0, r1, d0
        ldmfd sp!, {r4, r5, r7, pc}

Ignore the prologue and epilogue stuff for a second.  Note
        mov r4, r0
        mov r5, r1
the copies to callee-save registers and the fact that they are only being used
by the fmdrr instruction.  It would have been better had the fmdrr been
scheduled before the call and placed the result in a callee-save DPR register.
The two mov ops would not have been necessary.

//===---------------------------------------------------------------------===//

Calling convention related stuff:

* gcc's parameter passing implementation is terrible and we suffer as a result:

e.g.
struct s {
  double d1;
  int s1;
};

void foo(struct s S) {
  printf("%g, %d\n", S.d1, S.s1);
}

'S' is passed via registers r0, r1, r2.  But gcc stores them to the stack, and
then reloads them to r1, r2, and r3 before issuing the call (r0 contains the
address of the format string):

        stmfd sp!, {r7, lr}
        add r7, sp, #0
        sub sp, sp, #12
        stmia sp, {r0, r1, r2}
        ldmia sp, {r1-r2}
        ldr r0, L5
        ldr r3, [sp, #8]
L2:
        add r0, pc, r0
        bl L_printf$stub

Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves?

* Returning an aggregate type is even worse:

e.g.
struct s foo(void) {
  struct s S = {1.1, 2};
  return S;
}

        mov ip, r0
        ldr r0, L5
        sub sp, sp, #12
L2:
        add r0, pc, r0
        @ lr needed for prologue
        ldmia r0, {r0, r1, r2}
        stmia sp, {r0, r1, r2}
        stmia ip, {r0, r1, r2}
        mov r0, ip
        add sp, sp, #12
        bx lr

r0 (and later ip) is the hidden parameter from the caller giving the address to
store the value in.  The first ldmia loads the constants into r0, r1, r2.  The
last stmia stores r0, r1, r2 into the address passed in.  However, there is one
additional stmia that stores r0, r1, and r2 to some stack location.  That store
is dead.

The llvm-gcc generated code looks like this:

csretcc void %foo(%struct.s* %agg.result) {
entry:
        %S = alloca %struct.s, align 4          ; <%struct.s*> [#uses=1]
        %memtmp = alloca %struct.s              ; <%struct.s*> [#uses=1]
        cast %struct.s* %S to sbyte*            ; <sbyte*>:0 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
        cast %struct.s* %agg.result to sbyte*   ; <sbyte*>:1 [#uses=2]
        call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
        cast %struct.s* %memtmp to sbyte*       ; <sbyte*>:2 [#uses=1]
        call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
        ret void
}

llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from the
constantpool).  Perhaps we should 1) fix llvm-gcc so the memcpy is translated
into a number of loads and stores, or 2) custom lower memcpy (of small size) to
be ldmia / stmia.  I think option 2 is better, but the current register
allocator cannot allocate a chunk of registers at a time.

A feasible temporary solution is to use specific physical registers at lowering
time for small (<= 4 words?) transfer sizes.

* The ARM CSRet calling convention requires the hidden argument to be returned
  by the callee.

//===---------------------------------------------------------------------===//

We can definitely do a better job on BB placements to eliminate some branches.
It's very common to see llvm generated assembly code that looks like this:

LBB3:
 ...
LBB4:
...
 beq LBB3
 b LBB2

If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4.  We can
then eliminate the beq and turn the unconditional branch to LBB2 into a bne.

See McCat/18-imp/ComputeBoundingBoxes for an example.

//===---------------------------------------------------------------------===//

Pre-/post- indexed load / stores:

1) We should not make the pre/post- indexed load/store transform if the base
   ptr is guaranteed to be live beyond the load/store.  This can happen if the
   base ptr is live out of the block we are performing the optimization in,
   e.g.

mov r1, r2
ldr r3, [r1], #4
...

vs.

ldr r3, [r2]
add r1, r2, #4
...

In most cases, this is just a wasted optimization.  However, sometimes it can
negatively impact performance because two-address code is more restrictive
when it comes to scheduling.

Unfortunately, liveout information is currently unavailable during DAG combine
time.

2) Consider splitting an indexed load / store into a pair of add/sub +
   load/store to solve #1 (in TwoAddressInstructionPass.cpp).

3) Enhance LSR to generate more opportunities for indexed ops (see the loop
   sketch after this list).

4) Once we add support for multiple result patterns, write indexed load
   patterns instead of C++ instruction selection code.

5) Use VLDM / VSTM to emulate indexed FP load / store.
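
For 3), a sketch (with made-up names) of a typical loop where a post-indexed
load pays off; LSR would ideally shape the address computation so the pointer
increment folds into the load:

/* The load inside the loop is a natural fit for a post-indexed ldr,
   e.g. "ldr rX, [rP], #4", folding the pointer increment into the load. */
int sum_words(const int *p, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += *p++;
  return s;
}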

//===---------------------------------------------------------------------===//

Implement support for some more tricky ways to materialize immediates.  For
example, to get 0xffff8000, we can use:

mov r9, #&3f8000
sub r9, r9, #&400000

//===---------------------------------------------------------------------===//

We sometimes generate multiple add / sub instructions to update sp in prologue
and epilogue if the inc / dec value is too large to fit in a single immediate
operand.  In some cases, perhaps it might be better to load the value from a
constantpool instead.
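
An illustrative reproducer (names made up; the frame size is chosen so that
the sp adjustment is unlikely to be encodable as a single rotated 8-bit
immediate, though the exact threshold depends on what else is in the frame):

extern void use_buffer(char *p);

/* The prologue/epilogue currently need two add/sub instructions each to
   adjust sp by a value like this. */
void big_frame(void) {
  char buf[4100];
  use_buffer(buf);
}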

//===---------------------------------------------------------------------===//

GCC generates significantly better code for this function:

int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
    int i = 0;

    if (StackPtr != 0) {
        while (StackPtr != 0 && i < (((LineLen) < (32768)) ? (LineLen) : (32768)))
            Line[i++] = Stack[--StackPtr];
        if (LineLen > 32768)
        {
            while (StackPtr != 0 && i < LineLen)
            {
                i++;
                --StackPtr;
            }
        }
    }
    return StackPtr;
}

//===---------------------------------------------------------------------===//

This should compile to the mlas instruction:
int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }

//===---------------------------------------------------------------------===//

At some point, we should triage these to see if they still apply to us:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663

http://www.inf.u-szeged.hu/gcc-arm/
http://citeseer.ist.psu.edu/debus04linktime.html

//===---------------------------------------------------------------------===//

gcc generates smaller code for this function at -O2 or -Os:

void foo(signed char* p) {
  if (*p == 3)
    bar();
  else if (*p == 4)
    baz();
  else if (*p == 5)
    quux();
}

llvm decides it's a good idea to turn the repeated if...else into a binary
tree, as if it were a switch; the resulting code requires one fewer
compare-and-branch when *p<=2 or *p==5, the same number when *p==4 or *p>6, and
one more when *p==3.  So it should be a speed win (on balance).  However, the
revised code is larger, with 4 conditional branches instead of 3.

More seriously, there is a byte->word extend before each comparison, where
there should be only one, and the condition codes are not remembered when the
same two values are compared twice.

//===---------------------------------------------------------------------===//

More LSR enhancements possible:

1. Teach LSR about pre- and post- indexed ops to allow the iv increment to be
   merged into a load / store.
2. Allow iv reuse even when a type conversion is required.  For example, i8
   and i32 load / store addressing modes are identical (see the loop sketch
   below).
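
For 2., a sketch (names are illustrative) of the kind of loop this matters for;
the i8 store and the i32 load ideally share one induction variable, with the
difference in scale handled by the addressing mode:

/* Two strided accesses with different element types; the induction variables
   differ only by a type conversion / scale, so ideally LSR reuses one iv for
   both. */
void trunc_copy(unsigned char *dst, const int *src, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = (unsigned char)src[i];
}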

//===---------------------------------------------------------------------===//

This:

int foo(int a, int b, int c, int d) {
  long long acc = (long long)a * (long long)b;
  acc += (long long)c * (long long)d;
  return (int)(acc >> 32);
}

Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies
two signed 32-bit values to produce a 64-bit value, and accumulates this with
a 64-bit value.

We currently get this with both v4 and v6:

_foo:
        smull r1, r0, r1, r0
        smull r3, r2, r3, r2
        adds r3, r3, r1
        adc r0, r2, r0
        bx lr

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

_Z8full_addjj:
        adds r2, r1, r2
        movcc r1, #0
        movcs r1, #1
        str r2, [r0, #0]
        strb r1, [r0, #4]
        mov pc, lr

_Z11no_overflowjj:
        cmn r0, r1
        movcs r0, #0
        movcc r0, #1
        mov pc, lr

not:

__Z8full_addjj:
        add r3, r2, r1
        str r3, [r0]
        mov r2, #1
        mov r12, #0
        cmp r3, r1
        movlo r12, r2
        str r12, [r0, #+4]
        bx lr

__Z11no_overflowjj:
        add r3, r1, r0
        mov r2, #1
        mov r1, #0
        cmp r3, r0
        movhs r1, r2
        mov r0, r1
        bx lr

//===---------------------------------------------------------------------===//

Some of the NEON intrinsics may be appropriate for more general use, either
as target-independent intrinsics or perhaps elsewhere in the ARM backend.
Some of them may also be lowered to target-independent SDNodes, and perhaps
some new SDNodes could be added.

For example, maximum, minimum, and absolute value operations are well-defined
and standard operations, both for vector and scalar types.

The current NEON-specific intrinsics for count leading zeros and count one
bits could perhaps be replaced by the target-independent ctlz and ctpop
intrinsics.  It may also make sense to add a target-independent "ctls"
intrinsic for "count leading sign bits".  Likewise, the backend could use
the target-independent SDNodes for these operations.
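
As a scalar illustration of the ctlz/ctpop direction (assuming the usual
GCC-style builtins, which the frontends lower to the target-independent
intrinsics):

/* Count leading zeros / population count; these normally become the
   target-independent llvm.ctlz / llvm.ctpop intrinsics.  Note that
   __builtin_clz is undefined for a zero argument. */
int clz32(unsigned x) { return __builtin_clz(x); }
int pop32(unsigned x) { return __builtin_popcount(x); }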

ARMv6 has scalar saturating and halving adds and subtracts.  The same
intrinsics could possibly be used for both NEON's vector implementations of
those operations and the ARMv6 scalar versions.

//===---------------------------------------------------------------------===//

ARM::MOVCCr is commutable (by flipping the condition).  But we need to
implement ARMInstrInfo::commuteInstruction() to support it.

//===---------------------------------------------------------------------===//

Split out LDR (literal) from the normal ARM LDR instruction.  Also consider
splitting LDR into imm12 and so_reg forms.  This allows us to clean up some
code; e.g. ARMLoadStoreOptimizer does not need to look at LDR (literal) and
LDR (so_reg), while ARMConstantIslandPass only needs to worry about LDR
(literal).

//===---------------------------------------------------------------------===//

The constant island pass should make use of full-range SoImm values for
LEApcrel.  Be careful though, as the last attempt caused infinite looping on
lencod.

//===---------------------------------------------------------------------===//

Predication issue.  This function:

extern unsigned array[ 128 ];
int foo( int x ) {
  int y;
  y = array[ x & 127 ];
  if ( x & 128 )
    y = 123456789 & ( y >> 2 );
  else
    y = 123456789 & y;
  return y;
}

compiles to:

_foo:
        and r1, r0, #127
        ldr r2, LCPI1_0
        ldr r2, [r2]
        ldr r1, [r2, +r1, lsl #2]
        mov r2, r1, lsr #2
        tst r0, #128
        moveq r2, r1
        ldr r0, LCPI1_1
        and r0, r2, r0
        bx lr

It would be better to do something like this, to fold the shift into the
conditional move:

        and r1, r0, #127
        ldr r2, LCPI1_0
        ldr r2, [r2]
        ldr r1, [r2, +r1, lsl #2]
        tst r0, #128
        movne r1, r1, lsr #2
        ldr r0, LCPI1_1
        and r0, r1, r0
        bx lr

It saves an instruction and a register.

//===---------------------------------------------------------------------===//

It might be profitable to cse MOVi16 if there are lots of 32-bit immediates
with the same bottom half.
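
A made-up example of the situation: all three constants share the low
halfword, so the movw (MOVi16) halves are identical and could in principle be
CSE'd into one register, leaving only the movt halves to differ (whether this
is a net win depends on how the resulting copies coalesce):

/* Three constants with the same bottom half (0x0001); only the top halves
   differ. */
void store_consts(unsigned *p) {
  p[0] = 0x00010001u;
  p[1] = 0x00020001u;
  p[2] = 0x00030001u;
}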

//===---------------------------------------------------------------------===//

Robert Muth started working on an alternate jump table implementation that
does not put the tables in-line in the text.  This is more like the llvm
default jump table implementation.  This might be useful sometime.  Several
revisions of patches are on the mailing list, beginning at:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2009-June/022763.html

//===---------------------------------------------------------------------===//

Make use of the "rbit" instruction.
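
There is no obvious portable C idiom for this, but full bit reversal is one
candidate; the naive version below is shown purely as an illustration of what
a single rbit could replace:

/* Naive 32-bit bit reversal; a single rbit instruction computes the same
   result. */
unsigned reverse_bits(unsigned x) {
  unsigned r = 0;
  for (int i = 0; i < 32; ++i) {
    r = (r << 1) | (x & 1u);
    x >>= 1;
  }
  return r;
}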

//===---------------------------------------------------------------------===//

Take a look at test/CodeGen/Thumb2/machine-licm.ll.  ARM should be taught how
to licm and cse the unnecessary load from cp#1.

//===---------------------------------------------------------------------===//

The CMN instruction sets the flags like an ADD instruction, while CMP sets
them like a subtract.  Therefore, to be able to use CMN for comparisons other
than the Z bit, we'll need additional logic to reverse the conditionals
associated with the comparison.  Perhaps a pseudo-instruction for the
comparison, with a post-codegen pass to clean up and handle the condition
codes?  See PR5694 for a testcase.
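
A couple of hedged C examples of where CMN applies (cmn rA, rB sets the flags
for rA + rB, i.e. a compare of rA against -rB).  The equality case only needs
the Z bit, so no condition adjustment is required; the ordered case is where
the condition-reversal logic described above comes in:

/* Equality against a negated value: cmn plus moveq/movne works as-is, since
   only the Z bit is consulted. */
int eq_neg(int a, int b) { return a == -b; }

/* Ordered comparison against a negated value: only usable with CMN if the
   condition code is adjusted for add-style flags. */
int lt_neg(int a, int b) { return a < -b; }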