Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

FreeBench/mason contains code like this:

static p_type m0u(p_type p) {
  int m[]={0, 8, 1, 2, 16, 5, 13, 7, 14, 9, 3, 4, 11, 12, 15, 10, 17, 6};
  p_type pu;
  pu.a = m[p.a];
  pu.b = m[p.b];
  pu.c = m[p.c];
  return pu;
}

We currently compile this into a memcpy from a static array into 'm', then
a bunch of loads from m.  It would be better to avoid the memcpy and just do
loads from the static array.

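A minimal sketch of the code we would like the optimizer to reach on its own,
written by hand at the source level.  The p_type definition is an assumption
(the benchmark's real type is not shown here), and m0u_table / m0u_static are
illustrative names:

```c
#include <assert.h>

/* Hypothetical stand-in for the benchmark's p_type; the real definition
   is not shown in the note above. */
typedef struct { unsigned a, b, c; } p_type;

/* The table lives in read-only static storage, so there is no per-call
   memcpy into a stack array; each lookup is a single load. */
static const int m0u_table[18] = {0, 8, 1, 2, 16, 5, 13, 7, 14,
                                  9, 3, 4, 11, 12, 15, 10, 17, 6};

p_type m0u_static(p_type p) {
  p_type pu;
  /* The table has 18 entries, so every index must stay below 18. */
  assert(p.a < 18 && p.b < 18 && p.c < 18);
  pu.a = (unsigned)m0u_table[p.a];
  pu.b = (unsigned)m0u_table[p.b];
  pu.c = (unsigned)m0u_table[p.c];
  return pu;
}
```

Because 'm' is never written, loading straight from the constant table gives
the same results as the original function.
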
//===---------------------------------------------------------------------===//

Make the PPC branch selector target independent

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
        movl Y, %eax
        shll $3, %eax
        orl X, %eax
        movl %eax, X
        ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that X and Y
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack.  But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
  x = 1ULL << i;
}

into:
  long long tmp = 1;
  for (i = ...; ++i, tmp+=tmp)
    x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

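The two loop shapes can be compared directly.  This is only a sketch: the
function names and the output array are made up for illustration, and unsigned
64-bit arithmetic is used so the final doubling wraps instead of overflowing:

```c
#include <assert.h>
#include <stdint.h>

/* Original form: one variable 64-bit shift per iteration. */
void fill_with_shift(uint64_t *out, unsigned n) {
  assert(n <= 64);               /* shifting by 64 or more is undefined */
  for (unsigned i = 0; i < n; ++i)
    out[i] = 1ULL << i;
}

/* Strength-reduced form: the shift becomes a running value that is
   doubled with a 64-bit add on every iteration. */
void fill_with_doubling(uint64_t *out, unsigned n) {
  uint64_t tmp = 1;
  assert(n <= 64);
  for (unsigned i = 0; i < n; ++i, tmp += tmp)
    out[i] = tmp;
}
```

On a 32-bit target the add is an add/add-with-carry pair, which is what makes
the second form attractive where variable 64-bit shifts are slow.
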
//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn: X*X*X*X -> t=(X*X) (t*t) to eliminate a multiply.

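Written out in C, the reassociation looks like this.  The function names are
illustrative, and unsigned arithmetic is assumed so that both forms agree even
when the product wraps:

```c
#include <assert.h>

/* As written: ((x*x)*x)*x, which takes three multiplies. */
unsigned pow4_chain(unsigned x) {
  return x * x * x * x;
}

/* Reassociated: t = x*x, then t*t, which takes two multiplies. */
unsigned pow4_reassoc(unsigned x) {
  unsigned t = x * x;
  return t * t;
}

/* Small self-check that both forms agree, including on wrapping inputs. */
void check_pow4(void) {
  static const unsigned inputs[] = {0u, 1u, 3u, 65536u, 0xFFFFFFFFu};
  for (int i = 0; i < 5; ++i)
    assert(pow4_chain(inputs[i]) == pow4_reassoc(inputs[i]));
}
```
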
//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j,int *l)  {  return memcmp(j,l,4);  }
int h(int *j, int *l) {  return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

Add LSR exit value substitution.  It'll probably be a win for Ackermann, etc.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For packed types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works but can be overly conservative as the alignment of
specific packed types is target dependent.

//===---------------------------------------------------------------------===//

We should add 'unaligned load/store' nodes, and produce them from code like
this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

We should constant fold packed type casts at the LLVM level, regardless of the
cast.  Currently we cannot fold some casts because we don't have TargetData
information in the constant folder, so we don't know the endianness of the
target!

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2      #cond_next
LBB16_1:        #cond_true
        incl _foo
LBB16_2:        #cond_next

emit:
        movl _foo, %eax
        cmpl $1, %edi
        sbbl $-1, %eax
        movl %eax, _foo

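At the source level the pattern is an increment guarded by a test against
zero.  The sketch below uses a hypothetical global foo_counter in place of
_foo and made-up function names.  The cmp/sbb pair computes foo + 1 - carry,
where the carry is set only when the tested value is zero, which is the same
as adding (cond != 0):

```c
#include <assert.h>

/* Hypothetical global standing in for _foo in the assembly above. */
int foo_counter;

/* Branchy source form: increment only when cond is nonzero. */
void inc_if_branchy(int cond) {
  if (cond != 0)
    foo_counter++;
}

/* Branch-free form: (cond != 0) is exactly 0 or 1, so it can simply be
   added to the counter. */
void inc_if_branchless(int cond) {
  foo_counter += (cond != 0);
}

/* Small self-check that both forms agree for a few inputs. */
void check_conditional_increment(void) {
  static const int conds[] = {0, 1, -1, 42};
  for (int i = 0; i < 4; ++i) {
    int expect;
    foo_counter = 10;
    inc_if_branchy(conds[i]);
    expect = foo_counter;
    foo_counter = 10;
    inc_if_branchless(conds[i]);
    assert(foo_counter == expect);
  }
}
```
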
//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      void sincos(double x, double *sin, double *cos);
      void sincosf(float x, float *sin, float *cos);
      void sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

//===---------------------------------------------------------------------===//

Scalar Repl cannot currently promote this testcase to 'ret long cst':

        %struct.X = type { int, int }
        %struct.Y = type { %struct.X }
ulong %bar() {
        %retval = alloca %struct.Y, align 8
        %tmp12 = getelementptr %struct.Y* %retval, int 0, uint 0, uint 0
        store int 0, int* %tmp12
        %tmp15 = getelementptr %struct.Y* %retval, int 0, uint 0, uint 1
        store int 1, int* %tmp15
        %retval = bitcast %struct.Y* %retval to ulong*
        %retval = load ulong* %retval
        ret ulong %retval
}

It should be extended to do so.

//===---------------------------------------------------------------------===//

-scalarrepl should promote this to be a vector scalar.

        %struct..0anon = type { <4 x float> }

implementation   ; Functions:

void %test1(<4 x float> %V, float* %P) {
        %u = alloca %struct..0anon, align 16
        %tmp = getelementptr %struct..0anon* %u, int 0, uint 0
        store <4 x float> %V, <4 x float>* %tmp
        %tmp1 = bitcast %struct..0anon* %u to [4 x float]*
        %tmp = getelementptr [4 x float]* %tmp1, int 0, int 1
        %tmp = load float* %tmp
        %tmp3 = mul float %tmp, 2.000000e+00
        store float %tmp3, float* %P
        ret void
}

//===---------------------------------------------------------------------===//

Turn this into a single byte store with no load (the other 3 bytes are
unmodified):

void %test(uint* %P) {
        %tmp = load uint* %P
        %tmp14 = or uint %tmp, 3305111552
        %tmp15 = and uint %tmp14, 3321888767
        store uint %tmp15, uint* %P
        ret void
}

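In hex the two constants are 0xC5000000 and 0xC5FFFFFF, so the or/and pair
forces the most significant byte to 0xC5 and leaves the low three bytes alone.
A sketch of both forms in C follows; the function names are made up, and the
byte index is picked with a run-time endianness probe only to keep the example
portable:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load/modify/store form, mirroring the IR above:
   3305111552 == 0xC5000000 and 3321888767 == 0xC5FFFFFF. */
void set_top_byte_rmw(uint32_t *p) {
  uint32_t tmp;
  assert(p != NULL);
  tmp = *p;
  tmp |= 0xC5000000u;
  tmp &= 0xC5FFFFFFu;
  *p = tmp;
}

/* Desired form: one byte store and no load of the old value.  The most
   significant byte is byte 3 on little-endian hosts and byte 0 on
   big-endian hosts. */
void set_top_byte_store(uint32_t *p) {
  const uint32_t probe = 1;
  unsigned char first;
  size_t msb_index;
  assert(p != NULL);
  memcpy(&first, &probe, 1);
  msb_index = first ? 3 : 0;
  ((unsigned char *)p)[msb_index] = 0xC5;
}
```
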
//===---------------------------------------------------------------------===//

dag/inst combine "clz(x)>>5 -> x==0" for 32-bit x.

Compile:

int bar(int x)
{
  int t = __builtin_clz(x);
  return -(t>>5);
}

to:

_bar:   addic r3,r3,-1
        subfe r3,r3,r3
        blr

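The identity holds because a 32-bit clz result lies in 0..32 and only the
value 32 (the x == 0 case) has bit 5 set.  The sketch below uses a hand-written
clz32 that is defined to return 32 for zero, since GCC documents
__builtin_clz(0) as undefined; all names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros, defined to return 32 for x == 0. */
unsigned clz32(uint32_t x) {
  unsigned n = 0;
  if (x == 0)
    return 32;
  while ((x & 0x80000000u) == 0) {
    x <<= 1;
    ++n;
  }
  return n;
}

/* clz(x) >> 5 is 1 only when clz(x) == 32, i.e. only when x == 0. */
int is_zero_via_clz(uint32_t x) {
  return (int)(clz32(x) >> 5);
}

/* Small self-check against the direct comparison. */
void check_clz_shift(void) {
  static const uint32_t inputs[] = {0u, 1u, 2u, 0x00010000u,
                                    0x80000000u, 0xFFFFFFFFu};
  for (int i = 0; i < 6; ++i)
    assert(is_zero_via_clz(inputs[i]) == (inputs[i] == 0));
}
```
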
//===---------------------------------------------------------------------===//

Legalize should lower cttz like this:
  cttz(x) = popcnt((x-1) & ~x)

on targets that have popcnt but not cttz.  itanium, what else?

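Note that (x-1) & ~x is a mask of exactly the zero bits below the lowest set
bit of x, so its population count is the number of trailing zeros (and 32 for
x == 0 on a 32-bit value).  A small sketch with a portable popcount and a
reference loop to compare against; all function names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Portable population count: clears the lowest set bit each iteration. */
unsigned popcount32(uint32_t x) {
  unsigned n = 0;
  while (x != 0) {
    x &= x - 1;
    ++n;
  }
  return n;
}

/* Trailing-zero count through popcount, using the identity above. */
unsigned cttz32_via_popcount(uint32_t x) {
  return popcount32((x - 1) & ~x);
}

/* Straightforward reference: shift right until the low bit is set. */
unsigned cttz32_reference(uint32_t x) {
  unsigned n = 0;
  if (x == 0)
    return 32;
  while ((x & 1u) == 0) {
    x >>= 1;
    ++n;
  }
  return n;
}

/* Small self-check that the two agree. */
void check_cttz(void) {
  static const uint32_t inputs[] = {0u, 1u, 8u, 12u, 0x80000000u, 0xFFFFFFFFu};
  for (int i = 0; i < 6; ++i)
    assert(cttz32_via_popcount(inputs[i]) == cttz32_reference(inputs[i]));
}
```
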
//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
	{
	  /* Flip the target bit of each basis state */
	  reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
	}

Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be just
so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize the reg->size doesn't alias reg->node[i], but
alas...

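To make the "one 32-bit XOR per iteration" point concrete, here is a sketch
that models each 64-bit state as two explicit 32-bit halves.  The state64 type
and the function names are assumptions made for illustration; libquantum's
real node layout is not shown here:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split view of a 64-bit state as two 32-bit halves. */
typedef struct { uint32_t lo, hi; } state64;

/* Flip bit 'target' in every state.  The loop is versioned on which half
   holds the bit, so each iteration performs a single 32-bit XOR. */
void flip_bit_split(state64 *node, int size, unsigned target) {
  assert(target < 64);
  if (target < 32) {
    const uint32_t mask = (uint32_t)1 << target;
    for (int i = 0; i < size; i++)
      node[i].lo ^= mask;
  } else {
    const uint32_t mask = (uint32_t)1 << (target - 32);
    for (int i = 0; i < size; i++)
      node[i].hi ^= mask;
  }
}

/* Reassemble the 64-bit value, used here only to check the result. */
uint64_t join_halves(state64 s) {
  return ((uint64_t)s.hi << 32) | s.lo;
}
```
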
//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine:

unsigned int swap_32(unsigned int v) {
  v = ((v & 0x00ff00ffU) << 8)  | ((v & 0xff00ff00U) >> 8);
  v = ((v & 0x0000ffffU) << 16) | ((v & 0xffff0000U) >> 16);
  return v;
}

Nor is this (yes, it really is bswap):

unsigned long reverse(unsigned v) {
  unsigned t;
  t = v ^ ((v << 16) | (v >> 16));
  t &= ~0xff0000;
  v = (v << 24) | (v >> 8);
  return v ^ (t >> 8);
}

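Both idioms can be checked against a plain byte swap.  The sketch below
restates them with fixed-width types (an assumption, so the check does not
depend on the width of unsigned or unsigned long) and adds a hypothetical
bswap32_reference to compare with:

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward byte swap, used as the reference. */
uint32_t bswap32_reference(uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
         ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* First idiom: swap adjacent bytes, then swap the 16-bit halves. */
uint32_t swap_32(uint32_t v) {
  v = ((v & 0x00ff00ffu) << 8)  | ((v & 0xff00ff00u) >> 8);
  v = ((v & 0x0000ffffu) << 16) | ((v & 0xffff0000u) >> 16);
  return v;
}

/* Second idiom: the rotate-and-xor sequence. */
uint32_t reverse(uint32_t v) {
  uint32_t t;
  t = v ^ ((v << 16) | (v >> 16));
  t &= ~(uint32_t)0xff0000;
  v = (v << 24) | (v >> 8);
  return v ^ (t >> 8);
}

/* Small self-check that both idioms match the reference. */
void check_bswap_idioms(void) {
  static const uint32_t inputs[] = {0u, 0x11223344u, 0xdeadbeefu, 0xffffffffu};
  for (int i = 0; i < 4; ++i) {
    assert(swap_32(inputs[i]) == bswap32_reference(inputs[i]));
    assert(reverse(inputs[i]) == bswap32_reference(inputs[i]));
  }
}
```
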
//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

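A sketch of what the single-load versions amount to, written by hand.  The
helper names are made up; memcpy expresses the possibly unaligned 16-bit load,
and the endianness probe runs at run time only to keep the example portable
(a compiler would know the target's byte order statically):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns nonzero on a little-endian host (checked at run time). */
static int host_is_little_endian(void) {
  const uint16_t one = 1;
  unsigned char first;
  memcpy(&first, &one, 1);
  return first == 1;
}

/* One 16-bit load; memcpy keeps it valid for unaligned addresses. */
static uint16_t load16_native(const unsigned char *adr) {
  uint16_t v;
  assert(adr != NULL);
  memcpy(&v, adr, sizeof v);
  return v;
}

/* Swap the two bytes of a 16-bit value. */
static uint16_t swap16(uint16_t v) {
  return (uint16_t)((v << 8) | (v >> 8));
}

/* Little-endian read: the native load is already right on LE hosts. */
uint16_t read_16_le_single_load(const unsigned char *adr) {
  uint16_t v = load16_native(adr);
  return host_is_little_endian() ? v : swap16(v);
}

/* Big-endian read: the native load is already right on BE hosts. */
uint16_t read_16_be_single_load(const unsigned char *adr) {
  uint16_t v = load16_native(adr);
  return host_is_little_endian() ? swap16(v) : v;
}
```
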
//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X / C1 ), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

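For orientation, this is the kind of rewrite the fold performs in the simplest
case, where everything is unsigned and pred is "equal": comparing a quotient
with a constant becomes a range check on X with no divide.  The constants
C1 = 10 and C2 = 7 and the function names are arbitrary choices for
illustration, and the sketch assumes C1*C2 + C1 does not overflow:

```c
#include <assert.h>

/* Hypothetical constants chosen only for this example. */
static const unsigned C1 = 10u;
static const unsigned C2 = 7u;

/* As written in the source: divide, then compare. */
int quotient_equals(unsigned x) {
  return x / C1 == C2;
}

/* After the fold: x / C1 == C2 holds exactly when
   C1*C2 <= x < C1*C2 + C1, so no divide is needed. */
int in_quotient_range(unsigned x) {
  return x >= C1 * C2 && x < C1 * C2 + C1;
}

/* Small self-check over a range that covers both boundaries. */
void check_div_compare_fold(void) {
  for (unsigned x = 0; x < 1000u; ++x)
    assert(quotient_equals(x) == in_quotient_range(x));
}
```
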
//===---------------------------------------------------------------------===//

Instcombine misses several of these cases (see the testcase in the patch):
http://gcc.gnu.org/ml/gcc-patches/2006-10/msg01519.html

//===---------------------------------------------------------------------===//

viterbi speeds up *significantly* if the various "history" related copy loops
are turned into memcpy calls at the source level.  We need a "loops to memcpy"
pass.

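The shape of loop such a pass would look for, and its replacement, can be
sketched as follows.  The function names and element type are made up, not
taken from viterbi, and the rewrite is only valid when the source and
destination ranges do not overlap:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Element-by-element copy loop, as it might appear in the source. */
void copy_history_loop(int *dst, const int *src, size_t n) {
  for (size_t i = 0; i < n; ++i)
    dst[i] = src[i];
}

/* The same copy as one memcpy call.  The caller must guarantee that the
   two ranges do not overlap. */
void copy_history_memcpy(int *dst, const int *src, size_t n) {
  assert(dst != NULL && src != NULL);
  memcpy(dst, src, n * sizeof *dst);
}
```
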
//===---------------------------------------------------------------------===//

-predsimplify should transform this:

void bad(unsigned x)
{
  if (x > 4)
    bar(12);
  else if (x > 3)
    bar(523);
  else if (x > 2)
    bar(36);
  else if (x > 1)
    bar(65);
  else if (x > 0)
    bar(45);
  else
    bar(367);
}

into:

void good(unsigned x)
{
  if (x == 4)
    bar(523);
  else if (x == 3)
    bar(36);
  else if (x == 2)
    bar(65);
  else if (x == 1)
    bar(45);
  else if (x == 0)
    bar(367);
  else
    bar(12);
}

to enable further optimizations.

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

//===---------------------------------------------------------------------===//
