//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

Add MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (combination of MUL and MULH[SU] into one node).  Add this to
X86, and make the dag combiner produce it when needed.  This will eliminate one
imul from the code generated for:

long long test(long long X, long long Y) { return X*Y; }

by using the EAX result from the mul.  We should add a similar node for
DIVREM.
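
The saving can be sketched in plain C (a sketch, not backend code; mul64 is a
hypothetical name): one x86 MUL of two 32-bit values already yields a full
64-bit product in EDX:EAX, i.e. both the Hi and Lo parts a MUL2U node would
model, so a 64x64 multiply needs only that one widening product plus two
truncated cross terms.

```c
#include <stdint.h>

/* 64x64 multiply from 32-bit pieces: one widening 32x32->64 multiply
   (hi and lo halves come from a single MUL) plus two cross products
   whose upper halves are discarded by the final truncation. */
uint64_t mul64(uint64_t x, uint64_t y) {
    uint32_t xl = (uint32_t)x, xh = (uint32_t)(x >> 32);
    uint32_t yl = (uint32_t)y, yh = (uint32_t)(y >> 32);
    uint64_t lo = (uint64_t)xl * yl;      /* the one MUL: Hi and Lo parts */
    uint32_t cross = xl * yh + xh * yl;   /* only low 32 bits survive */
    return lo + ((uint64_t)cross << 32);
}
```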

Another case is:

long long test(int X, int Y) { return (long long)X*Y; }

... which should only be one imul instruction.

//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
        return X/Y;
}

This can be done trivially with a custom legalizer.  What about overflow
though?  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
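
The overflow hazard can be stated precisely (a sketch; fits_one_div is a
hypothetical helper name): x86 DIV divides the 64-bit value in EDX:EAX by a
32-bit operand but raises #DE when the quotient does not fit in 32 bits, so
the single-instruction lowering is only safe when this predicate holds.

```c
#include <stdint.h>

/* True when X / Y can be done with one 32-bit DIV: the divisor is
   nonzero and the quotient fits in 32 bits (equivalently, the high
   word of X is strictly less than Y). */
int fits_one_div(uint64_t x, uint32_t y) {
    return y != 0 && (x >> 32) < y;   /* quotient < 2^32 */
}
```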

//===---------------------------------------------------------------------===//

Some targets (e.g. Athlons) prefer "ffreep ST(0)" to "fstp ST(0)":
http://gcc.gnu.org/ml/gcc-patches/2004-04/msg00659.html

//===---------------------------------------------------------------------===//

This should use fiadd on chips where it is profitable:
double foo(double P, int *I) { return P+*I; }

//===---------------------------------------------------------------------===//

The FP stackifier needs to be global.  Also, it should handle simple
permutations to reduce the number of shuffle instructions, e.g. turning:

fld P	->		fld Q
fld Q			fld P
fxch

or:

fxch	->		fucomi
fucomi			jl X
jg X

Ideas:
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02410.html

//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
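
The basic decompositions look like this in C (a sketch; the helper names are
illustrative only — on x86 the x*9 form can even fold into a single lea):

```c
/* Multiply-by-constant rewritten as shifts and adds/subs. */
unsigned mul9(unsigned x)  { return (x << 3) + x; }          /* x*9  */
unsigned mul10(unsigned x) { return ((x << 2) + x) << 1; }   /* x*10 */
unsigned mul7(unsigned x)  { return (x << 3) - x; }          /* x*7  */
```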

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.
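
What the i64 shift expands to on a 32-bit target can be sketched in C (shl1 is
a hypothetical name; shift amounts above 63 are assumed not to occur): the two
result halves are selected by whether the shift amount reaches the high word,
which is what the patches above try to do cheaply.

```c
#include <stdint.h>

/* 1LL << x built from 32-bit operations only. */
uint64_t shl1(unsigned x) {
    uint32_t lo = (x < 32) ? (1u << x) : 0;
    uint32_t hi = (x < 32) ? 0 : (1u << (x - 32));
    return ((uint64_t)hi << 32) | lo;
}
```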

//===---------------------------------------------------------------------===//

Compile this:
_Bool f(_Bool a) { return a!=1; }

into:
        movzbl  %dil, %eax
        xorl    $1, %eax
        ret

//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure.  E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?

//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and hint to the register
allocator.  Delay codegen until post register allocation.

//===---------------------------------------------------------------------===//

Add a target specific hook to the DAG combiner to handle SINT_TO_FP and
FP_TO_SINT when the source operand is already in memory.

//===---------------------------------------------------------------------===//

Check if load folding would add a cycle to the dag.

//===---------------------------------------------------------------------===//

Model X86 EFLAGS as a real register to avoid redundant cmp / test.  e.g.

        cmpl $1, %eax
        setg %al
        testb %al, %al   # unnecessary
        jne .BB7

//===---------------------------------------------------------------------===//

Count leading zeros and count trailing zeros:

int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }

$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
clz:
        bsr     %eax, DWORD PTR [%esp+4]
        xor     %eax, 31
        ret
ctz:
        bsf     %eax, DWORD PTR [%esp+4]
        ret

however, check that these are defined for 0 and 32.  Our intrinsics are, GCC's
aren't.
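
Concretely (a sketch assuming GCC/Clang builtins; clz32/ctz32 are hypothetical
names): BSR/BSF leave their destination undefined for a zero input, and GCC's
__builtin_clz/__builtin_ctz inherit that, so a lowering that must be defined
at 0 needs an explicit guard like:

```c
/* clz/ctz wrappers that are defined for a zero input, unlike the raw
   GCC builtins (whose result is undefined when the argument is 0). */
int clz32(unsigned x) { return x ? __builtin_clz(x) : 32; }
int ctz32(unsigned x) { return x ? __builtin_ctz(x) : 32; }
```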

//===---------------------------------------------------------------------===//

Use push/pop instructions in prolog/epilog sequences instead of stores off
ESP (certain code size win, perf win on some [which?] processors).

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor.  They are slower on the P4 because they update only some of the
processor flags.

//===---------------------------------------------------------------------===//

Open code rint, floor, ceil, trunc:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02006.html
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02011.html

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare.  The
pattern is written as (cmp reg, (load p)).  Because the compare isn't
commutative, it is not matched with the load on both sides.  The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.

//===---------------------------------------------------------------------===//

LSR should be turned on for the X86 backend and tuned to take advantage of its
addressing modes.

//===---------------------------------------------------------------------===//

When compiled with unsafe math enabled, "main" should enable SSE DAZ mode and
other fast SSE modes.

//===---------------------------------------------------------------------===//

Think about doing i64 math in SSE regs.

//===---------------------------------------------------------------------===//

The DAG Isel doesn't fold the loads into the adds in this testcase.  The
pattern selector does.  This is because the chain value of the load gets
selected first, and the loads aren't checking to see if they are only used by
an add.

.ll:

int %test(int* %x, int* %y, int* %z) {
        %X = load int* %x
        %Y = load int* %y
        %Z = load int* %z
        %a = add int %X, %Y
        %b = add int %a, %Z
        ret int %b
}

dag isel:

_test:
        movl 4(%esp), %eax
        movl (%eax), %eax
        movl 8(%esp), %ecx
        movl (%ecx), %ecx
        addl %ecx, %eax
        movl 12(%esp), %ecx
        movl (%ecx), %ecx
        addl %ecx, %eax
        ret

pattern isel:

_test:
        movl 12(%esp), %ecx
        movl 4(%esp), %edx
        movl 8(%esp), %eax
        movl (%eax), %eax
        addl (%edx), %eax
        addl (%ecx), %eax
        ret

This is bad for register pressure, though the dag isel is producing a
better schedule. :)

//===---------------------------------------------------------------------===//

This testcase should have no SSE instructions in it, and only one load from
a constant pool:

double %test3(bool %B) {
        %C = select bool %B, double 123.412, double 523.01123123
        ret double %C
}

Currently, the select is being lowered, which prevents the dag combiner from
turning 'select (load CPI1), (load CPI2)' -> 'load (select CPI1, CPI2)'.

The pattern isel got this one right.

//===---------------------------------------------------------------------===//

We need to lower switch statements to tablejumps when appropriate, instead of
always lowering them into binary branch trees.
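
A dense switch is the canonical candidate; in C, the tablejump lowering
amounts to one bounds check plus an indexed lookup instead of a branch tree
(a sketch; both function names are illustrative only):

```c
/* Dense switch: a good tablejump candidate. */
int classify_switch(int c) {
    switch (c) {
    case 0: return 10;
    case 1: return 11;
    case 2: return 12;
    case 3: return 13;
    default: return -1;
    }
}

/* What the table-driven lowering amounts to: bounds check + lookup. */
int classify_table(int c) {
    static const int table[4] = { 10, 11, 12, 13 };
    return ((unsigned)c < 4) ? table[c] : -1;
}
```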

//===---------------------------------------------------------------------===//

SSE doesn't have [mem] op= reg instructions.  If we have an SSE instruction
like this:

  X += y

and the register allocator decides to spill X, it is cheaper to emit this as:

Y += [xslot]
store Y -> [xslot]

than as:

tmp = [xslot]
tmp += y
store tmp -> [xslot]

...and this uses one fewer register (so this should be done at load folding
time, not at spiller time).  *Note* however that this can only be done
if Y is dead.  Here's a testcase:

%.str_3 = external global [15 x sbyte]          ; <[15 x sbyte]*> [#uses=0]
implementation   ; Functions:
declare void %printf(int, ...)
void %main() {
build_tree.exit:
        br label %no_exit.i7
no_exit.i7:             ; preds = %no_exit.i7, %build_tree.exit
        %tmp.0.1.0.i9 = phi double [ 0.000000e+00, %build_tree.exit ], [ %tmp.34.i18, %no_exit.i7 ]      ; <double> [#uses=1]
        %tmp.0.0.0.i10 = phi double [ 0.000000e+00, %build_tree.exit ], [ %tmp.28.i16, %no_exit.i7 ]     ; <double> [#uses=1]
        %tmp.28.i16 = add double %tmp.0.0.0.i10, 0.000000e+00
        %tmp.34.i18 = add double %tmp.0.1.0.i9, 0.000000e+00
        br bool false, label %Compute_Tree.exit23, label %no_exit.i7
Compute_Tree.exit23:            ; preds = %no_exit.i7
        tail call void (int, ...)* %printf( int 0 )
        store double %tmp.34.i18, double* null
        ret void
}

We currently emit:

.BBmain_1:
        xorpd %XMM1, %XMM1
        addsd %XMM0, %XMM1
***     movsd %XMM2, QWORD PTR [%ESP + 8]
***     addsd %XMM2, %XMM1
***     movsd QWORD PTR [%ESP + 8], %XMM2
        jmp .BBmain_1   # no_exit.i7

This is a bugpoint reduced testcase, which is why the testcase doesn't make
much sense (e.g. it's an infinite loop). :)

//===---------------------------------------------------------------------===//

None of the FPStack instructions are handled in
X86RegisterInfo::foldMemoryOperand, which prevents the spiller from
folding spill code into the instructions.

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setl %al
        movzbl %al, %eax
        ret

On some processors (which ones?), it is more efficient to do this:

_test:
        movl 8(%esp), %ebx
        xor %eax, %eax
        cmpl %ebx, 4(%esp)
        setl %al
        ret

Doing this correctly is tricky though, as the xor clobbers the flags.

//===---------------------------------------------------------------------===//

We should generate 'test' instead of 'cmp' in various cases, e.g.:

bool %test(int %X) {
        %Y = shl int %X, ubyte 1
        %C = seteq int %Y, 0
        ret bool %C
}
bool %test(int %X) {
        %Y = and int %X, 8
        %C = seteq int %Y, 0
        ret bool %C
}

This may just be a matter of using 'test' to write bigger patterns for X86cmp.

//===---------------------------------------------------------------------===//

Evaluate whether using movapd for SSE reg-reg moves is faster than using
movsd/movss for them.  It may eliminate false partial register dependences by
writing the whole result register.

//===---------------------------------------------------------------------===//

SSE should implement 'select_cc' using 'emulated conditional moves' that use
pcmp/pand/pandn/por to do a selection instead of a conditional branch:

double %X(double %Y, double %Z, double %A, double %B) {
        %C = setlt double %A, %B
        %z = add double %Z, 0.0    ;; select operand is not a load
        %D = select bool %C, double %Y, double %z
        ret double %D
}

We currently emit:

_X:
        subl $12, %esp
        xorpd %xmm0, %xmm0
        addsd 24(%esp), %xmm0
        movsd 32(%esp), %xmm1
        movsd 16(%esp), %xmm2
        ucomisd 40(%esp), %xmm1
        jb LBB_X_2
LBB_X_1:
        movsd %xmm0, %xmm2
LBB_X_2:
        movsd %xmm2, (%esp)
        fldl (%esp)
        addl $12, %esp
        ret
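
The pcmp/pand/pandn/por idiom can be sketched as scalar C (select_cc here is
an illustrative name, not the DAG node): build an all-ones or all-zeros mask
from the compare, then blend the two values with and/andn/or instead of
branching.

```c
#include <stdint.h>
#include <string.h>

/* Branchless double select: mask is ~0 when a < b, else 0, mirroring
   what pcmp produces; the blend mirrors pand/pandn/por. */
double select_cc(double a, double b, double y, double z) {
    uint64_t mask = (a < b) ? ~0ULL : 0ULL;    /* pcmp  */
    uint64_t yi, zi, ri;
    double r;
    memcpy(&yi, &y, sizeof yi);
    memcpy(&zi, &z, sizeof zi);
    ri = (mask & yi) | (~mask & zi);           /* pand / pandn / por */
    memcpy(&r, &ri, sizeof r);
    return r;
}
```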

//===---------------------------------------------------------------------===//

The x86 backend currently supports dynamic-no-pic.  Need to add asm
printer support for static and PIC.

//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important.  e.g., for:

void setbit(int *target, int bit) {
  *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
  *target &= ~(1 << bit);
}

//===---------------------------------------------------------------------===//

Easy: Global addresses are not always allowed as immediates.  For this:

int dst = 0; int *ptr = 0;
void foo() { ptr = &dst; }

we get this:

_foo:
        movl $_dst, %eax
        movl %eax, _ptr
        ret

when "movl $_dst, _ptr" is sufficient.
429