llvm-exegesis - LLVM Machine Instruction Benchmark
==================================================

.. program:: llvm-exegesis

SYNOPSIS
--------

:program:`llvm-exegesis` [*options*]

DESCRIPTION
-----------

:program:`llvm-exegesis` is a benchmarking tool that uses information available
in LLVM to measure host machine instruction characteristics like latency,
throughput, or port decomposition.

Given an LLVM opcode name and a benchmarking mode, :program:`llvm-exegesis`
generates a code snippet that makes execution as serial (resp. as parallel) as
possible so that we can measure the latency (resp. inverse throughput/uop decomposition)
of the instruction.
The code snippet is jitted and executed on the host subtarget. The time taken
(resp. resource usage) is measured using hardware performance counters. The
result is printed out as YAML to the standard output.

The main goal of this tool is to automatically (in)validate LLVM's TableGen
scheduling models. To that end, we also provide analysis of the results.

:program:`llvm-exegesis` can also benchmark arbitrary user-provided code
snippets.

EXAMPLE 1: benchmarking instructions
------------------------------------

Assume you have an X86-64 machine. To measure the latency of a single
instruction, run:

.. code-block:: bash

    $ llvm-exegesis -mode=latency -opcode-name=ADD64rr

Measuring the uop decomposition or inverse throughput of an instruction works similarly:

.. code-block:: bash

    $ llvm-exegesis -mode=uops -opcode-name=ADD64rr
    $ llvm-exegesis -mode=inverse_throughput -opcode-name=ADD64rr


The output is a YAML document (the default is to write to stdout, but you can
redirect the output to a file using `-benchmarks-file`):

.. code-block:: none

    ---
    key:
      opcode_name: ADD64rr
      mode: latency
      config: ''
    cpu_name: haswell
    llvm_triple: x86_64-unknown-linux-gnu
    num_repetitions: 10000
    measurements:
      - { key: latency, value: 1.0058, debug_string: '' }
    error: ''
    info: 'explicit self cycles, selecting one aliasing configuration.
    Snippet:
    ADD64rr R8, R8, R10
    '
    ...

To measure the latency of all instructions for the host architecture, run:

.. code-block:: bash

    $ llvm-exegesis -mode=latency -opcode-index=-1
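
The invocations in this example can also be generated from a script. The
following is a minimal sketch, not part of the tool: the helper
`build_commands` and the per-mode file names under `/tmp` are hypothetical, and
only options documented on this page are used.

.. code-block:: python

    import shlex

    # The three measuring modes documented for -mode (analysis is excluded).
    MEASURING_MODES = ("latency", "uops", "inverse_throughput")


    def build_commands(opcode_names, output_dir="/tmp"):
        """Return one llvm-exegesis argument list per measuring mode.

        The output file naming (<output_dir>/<mode>.yaml) is a hypothetical
        convention chosen for this sketch.
        """
        commands = []
        for mode in MEASURING_MODES:
            commands.append([
                "llvm-exegesis",
                f"-mode={mode}",
                # Several opcodes may be given as a comma-separated list.
                "-opcode-name=" + ",".join(opcode_names),
                f"-benchmarks-file={output_dir}/{mode}.yaml",
            ])
        return commands


    # Print the commands instead of running them, so they can be reviewed.
    for argv in build_commands(["ADD64rr", "ADD32rr"]):
        print(shlex.join(argv))

Each argument list can be passed to `subprocess.run` unchanged once the
commands look right.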


EXAMPLE 2: benchmarking a custom code snippet
---------------------------------------------

To measure the latency/uops of a custom piece of code, you can specify the
`snippets-file` option (`-` reads from standard input).

.. code-block:: bash

    $ echo "vzeroupper" | llvm-exegesis -mode=uops -snippets-file=-

Real-life code snippets typically depend on registers or memory.
:program:`llvm-exegesis` checks the liveness of registers (i.e. any register
use has a corresponding def or is a "live in"). If your code depends on the
value of some registers, you have two options:

- Mark the register as requiring a definition. :program:`llvm-exegesis` will
  automatically assign a value to the register. This can be done using the
  directive `LLVM-EXEGESIS-DEFREG <reg name> <hex_value>`, where `<hex_value>`
  is a bit pattern used to fill `<reg name>`. If `<hex_value>` is smaller than
  the register width, it will be sign-extended.
- Mark the register as a "live in". :program:`llvm-exegesis` will benchmark
  using whatever value was in this register on entry. This can be done using
  the directive `LLVM-EXEGESIS-LIVEIN <reg name>`.

For example, the following code snippet depends on the values of XMM1 (which
will be set by the tool) and the memory buffer passed in RDI (live in).

.. code-block:: none

    # LLVM-EXEGESIS-LIVEIN RDI
    # LLVM-EXEGESIS-DEFREG XMM1 42
    vmulps (%rdi), %xmm1, %xmm2
    vhaddps %xmm2, %xmm2, %xmm3
    addq $0x10, %rdi
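
The sign extension applied to a short `<hex_value>` in `LLVM-EXEGESIS-DEFREG`
can be illustrated as follows. This is a sketch of the arithmetic only, not the
tool's implementation: the function name `sign_extend_hex` is hypothetical, and
it assumes that the width of the pattern is four bits per hex digit.

.. code-block:: python

    def sign_extend_hex(hex_value, register_bits):
        """Sign-extend a hex bit pattern to register_bits bits.

        Assumption: the pattern's own width is 4 bits per hex digit.
        """
        digits = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
        source_bits = 4 * len(digits)
        value = int(digits, 16)
        register_mask = (1 << register_bits) - 1
        if source_bits >= register_bits:
            # Pattern is at least as wide as the register: keep the low bits.
            return value & register_mask
        if value & (1 << (source_bits - 1)):
            # Top bit of the pattern is set: fill the upper bits with ones.
            value |= register_mask & ~((1 << source_bits) - 1)
        return value


    print(hex(sign_extend_hex("42", 64)))  # 0x42, top bit of 0x42 is clear
    print(hex(sign_extend_hex("80", 16)))  # 0xff80, top bit of 0x80 is set

Under this assumption, a pattern whose leading hex digit is 8 or above fills
the upper bits of the register with ones. If that is not intended, prefix the
pattern with a `0` digit.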


EXAMPLE 3: analysis
-------------------

Assuming you have a set of benchmarked instructions (either latency or uops) as
YAML in file `/tmp/benchmarks.yaml`, you can analyze the results using the
following command:

.. code-block:: bash

    $ llvm-exegesis -mode=analysis \
        -benchmarks-file=/tmp/benchmarks.yaml \
        -analysis-clusters-output-file=/tmp/clusters.csv \
        -analysis-inconsistencies-output-file=/tmp/inconsistencies.html

This will group the instructions into clusters with the same performance
characteristics. The clusters will be written out to `/tmp/clusters.csv` in the
following format:

.. code-block:: none

    cluster_id,opcode_name,config,sched_class
    ...
    2,ADD32ri8_DB,,WriteALU,1.00
    2,ADD32ri_DB,,WriteALU,1.01
    2,ADD32rr,,WriteALU,1.01
    2,ADD32rr_DB,,WriteALU,1.00
    2,ADD32rr_REV,,WriteALU,1.00
    2,ADD64i32,,WriteALU,1.01
    2,ADD64ri32,,WriteALU,1.01
    2,MOVSX64rr32,,BSWAP32r_BSWAP64r_MOVSX64rr32,1.00
    2,VPADDQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.02
    2,VPSUBQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.01
    2,ADD64ri8,,WriteALU,1.00
    2,SETBr,,WriteSETCC,1.01
    ...
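
The clusters file can be post-processed with a few lines of script, for
instance to list which scheduling classes ended up in the same cluster. The
following is a minimal sketch using a shortened copy of the rows above. The
function name `sched_classes_by_cluster` is hypothetical, and the trailing
numeric field of each row is treated as an unnamed measurement column, which is
an assumption since the header lists only four names.

.. code-block:: python

    import csv
    import io
    from collections import defaultdict

    # Shortened sample in the format shown above.
    SAMPLE = """\
    cluster_id,opcode_name,config,sched_class
    2,ADD32ri8_DB,,WriteALU,1.00
    2,ADD32rr,,WriteALU,1.01
    2,MOVSX64rr32,,BSWAP32r_BSWAP64r_MOVSX64rr32,1.00
    2,SETBr,,WriteSETCC,1.01
    """.replace("    ", "")


    def sched_classes_by_cluster(csv_text):
        """Map cluster id -> sched class -> list of opcode names."""
        rows = csv.reader(io.StringIO(csv_text))
        next(rows, None)  # Skip the header row.
        clusters = defaultdict(lambda: defaultdict(list))
        for row in rows:
            if len(row) < 4:
                continue  # Ignore blank or truncated rows.
            # Fields are read by position; anything after the fourth field
            # (the measured values) is ignored here.
            cluster_id, opcode_name, _config, sched_class = row[:4]
            clusters[cluster_id][sched_class].append(opcode_name)
        return {cid: dict(classes) for cid, classes in clusters.items()}


    for cluster_id, classes in sched_classes_by_cluster(SAMPLE).items():
        for sched_class, opcodes in classes.items():
            print(cluster_id, sched_class, ",".join(opcodes))

For a real run, replace `SAMPLE` with the contents of `/tmp/clusters.csv`.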

:program:`llvm-exegesis` will also analyze the clusters to point out
inconsistencies in the scheduling information. The output is an HTML file. For
example, `/tmp/inconsistencies.html` will contain messages like the following:

.. image:: llvm-exegesis-analysis.png
  :align: center

Note that the scheduling class names will be resolved only when
:program:`llvm-exegesis` is compiled in debug mode; otherwise only the class id
will be shown. This does not invalidate any of the analysis results though.

OPTIONS
-------

.. option:: -help

 Print a summary of command line options.

.. option:: -opcode-index=<LLVM opcode index>

 Specify the opcode to measure, by index. Specifying `-1` will result
 in measuring every existing opcode. See example 1 for details.
 Either `opcode-index`, `opcode-name` or `snippets-file` must be set.

.. option:: -opcode-name=<opcode name 1>,<opcode name 2>,...

 Specify the opcode to measure, by name. Several opcodes can be specified as
 a comma-separated list. See example 1 for details.
 Either `opcode-index`, `opcode-name` or `snippets-file` must be set.

.. option:: -snippets-file=<filename>

 Specify the custom code snippet to measure. See example 2 for details.
 Either `opcode-index`, `opcode-name` or `snippets-file` must be set.

.. option:: -mode=[latency|uops|inverse_throughput|analysis]

 Specify the run mode. Note that some modes have additional requirements and options.

 `latency` mode can make use of either RDTSC or LBR.
 `latency[LBR]` is only available on X86 (at least `Skylake`).
 To use LBR in `latency` mode, a positive value must be specified for
 `x86-lbr-sample-period`, together with `--repetition-mode=loop`.

 In `analysis` mode, you also need to specify at least one of the
 `-analysis-clusters-output-file=` and `-analysis-inconsistencies-output-file=`.

.. option:: -x86-lbr-sample-period=<nBranches/sample>

 Specify the LBR sampling period - how many branches before we take a sample.
 When a positive value is specified for this option and when the mode is `latency`,
 we will use LBRs for measuring.
 On choosing the "right" sampling period, a small value is preferred, but throttling
 could occur if the sampling is too frequent. A prime number should be used to
 avoid consistently skipping certain blocks.
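
 One way to follow the prime-number advice is to start from a rough target
 period and round it up to the next prime. The following is a minimal sketch.
 The helper names `is_prime` and `next_prime` are hypothetical, and the target
 of 200 is an arbitrary illustration, not a recommended value.

 .. code-block:: python

     def is_prime(n):
         """Trial-division primality test, adequate for small periods."""
         if n < 2:
             return False
         if n % 2 == 0:
             return n == 2
         divisor = 3
         while divisor * divisor <= n:
             if n % divisor == 0:
                 return False
             divisor += 2
         return True


     def next_prime(n):
         """Return the smallest prime that is greater than or equal to n."""
         candidate = max(n, 2)
         while not is_prime(candidate):
             candidate += 1
         return candidate


     # Round an arbitrary target period up to a prime.
     print(f"-x86-lbr-sample-period={next_prime(200)}")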

.. option:: -repetition-mode=[duplicate|loop|min]

 Specify the repetition mode. `duplicate` will create a large, straight-line
 basic block with `num-repetitions` copies of the snippet. `loop` will wrap
 the snippet in a loop which will be run `num-repetitions` times. The `loop`
 mode tends to better hide the effects of the CPU frontend on architectures
 that cache decoded instructions, but consumes a register for counting
 iterations. If performing an analysis over many opcodes, it may be best
 to instead use the `min` mode, which will run each of the other modes and
 produce the minimal measured result.

.. option:: -num-repetitions=<Number of repetitions>

 Specify the number of repetitions of the asm snippet.
 Higher values lead to more accurate measurements but lengthen the benchmark.

.. option:: -max-configs-per-opcode=<value>

 Specify the maximum number of configurations that can be generated for each
 opcode. By default this is `1`, meaning that we assume that a single
 measurement is enough to characterize an opcode. This might not be true of all
 instructions: for example, the performance characteristics of the LEA
 instruction on X86 depend on the values of assigned registers and immediates.
 Setting `-max-configs-per-opcode` to a value larger than `1` allows
 `llvm-exegesis` to explore more configurations to discover if some register or
 immediate assignments lead to different performance characteristics.


.. option:: -benchmarks-file=</path/to/file>

 File to read (`analysis` mode) or write (`latency`/`uops`/`inverse_throughput`
 modes) benchmark results. "-" uses stdin/stdout.

.. option:: -analysis-clusters-output-file=</path/to/file>

 If provided, write the analysis clusters as CSV to this file. "-" prints to
 stdout. By default, this analysis is not run.

.. option:: -analysis-inconsistencies-output-file=</path/to/file>

 If non-empty, write inconsistencies found during analysis to this file. `-`
 prints to stdout. By default, this analysis is not run.

.. option:: -analysis-clustering=[dbscan,naive]

 Specify the clustering algorithm to use. By default DBSCAN will be used.
 The naive clustering algorithm is better suited to further work on the
 `-analysis-inconsistencies-output-file=` output: it creates one cluster
 per opcode and checks that the cluster is stable (all points are neighbours).
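
 The stability check can be pictured as a pairwise distance test. The following
 is an illustrative sketch only, not the tool's implementation. It assumes that
 two points are neighbours when their Euclidean distance is at most some
 epsilon, and the function name `is_stable_cluster` is hypothetical.

 .. code-block:: python

     import math
     from itertools import combinations


     def is_stable_cluster(points, epsilon):
         """Return True if every pair of points lies within epsilon.

         Each point is a tuple of measured values for one benchmark of the
         opcode. The Euclidean metric is an assumption made for this sketch.
         """
         return all(
             math.dist(p, q) <= epsilon for p, q in combinations(points, 2)
         )


     # Three latency measurements of one opcode that agree closely.
     print(is_stable_cluster([(1.00,), (1.01,), (1.02,)], epsilon=0.1))
     # Two measurements that disagree, so the opcode would be unstable.
     print(is_stable_cluster([(1.0,), (2.0,)], epsilon=0.1))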

.. option:: -analysis-numpoints=<dbscan numPoints parameter>

 Specify the numPoints parameter to be used for DBSCAN clustering
 (`analysis` mode, DBSCAN only).

.. option:: -analysis-clustering-epsilon=<dbscan epsilon parameter>

 Specify the epsilon parameter used for clustering of benchmark points
 (`analysis` mode).

.. option:: -analysis-inconsistency-epsilon=<epsilon>

 Specify the epsilon parameter used for detection of when the cluster
 is different from the LLVM schedule profile values (`analysis` mode).

.. option:: -analysis-display-unstable-clusters

 If there is more than one benchmark for an opcode, said benchmarks may end up
 not being clustered into the same cluster if the measured performance
 characteristics are different. By default all such opcodes are filtered out.
 This flag will instead show only such unstable opcodes.

.. option:: -ignore-invalid-sched-class=false

 If set, ignore instructions that do not have a sched class (class idx = 0).

.. option:: -mcpu=<cpu name>

 If set, measure the cpu characteristics using the counters for this CPU. This
 is useful when creating new sched models (the host CPU is unknown to LLVM).

.. option:: --dump-object-to-disk=true

 By default, llvm-exegesis will dump the generated code to a temporary file to
 enable code inspection. You may disable it to speed up the execution and save
 disk space.

EXIT STATUS
-----------

:program:`llvm-exegesis` returns 0 on success. Otherwise, an error message is
printed to standard error, and the tool returns a non-zero value.