llvm-exegesis - LLVM Machine Instruction Benchmark
==================================================

.. program:: llvm-exegesis

SYNOPSIS
--------

:program:`llvm-exegesis` [*options*]

DESCRIPTION
-----------

:program:`llvm-exegesis` is a benchmarking tool that uses information available
in LLVM to measure host machine instruction characteristics like latency,
throughput, or port decomposition.

Given an LLVM opcode name and a benchmarking mode, :program:`llvm-exegesis`
generates a code snippet that makes execution as serial (resp. as parallel) as
possible so that we can measure the latency (resp. inverse throughput/uop
decomposition) of the instruction.
The code snippet is jitted and executed on the host subtarget. The time taken
(resp. resource usage) is measured using hardware performance counters. The
result is printed out as YAML to the standard output.

The main goal of this tool is to automatically (in)validate LLVM's TableGen
scheduling models. To that end, we also provide analysis of the results.

:program:`llvm-exegesis` can also benchmark arbitrary user-provided code
snippets.

EXAMPLE 1: benchmarking instructions
------------------------------------

Assume you have an X86-64 machine. To measure the latency of a single
instruction, run:

.. code-block:: bash

    $ llvm-exegesis -mode=latency -opcode-name=ADD64rr

Measuring the uop decomposition or inverse throughput of an instruction works
similarly:

.. code-block:: bash

    $ llvm-exegesis -mode=uops -opcode-name=ADD64rr
    $ llvm-exegesis -mode=inverse_throughput -opcode-name=ADD64rr


The output is a YAML document (the default is to write to stdout, but you can
redirect the output to a file using `-benchmarks-file`):

.. code-block:: none

    ---
    key:
      opcode_name:     ADD64rr
      mode:            latency
      config:          ''
    cpu_name:        haswell
    llvm_triple:     x86_64-unknown-linux-gnu
    num_repetitions: 10000
    measurements:
      - { key: latency, value: 1.0058, debug_string: '' }
    error:           ''
    info:            'explicit self cycles, selecting one aliasing configuration.
    Snippet:
    ADD64rr R8, R8, R10
    '
    ...

To measure the latency of all instructions for the host architecture, run:

.. code-block:: bash

    $ llvm-exegesis -mode=latency -opcode-index=-1
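
As a minimal sketch of how several opcodes might be benchmarked in one run and
the results kept for later analysis, the following combines the comma-separated
form of `-opcode-name` with `-benchmarks-file`. The opcode list and the output
path are illustrative assumptions.

.. code-block:: bash

    # Illustrative output path; any writable location should work.
    out=/tmp/benchmarks.yaml

    # Benchmark two opcodes in latency mode and write the YAML results to
    # $out instead of stdout.
    llvm-exegesis -mode=latency -opcode-name=ADD64rr,ADD32rr \
        -benchmarks-file="$out"

The resulting file can then be passed to `-mode=analysis` through the same
`-benchmarks-file` option.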


EXAMPLE 2: benchmarking a custom code snippet
---------------------------------------------

To measure the latency/uops of a custom piece of code, you can specify the
`snippets-file` option (`-` reads from standard input).

.. code-block:: bash

    $ echo "vzeroupper" | llvm-exegesis -mode=uops -snippets-file=-

Real-life code snippets typically depend on registers or memory.
:program:`llvm-exegesis` checks the liveness of registers (i.e. any register
use has a corresponding def or is a "live in"). If your code depends on the
value of some registers, you have two options:

- Mark the register as requiring a definition. :program:`llvm-exegesis` will
  automatically assign a value to the register. This can be done using the
  directive `LLVM-EXEGESIS-DEFREG <reg name> <hex_value>`, where `<hex_value>`
  is a bit pattern used to fill `<reg name>`. If `<hex_value>` is smaller than
  the register width, it will be sign-extended.
- Mark the register as a "live in". :program:`llvm-exegesis` will benchmark
  using whatever value was in this register on entry. This can be done using
  the directive `LLVM-EXEGESIS-LIVEIN <reg name>`.

For example, the following code snippet depends on the values of XMM1 (which
will be set by the tool) and the memory buffer passed in RDI (live in).

.. code-block:: none

    # LLVM-EXEGESIS-LIVEIN RDI
    # LLVM-EXEGESIS-DEFREG XMM1 42
    vmulps (%rdi), %xmm1, %xmm2
    vhaddps %xmm2, %xmm2, %xmm3
    addq $0x10, %rdi

EXAMPLE 3: analysis
-------------------

Assuming you have a set of benchmarked instructions (either latency or uops) as
YAML in file `/tmp/benchmarks.yaml`, you can analyze the results using the
following command:

.. code-block:: bash

    $ llvm-exegesis -mode=analysis \
        -benchmarks-file=/tmp/benchmarks.yaml \
        -analysis-clusters-output-file=/tmp/clusters.csv \
        -analysis-inconsistencies-output-file=/tmp/inconsistencies.html

This will group the instructions into clusters with the same performance
characteristics. The clusters will be written out to `/tmp/clusters.csv` in the
following format:

.. code-block:: none

    cluster_id,opcode_name,config,sched_class
    ...
    2,ADD32ri8_DB,,WriteALU,1.00
    2,ADD32ri_DB,,WriteALU,1.01
    2,ADD32rr,,WriteALU,1.01
    2,ADD32rr_DB,,WriteALU,1.00
    2,ADD32rr_REV,,WriteALU,1.00
    2,ADD64i32,,WriteALU,1.01
    2,ADD64ri32,,WriteALU,1.01
    2,MOVSX64rr32,,BSWAP32r_BSWAP64r_MOVSX64rr32,1.00
    2,VPADDQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.02
    2,VPSUBQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.01
    2,ADD64ri8,,WriteALU,1.00
    2,SETBr,,WriteSETCC,1.01
    ...
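
Because the clusters file is plain CSV, it can be summarized with standard text
tools. The sketch below counts how many opcodes each scheduling class
contributes to each cluster. It builds a small sample file from three rows of
the listing above so that it runs on its own; the sample path and the choice of
`awk` are assumptions of this example, not part of :program:`llvm-exegesis`.

.. code-block:: bash

    # Illustrative sample; in practice, point "clusters" at the file written
    # by -analysis-clusters-output-file.
    clusters=/tmp/clusters-sample.csv
    printf '%s\n' \
        'cluster_id,opcode_name,config,sched_class' \
        '2,ADD32rr,,WriteALU,1.01' \
        '2,ADD64ri8,,WriteALU,1.00' \
        '2,SETBr,,WriteSETCC,1.01' > "$clusters"

    # Skip the header (NR > 1), count rows per (cluster_id, sched_class)
    # pair, and print one "cluster_id,sched_class,count" line per pair.
    summary=$(awk -F, 'NR > 1 { n[$1 "," $4]++ }
        END { for (k in n) print k "," n[k] }' "$clusters" | sort)
    echo "$summary"

A cluster that spans several scheduling classes, as cluster 2 does in the
listing, shows up as several lines sharing the same cluster id.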

:program:`llvm-exegesis` will also analyze the clusters to point out
inconsistencies in the scheduling information. The output is an HTML file. For
example, `/tmp/inconsistencies.html` will contain messages like the following:

.. image:: llvm-exegesis-analysis.png
  :align: center

Note that the scheduling class names will be resolved only when
:program:`llvm-exegesis` is compiled in debug mode, else only the class id will
be shown. This does not invalidate any of the analysis results though.

OPTIONS
-------

.. option:: -help

 Print a summary of command line options.

.. option:: -opcode-index=<LLVM opcode index>

 Specify the opcode to measure, by index. Specifying `-1` will result
 in measuring every existing opcode. See example 1 for details.
 Either `opcode-index`, `opcode-name` or `snippets-file` must be set.

.. option:: -opcode-name=<opcode name 1>,<opcode name 2>,...

 Specify the opcode to measure, by name. Several opcodes can be specified as
 a comma-separated list. See example 1 for details.
 Either `opcode-index`, `opcode-name` or `snippets-file` must be set.

.. option:: -snippets-file=<filename>

 Specify the custom code snippet to measure. See example 2 for details.
 Either `opcode-index`, `opcode-name` or `snippets-file` must be set.

.. option:: -mode=[latency|uops|inverse_throughput|analysis]

 Specify the run mode. Note that some modes have additional requirements and
 options.

 `latency` mode can make use of either RDTSC or LBR.
 `latency[LBR]` is only available on X86 (at least `Skylake`).
 To use LBR in `latency` mode, a positive value must be specified for
 `x86-lbr-sample-period`, together with `--repetition-mode=loop`.

 In `analysis` mode, you also need to specify at least one of the
 `-analysis-clusters-output-file=` and `-analysis-inconsistencies-output-file=`
 options.

.. option:: -x86-lbr-sample-period=<nBranches/sample>

 Specify the LBR sampling period - how many branches before we take a sample.
 When a positive value is specified for this option and when the mode is
 `latency`, we will use LBRs for measuring.
 On choosing the "right" sampling period, a small value is preferred, but
 throttling could occur if the sampling is too frequent. A prime number should
 be used to avoid consistently skipping certain blocks.
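
The following sketch shows one way to pick a sampling period and start an
LBR-based latency run. The value `521` is an arbitrary small prime chosen for
illustration, not a recommended setting, and the `is_prime` helper is
hypothetical shell code that only guards against passing a non-prime by
mistake.

.. code-block:: bash

    # Hypothetical helper: succeeds if $1 is prime (trial division).
    is_prime() {
        n=$1
        [ "$n" -ge 2 ] || return 1
        i=2
        while [ $((i * i)) -le "$n" ]; do
            [ $((n % i)) -ne 0 ] || return 1
            i=$((i + 1))
        done
        return 0
    }

    # Illustrative sampling period; tune it for the machine at hand.
    period=521

    if is_prime "$period"; then
        # LBR measurement needs latency mode, a positive sampling period,
        # and the loop repetition mode.
        llvm-exegesis -mode=latency -opcode-name=ADD64rr \
            -x86-lbr-sample-period="$period" -repetition-mode=loop
    else
        echo "sampling period $period is not prime" >&2
    fi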

.. option:: -repetition-mode=[duplicate|loop|min]

 Specify the repetition mode. `duplicate` will create a large, straight line
 basic block with `num-repetitions` copies of the snippet. `loop` will wrap
 the snippet in a loop which will be run `num-repetitions` times. The `loop`
 mode tends to better hide the effects of the CPU frontend on architectures
 that cache decoded instructions, but consumes a register for counting
 iterations. If performing an analysis over many opcodes, it may be best
 to instead use the `min` mode, which runs each of the other modes and
 reports the minimal measured result.

.. option:: -num-repetitions=<Number of repetitions>

 Specify the number of repetitions of the asm snippet.
 Higher values lead to more accurate measurements but lengthen the benchmark.

.. option:: -max-configs-per-opcode=<value>

 Specify the maximum configurations that can be generated for each opcode.
 By default this is `1`, meaning that we assume that a single measurement is
 enough to characterize an opcode. This might not be true of all instructions:
 for example, the performance characteristics of the LEA instruction on X86
 depend on the value of assigned registers and immediates. Setting a value of
 `-max-configs-per-opcode` larger than `1` allows `llvm-exegesis` to explore
 more configurations to discover if some register or immediate assignments
 lead to different performance characteristics.


.. option:: -benchmarks-file=</path/to/file>

 File to read (`analysis` mode) or write (`latency`/`uops`/`inverse_throughput`
 modes) benchmark results. "-" uses stdin/stdout.

.. option:: -analysis-clusters-output-file=</path/to/file>

 If provided, write the analysis clusters as CSV to this file. "-" prints to
 stdout. By default, this analysis is not run.

.. option:: -analysis-inconsistencies-output-file=</path/to/file>

 If non-empty, write inconsistencies found during analysis to this file. "-"
 prints to stdout. By default, this analysis is not run.

.. option:: -analysis-clustering=[dbscan,naive]

 Specify the clustering algorithm to use. By default DBSCAN will be used.
 The naive clustering algorithm is better suited for doing further work on the
 `-analysis-inconsistencies-output-file=` output: it creates one cluster
 per opcode and checks that the cluster is stable (all points are neighbours).

.. option:: -analysis-numpoints=<dbscan numPoints parameter>

 Specify the numPoints parameter to be used for DBSCAN clustering
 (`analysis` mode, DBSCAN only).

.. option:: -analysis-clustering-epsilon=<dbscan epsilon parameter>

 Specify the epsilon parameter used for clustering of benchmark points
 (`analysis` mode).

.. option:: -analysis-inconsistency-epsilon=<epsilon>

 Specify the epsilon parameter used for detection of when the cluster
 is different from the LLVM schedule profile values (`analysis` mode).

.. option:: -analysis-display-unstable-clusters

 If there is more than one benchmark for an opcode, said benchmarks may end up
 not being clustered into the same cluster if the measured performance
 characteristics are different. By default all such opcodes are filtered out.
 This flag will instead show only such unstable opcodes.
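
As a sketch of how this flag might be combined with
`-max-configs-per-opcode`, the following benchmarks an LEA opcode under
several configurations and then reports only the unstable opcodes. The opcode
name `LEA64r`, the limit of `4` configurations, and the file path are
assumptions made for illustration.

.. code-block:: bash

    # Illustrative path for the benchmark results.
    bench=/tmp/lea-benchmarks.yaml

    # Explore up to 4 configurations of LEA64r (an arbitrary limit).
    llvm-exegesis -mode=latency -opcode-name=LEA64r \
        -max-configs-per-opcode=4 -benchmarks-file="$bench"

    # Analyze the results, showing only opcodes whose benchmarks did not
    # end up in the same cluster; "-" prints the report to stdout.
    llvm-exegesis -mode=analysis -benchmarks-file="$bench" \
        -analysis-display-unstable-clusters \
        -analysis-inconsistencies-output-file=-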

.. option:: -ignore-invalid-sched-class=false

 If set, ignore instructions that do not have a sched class (class idx = 0).

.. option:: -mcpu=<cpu name>

 If set, measure the cpu characteristics using the counters for this CPU. This
 is useful when creating new sched models (the host CPU is unknown to LLVM).

.. option:: --dump-object-to-disk=true

 By default, llvm-exegesis will dump the generated code to a temporary file to
 enable code inspection. You may disable it to speed up the execution and save
 disk space.

EXIT STATUS
-----------

:program:`llvm-exegesis` returns 0 on success. Otherwise, an error message is
printed to standard error, and the tool returns a non-zero value.
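
In a script, the exit status can be tested directly. The sketch below records
whether a run succeeded; the variable name `status` and the output path are
illustrative.

.. code-block:: bash

    # Run one benchmark and branch on the exit status: 0 means success,
    # anything else means an error was reported on standard error.
    if llvm-exegesis -mode=latency -opcode-name=ADD64rr \
            -benchmarks-file=/tmp/benchmarks.yaml; then
        status=ok
    else
        status=failed
        echo "llvm-exegesis failed; see the error message above" >&2
    fi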