llvm-mca - LLVM Machine Code Analyzer
=====================================

SYNOPSIS
--------

:program:`llvm-mca` [*options*] [input]

DESCRIPTION
-----------

:program:`llvm-mca` is a performance analysis tool that uses information
available in LLVM (e.g. scheduling models) to statically measure the performance
of machine code on a specific CPU.

Performance is measured in terms of throughput as well as processor resource
consumption. The tool currently works for processors with an out-of-order
backend, for which there is a scheduling model available in LLVM.

The main goal of this tool is not just to predict the performance of the code
when run on the target, but also to help with diagnosing potential performance
issues.

Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
Per Cycle (IPC), as well as hardware resource pressure. The analysis and
reporting style were inspired by the IACA tool from Intel.

:program:`llvm-mca` allows the usage of special code comments to mark regions of
the assembly code to be analyzed. A comment starting with the substring
``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment starting with
the substring ``LLVM-MCA-END`` marks the end of a code region. For example:

.. code-block:: none

  # LLVM-MCA-BEGIN My Code Region
    ...
  # LLVM-MCA-END

Multiple regions can be specified provided that they do not overlap. A code
region can have an optional description. If no user-defined region is specified,
then :program:`llvm-mca` assumes a default region which contains every
instruction in the input file. Every region is analyzed in isolation, and the
final performance report is the union of all the reports generated for every
code region.

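The marker rules above can be illustrated with a short Python sketch. It is not
how :program:`llvm-mca` parses its input; the function name ``split_regions`` is
hypothetical, and the sketch assumes that ``#`` starts a comment and that
regions do not overlap.

.. code-block:: python

  def split_regions(asm_text):
      """Group assembly lines by LLVM-MCA-BEGIN/LLVM-MCA-END marker comments.

      Illustrative only: assumes '#' starts a comment and regions do not overlap.
      """
      regions = []    # list of (description, [instruction lines])
      current = None  # region being collected, or None outside of a region
      for raw in asm_text.splitlines():
          line = raw.strip()
          if line.startswith("#"):
              comment = line[1:].strip()
              if comment.startswith("LLVM-MCA-BEGIN"):
                  # Anything after the marker is the optional description.
                  description = comment[len("LLVM-MCA-BEGIN"):].strip()
                  current = (description, [])
              elif comment.startswith("LLVM-MCA-END") and current is not None:
                  regions.append(current)
                  current = None
          elif line and current is not None:
              current[1].append(line)
      return regions


  example = """
  # LLVM-MCA-BEGIN foo
  addl $42, %edi
  # LLVM-MCA-END
  imull %esi, %edi
  """
  print(split_regions(example))

Only the ``addl`` instruction ends up in the region named ``foo``; the ``imull``
instruction lies outside of any marked region and is dropped by this sketch.
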
Inline assembly directives may be used from source code to annotate the
assembly text:

.. code-block:: c++

  int foo(int a, int b) {
    __asm volatile("# LLVM-MCA-BEGIN foo");
    a += 42;
    __asm volatile("# LLVM-MCA-END");
    a *= b;
    return a;
  }

For example, you can compile code with clang, output assembly, and pipe it
directly into :program:`llvm-mca` for analysis:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

Or for Intel syntax:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

OPTIONS
-------

If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard
input. Otherwise, it will read from the specified filename.

If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output
to standard output if the input is from standard input. If the :option:`-o`
option specifies "``-``", then the output will also be sent to standard output.


.. option:: -help

 Print a summary of command line options.

.. option:: -mtriple=<target triple>

 Specify a target triple string.

.. option:: -march=<arch>

 Specify the architecture for which to analyze the code. It defaults to the
 host default target.

.. option:: -mcpu=<cpuname>

 Specify the processor for which to analyze the code. By default, the cpu name
 is autodetected from the host.

.. option:: -output-asm-variant=<variant id>

 Specify the output assembly variant for the report generated by the tool.
 On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly
 format, and a value of 1 selects the Intel assembly format, for the code
 printed out by the tool in the analysis report.

.. option:: -dispatch=<width>

 Specify a different dispatch width for the processor. The dispatch width
 defaults to field 'IssueWidth' in the processor scheduling model. If width is
 zero, then the default dispatch width is used.

.. option:: -register-file-size=<size>

 Specify the size of the register file. When specified, this flag limits how
 many physical registers are available for register renaming purposes. A value
 of zero for this flag means "unlimited number of physical registers".

.. option:: -iterations=<number of iterations>

 Specify the number of iterations to run. If this flag is set to 0, then the
 tool sets the number of iterations to a default value (i.e. 100).

.. option:: -noalias=<bool>

 If set, the tool assumes that loads and stores don't alias. This is the
 default behavior.

.. option:: -lqueue=<load queue size>

 Specify the size of the load queue in the load/store unit emulated by the tool.
 By default, the tool assumes an unbounded number of entries in the load queue.
 A value of zero for this flag is ignored, and the default load queue size is
 used instead.

.. option:: -squeue=<store queue size>

 Specify the size of the store queue in the load/store unit emulated by the
 tool. By default, the tool assumes an unbounded number of entries in the store
 queue. A value of zero for this flag is ignored, and the default store queue
 size is used instead.

.. option:: -timeline

 Enable the timeline view.

.. option:: -timeline-max-iterations=<iterations>

 Limit the number of iterations to print in the timeline view. By default, the
 timeline view prints information for up to 10 iterations.

.. option:: -timeline-max-cycles=<cycles>

 Limit the number of cycles in the timeline view. By default, the number of
 cycles is set to 80.

.. option:: -resource-pressure

 Enable the resource pressure view. This is enabled by default.

.. option:: -register-file-stats

 Enable register file usage statistics.

.. option:: -dispatch-stats

 Enable extra dispatch statistics. This view collects and analyzes instruction
 dispatch events, as well as static/dynamic dispatch stall events. This view
 is disabled by default.

.. option:: -scheduler-stats

 Enable extra scheduler statistics. This view collects and analyzes instruction
 issue events. This view is disabled by default.

.. option:: -retire-stats

 Enable extra retire control unit statistics. This view is disabled by default.

.. option:: -instruction-info

 Enable the instruction info view. This is enabled by default.

.. option:: -all-stats

 Print all hardware statistics. This enables extra statistics related to the
 dispatch logic, the hardware schedulers, the register file(s), and the retire
 control unit. This option is disabled by default.

.. option:: -all-views

 Enable all the views.

.. option:: -instruction-tables

 Prints resource pressure information based on the static information
 available from the processor model. This differs from the resource pressure
 view because it doesn't require that the code is simulated. It instead prints
 the theoretical uniform distribution of resource pressure for every
 instruction in sequence.


EXIT STATUS
-----------

:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.

HOW LLVM-MCA WORKS
------------------

:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
into a sequence of MCInst with the help of the existing LLVM target assembly
parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
to generate a performance report.

The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects
a number of execution related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics.

Here is an example of a performance report generated by the tool for a
dot-product of two packed float vectors of four elements. The analysis is
conducted for target x86, cpu btver2. The report can be produced with the
following command, using the example located at
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

.. code-block:: none

  Iterations:        300
  Instructions:      900
  Total Cycles:      610
  Dispatch Width:    2
  IPC:               1.48
  Block RThroughput: 2.0


  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      2     1.00                        vmulps   %xmm0, %xmm1, %xmm2
   1      3     1.00                        vhaddps  %xmm2, %xmm2, %xmm3
   1      3     1.00                        vhaddps  %xmm3, %xmm3, %xmm4


  Resources:
  [0]   - JALU0
  [1]   - JALU1
  [2]   - JDiv
  [3]   - JFPA
  [4]   - JFPM
  [5]   - JFPU0
  [6]   - JFPU1
  [7]   - JLAGU
  [8]   - JMul
  [9]   - JSAGU
  [10]  - JSTC
  [11]  - JVALU0
  [12]  - JVALU1
  [13]  - JVIMUL


  Resource pressure per iteration:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
   -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

  Resource pressure by instruction:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
   -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps   %xmm0, %xmm1, %xmm2
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm2, %xmm2, %xmm3
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm3, %xmm3, %xmm4

According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 dynamically executed instructions.

The report is structured in three main sections. The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. In this example, the two important
performance indicators are **IPC** and **Block RThroughput** (Block Reciprocal
Throughput).

IPC is computed by dividing the total number of simulated instructions by the
total number of cycles. A delta between Dispatch Width and IPC is an indicator
of a performance issue. In the absence of loop-carried data dependencies, the
observed IPC tends to a theoretical maximum which can be computed by dividing
the number of instructions of a single iteration by the *Block RThroughput*.

IPC is bounded from above by the dispatch width. That is because the dispatch
width limits the maximum size of a dispatch group. IPC is also limited by the
amount of hardware parallelism. The availability of hardware resources affects
the resource pressure distribution, and it limits the number of instructions
that can be executed in parallel every cycle. A delta between Dispatch
Width and the theoretical maximum IPC is an indicator of a performance
bottleneck caused by the lack of hardware resources. In general, the lower the
Block RThroughput, the better.

In this example, ``Instructions per iteration/Block RThroughput`` is 1.50. Since
there are no loop-carried dependencies, the observed IPC is expected to approach
1.50 when the number of iterations tends to infinity. The delta between the
Dispatch Width (2.00) and the theoretical maximum IPC (1.50) is an indicator of
a performance bottleneck caused by the lack of hardware resources, and the
*Resource pressure view* can help to identify the problematic resource usage.

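The arithmetic behind these two indicators can be reproduced with a few lines
of Python. This is only a sketch of the formulas stated above, using the numbers
from the example report; the function names ``ipc`` and ``max_ipc`` are
illustrative and are not part of the tool.

.. code-block:: python

  def ipc(instructions, cycles):
      """Observed IPC: simulated instructions divided by simulated cycles."""
      return instructions / cycles


  def max_ipc(instructions_per_iteration, block_rthroughput, dispatch_width):
      """Theoretical maximum IPC, which the dispatch width bounds from above."""
      return min(instructions_per_iteration / block_rthroughput, dispatch_width)


  observed = ipc(instructions=900, cycles=610)
  limit = max_ipc(instructions_per_iteration=3, block_rthroughput=2.0,
                  dispatch_width=2)

  print(f"observed IPC:            {observed:.2f}")
  print(f"theoretical maximum IPC: {limit:.2f}")

With 300 iterations the observed IPC (about 1.48) is already close to the
theoretical maximum of 1.50.
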
The second section of the report shows the latency and reciprocal
throughput of every instruction in the sequence. That section also reports
extra information related to the number of micro opcodes, and opcode properties
(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

The third section is the *Resource pressure view*. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
structured in two tables. The first table reports the number of resource cycles
spent on average every iteration. The second table correlates the resource
cycles to the machine instruction in the sequence. For example, every iteration
of the instruction vmulps always executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
per iteration. Note that on AMD Jaguar, vector floating-point multiply can
only be issued to pipeline JFPU1, while horizontal floating-point additions can
only be issued to pipeline JFPU0.

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided. Ideally,
pressure should be uniformly distributed between multiple resources.

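As a rough illustration of how the first table can be read, the Python sketch
below picks out the most heavily used resources from the per-iteration numbers
of the example report. The helper ``hottest_resources`` is hypothetical, and the
dictionary simply transcribes the non-empty columns of the table.

.. code-block:: python

  # Per-iteration resource cycles, copied from the example report.
  pressure = {"JFPA": 2.00, "JFPM": 1.00, "JFPU0": 2.00, "JFPU1": 1.00}


  def hottest_resources(pressure):
      """Return the peak resource cycles and the resources that reach it."""
      peak = max(pressure.values())
      names = sorted(name for name, cycles in pressure.items() if cycles == peak)
      return peak, names


  peak, names = hottest_resources(pressure)
  print(peak, names)

Here JFPA and JFPU0 each consume two resource cycles per iteration. In this
report that peak coincides with the Block RThroughput of 2.0, which is
consistent with the two horizontal adds being the limiting factor.
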
Timeline View
^^^^^^^^^^^^^
The timeline view produces a detailed report of each instruction's state
transitions through an instruction pipeline. This view is enabled by the
command line option ``-timeline``. As instructions transition through the
various stages of the pipeline, their states are depicted in the view report.
These states are represented by the following characters:

* D : Instruction dispatched.
* e : Instruction executing.
* E : Instruction executed.
* R : Instruction retired.
* = : Instruction already dispatched, waiting to be executed.
* \- : Instruction executed, waiting to be retired.

Below is the timeline view for a subset of the dot-product example located in
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
:program:`llvm-mca` using the following command:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

.. code-block:: none

  Timeline view:
                      012345
  Index     0123456789

  [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
  [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
  [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
  [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
  [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
  [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
  [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
  [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
  [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


  Average Wait times (based on the timeline view):
  [0]: Executions
  [1]: Average time spent waiting in a scheduler's queue
  [2]: Average time spent waiting in a scheduler's queue while ready
  [3]: Average time elapsed from WB until retire stage

        [0]    [1]    [2]    [3]
  0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
  1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
  2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4

The timeline view is interesting because it shows instruction state changes
during execution. It also gives an idea of how the tool processes instructions
executed on the target, and how their timing information might be calculated.

The timeline view is structured in two tables. The first table shows
instructions changing state over time (measured in cycles); the second table
(named *Average Wait times*) reports useful timing statistics, which should
help diagnose performance bottlenecks caused by long data dependencies and
sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices, where
the first index identifies an iteration, and the second index is the
instruction index (i.e., where it appears in the code sequence). Since this
example was generated using 3 iterations (``-iterations=3``), the iteration
indices range from 0 to 2 inclusive.

Excluding the first and last columns, the remaining columns are in cycles.
Cycles are numbered sequentially starting from 0.

From the example output above, we know the following:

* Instruction [1,0] was dispatched at cycle 1.
* Instruction [1,0] started executing at cycle 2.
* Instruction [1,0] reached the write back stage at cycle 4.
* Instruction [1,0] was retired at cycle 10.

Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
scheduler's queue for the operands to become available. By the time vmulps is
dispatched, operands are already available, and pipeline JFPU1 is ready to
serve another instruction. So the instruction can be immediately issued on the
JFPU1 pipeline. That is demonstrated by the fact that the instruction only
spent 1cy in the scheduler's queue.

There is a gap of 5 cycles between the write-back stage and the retire event.
That is because instructions must retire in program order, so [1,0] has to wait
for [0,2] to be retired first (i.e., it has to wait until cycle 10).

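The reading of the row for instruction [1,0] can be reproduced mechanically.
The Python sketch below assumes that the timeline string starts at cycle 0, so
that the position of each character is its cycle number; ``decode_timeline`` is
a hypothetical helper, not something the tool provides.

.. code-block:: python

  def decode_timeline(row):
      """Map a timeline string to the cycles of its main events.

      Assumes the first character of `row` corresponds to cycle 0.
      """
      return {
          "dispatched": row.index("D"),
          "started_executing": row.index("e"),
          "write_back": row.index("E"),
          "retired": row.index("R"),
          # Each '-' is one cycle spent executed but not yet retired.
          "cycles_waiting_to_retire": row.count("-"),
      }


  events = decode_timeline(".DeeE-----R")
  print(events)

For this row the sketch reports dispatch at cycle 1, start of execution at
cycle 2, write-back at cycle 4, retirement at cycle 10, and 5 cycles spent
waiting to retire.
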
In the example, all instructions are in a RAW (Read After Write) dependency
chain. Register %xmm2 written by vmulps is immediately used by the first
vhaddps, and register %xmm3 written by the first vhaddps is used by the second
vhaddps. Long data dependencies negatively impact the ILP (Instruction Level
Parallelism).

In the dot-product example, there are anti-dependencies introduced by
instructions from different iterations. However, those dependencies can be
removed at the register renaming stage (at the cost of allocating register
aliases, and therefore consuming physical registers).

Table *Average Wait times* helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data dependencies
which may limit the ILP. Note that :program:`llvm-mca`, by default, assumes at
least 1cy between the dispatch event and the issue event.

When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the *ready* state is expected
to be very small when compared with the total number of cycles spent in the
scheduler's queue. The difference between the two counters is a good indicator
of how large an impact data dependencies had on the execution of the
instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.

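Columns [1] and [3] of the *Average Wait times* table can be re-derived from
the timeline strings of the example. The Python sketch below is an
interpretation rather than the tool's implementation: it assumes that the time
spent in a scheduler's queue is the number of ``=`` cycles plus the 1cy that
separates dispatch from issue, and that the time from write-back to retire is
the number of ``-`` cycles.

.. code-block:: python

  # Timeline strings of the three iterations of each instruction, copied from
  # the timeline view of the example (only '=' and '-' are counted).
  timelines = {
      "vmulps": ["DeeER", ".DeeE-----R", ".  DeeE-----R"],
      "vhaddps (first)": ["D==eeeER", ". D=eeeE---R", ".  D====eeeER"],
      "vhaddps (second)": [".D====eeeER", ". D====eeeER", ".   D======eeeER"],
  }


  def average_waits(rows):
      """Return (average cycles in queue, average cycles from WB to retire)."""
      # Assumption: one extra cycle between dispatch and issue for every row.
      in_queue = sum(row.count("=") + 1 for row in rows) / len(rows)
      before_retire = sum(row.count("-") for row in rows) / len(rows)
      return round(in_queue, 1), round(before_retire, 1)


  for name, rows in timelines.items():
      print(name, average_waits(rows))

Under those assumptions the sketch yields 1.0 and 3.3 for vmulps, 3.3 and 1.0
for the first vhaddps, and 5.7 and 0.0 for the second vhaddps, which agrees with
columns [1] and [3] of the table.
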
Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance
counters for the dispatch logic, the reorder buffer, the retire control unit,
and the register file.

Below is an example of ``-all-stats`` output generated by MCA for the
dot-product example discussed in the previous sections.

.. code-block:: none

  Dynamic Dispatch Stall Cycles:
  RAT     - Register unavailable:                      0
  RCU     - Retire tokens unavailable:                 0
  SCHEDQ  - Scheduler full:                            272
  LQ      - Load queue full:                           0
  SQ      - Store queue full:                          0
  GROUP   - Static restrictions on the dispatch group: 0


  Dispatch Logic - number of cycles where we saw N instructions dispatched:
  [# dispatched], [# cycles]
   0,              24  (3.9%)
   1,              272  (44.6%)
   2,              314  (51.5%)


  Schedulers - number of cycles where we saw N instructions issued:
  [# issued], [# cycles]
   0,          7  (1.1%)
   1,          306  (50.2%)
   2,          297  (48.7%)


  Scheduler's queue usage:
  JALU01,  0/20
  JFPU01,  18/18
  JLSAGU,  0/12


  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)


  Register File statistics:
  Total number of mappings created:    900
  Max number of mappings used:         35

  *  Register File #1 -- JFpuPRF:
     Number of physical registers:     72
     Total number of mappings created: 900
     Max number of mappings used:      35

  *  Register File #2 -- JIntegerPRF:
     Number of physical registers:     64
     Total number of mappings created: 0
     Max number of mappings used:      0

If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch
logic is unable to dispatch a group of two instructions because the scheduler's
queue is full.

Looking at the *Dispatch Logic* table, we see that the pipeline was only able
to dispatch two instructions 51.5% of the time. The dispatch group was limited
to one instruction 44.6% of the cycles, which corresponds to 272 cycles. The
dispatch statistics are displayed by using either the command option
``-all-stats`` or ``-dispatch-stats``.

The next table, *Schedulers*, presents a histogram of the number of
instructions issued per cycle. In this case, of the 610 simulated cycles, a
single instruction was issued in 306 cycles (50.2%), and there were 7 cycles in
which no instructions were issued.

The *Scheduler's queue usage* table shows the maximum number of buffer
entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
three schedulers:

* JALU01 - A scheduler for ALU instructions.
* JFPU01 - A scheduler for floating point operations.
* JLSAGU - A scheduler for address generation.

The dot-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds). That explains why only the floating
point scheduler appears to be used.

A full scheduler queue is either caused by data dependency chains or by a
sub-optimal usage of hardware resources. Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources. Schedulers with a small queue are less resilient
to bottlenecks caused by the presence of long data dependencies.
The scheduler statistics are displayed by
using the command option ``-all-stats`` or ``-scheduler-stats``.

The next table, *Retire Control Unit*, presents a histogram of the number of
instructions retired per cycle. In this case, of the 610 simulated cycles, two
instructions were retired during the same cycle 399 times (65.4%), and there
were 109 cycles in which no instructions were retired. The retire statistics
are displayed by using the command option ``-all-stats`` or ``-retire-stats``.

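The three histograms can be cross-checked against the summary numbers of the
report. The Python sketch below uses the counts from the example output; the
helper ``summarize`` is hypothetical. Each histogram should account for all 610
simulated cycles and for all 900 instructions.

.. code-block:: python

  def summarize(histogram):
      """Summarize a histogram mapping N (instructions per cycle) to cycles.

      Returns (total cycles, total instructions, percentage of cycles per N).
      """
      cycles = sum(histogram.values())
      instructions = sum(n * count for n, count in histogram.items())
      shares = {n: round(100 * count / cycles, 1)
                for n, count in histogram.items()}
      return cycles, instructions, shares


  # Counts copied from the -all-stats output of the example.
  dispatched = {0: 24, 1: 272, 2: 314}
  issued = {0: 7, 1: 306, 2: 297}
  retired = {0: 109, 1: 102, 2: 399}

  for name, histogram in [("dispatched", dispatched),
                          ("issued", issued),
                          ("retired", retired)]:
      print(name, summarize(histogram))

In all three cases the totals are 610 cycles and 900 instructions, and the
computed percentages match the ones printed by the tool.
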
The last table presented is *Register File statistics*. Each physical register
file (PRF) used by the pipeline is presented in this table. In the case of AMD
Jaguar, there are two register files, one for floating-point registers
(JFpuPRF) and one for integer registers (JIntegerPRF). The table shows that of
the 900 instructions processed, there were 900 mappings created. Since this
dot-product example utilized only floating point registers, the JFpuPRF was
responsible for creating the 900 mappings. However, we see that the pipeline
only used a maximum of 35 of 72 available register slots at any given time. We
can conclude that the floating point PRF was the only register file used for
the example, and that it was never resource constrained. The register file
statistics are displayed by using the command option ``-all-stats`` or
``-register-file-stats``.

In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.

| 573 | Instruction Flow |
| 574 | ^^^^^^^^^^^^^^^^ |
| 575 | This section describes the instruction flow through MCA's default out-of-order |
| 576 | pipeline, as well as the functional units involved in the process. |
| 577 | |
| 578 | The default pipeline implements the following sequence of stages used to |
| 579 | process instructions. |
| 580 | |
| 581 | * Dispatch (Instruction is dispatched to the schedulers). |
| 582 | * Issue (Instruction is issued to the processor pipelines). |
| 583 | * Write Back (Instruction is executed, and results are written back). |
| 584 | * Retire (Instruction is retired; writes are architecturally committed). |
| 585 | |
| 586 | The default pipeline only models the out-of-order portion of a processor. |
| 587 | Therefore, the instruction fetch and decode stages are not modeled. Performance |
| 588 | bottlenecks in the frontend are not diagnosed. MCA assumes that instructions |
| 589 | have all been decoded and placed into a queue. Also, MCA does not model branch |
| 590 | prediction. |
| 591 | |
| 592 | Instruction Dispatch |
| 593 | """""""""""""""""""" |
During the dispatch stage, instructions are picked in program order from a
queue of already decoded instructions, and dispatched in groups to the
simulated hardware schedulers.

The size of a dispatch group depends on the availability of the simulated
hardware resources. The processor dispatch width defaults to the value
of the ``IssueWidth`` in LLVM's scheduling model.

An instruction can be dispatched if all of the following conditions hold:

* The size of the dispatch group is smaller than the processor's dispatch
  width.
* There are enough entries in the reorder buffer.
* There are enough physical registers to do register renaming.
* The schedulers are not full.
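The dispatch conditions listed above can be summarized as a single predicate.
The following Python sketch is a simplified illustration and does not mirror
the actual :program:`llvm-mca` implementation; ``DispatchState``,
``can_dispatch`` and all field names are hypothetical. It assumes that the
group size is counted in the same unit as the dispatch width, and it models an
unbounded register file with ``None``.

.. code-block:: python

  from dataclasses import dataclass
  from typing import Optional


  @dataclass
  class DispatchState:
      """Hypothetical snapshot of the resources checked at dispatch time."""
      dispatch_width: int            # maximum size of a dispatch group
      group_size: int                # size of the group built so far
      free_rob_entries: int          # unused reorder buffer entries
      free_phys_regs: Optional[int]  # None means an unbounded register file
      free_scheduler_entries: int    # unused scheduler buffer entries


  def can_dispatch(state, rob_entries, regs_needed, scheduler_entries=1):
      """Return True only if every dispatch condition from the list holds."""
      if state.group_size >= state.dispatch_width:
          return False  # the dispatch group is already full
      if rob_entries > state.free_rob_entries:
          return False  # not enough reorder buffer entries
      if (state.free_phys_regs is not None
              and regs_needed > state.free_phys_regs):
          return False  # not enough physical registers for renaming
      if scheduler_entries > state.free_scheduler_entries:
          return False  # the scheduler buffers are full
      return True

For example, with ``DispatchState(dispatch_width=2, group_size=0,
free_rob_entries=1, free_phys_regs=None, free_scheduler_entries=4)``, an
instruction that needs two reorder buffer entries is not dispatched, while one
that needs a single entry is.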

Scheduling models can optionally specify which register files are available on
the processor. MCA uses that information to initialize register file
descriptors. Users can limit the number of physical registers that are
globally available for register renaming by using the command option
``-register-file-size``. A value of zero for this option means *unbounded*.
By knowing how many registers are available for renaming, MCA can predict
dispatch stalls caused by the lack of registers.

The number of reorder buffer entries consumed by an instruction depends on the
number of micro-opcodes specified for that instruction by the target scheduling
model. The purpose of MCA's reorder buffer is to track the progress of
instructions that are "in-flight," and to retire instructions in program order.
The number of entries in the reorder buffer defaults to the
``MicroOpBufferSize`` provided by the target scheduling model.
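The bookkeeping performed by the reorder buffer can be sketched as follows.
This Python model is an illustration only: the class ``ReorderBuffer`` and its
methods are hypothetical names, not :program:`llvm-mca` APIs. It assumes that
each instruction reserves one entry per micro-op at dispatch, and that entries
are released strictly in program order once the instruction has executed.

.. code-block:: python

  from collections import deque


  class ReorderBuffer:
      """Toy reorder buffer; entries are consumed per micro-op."""

      def __init__(self, size):
          self.size = size          # stands in for MicroOpBufferSize
          self.used = 0             # entries currently reserved
          self.in_flight = deque()  # [name, micro_ops, executed], in order

      def reserve(self, name, micro_ops):
          """Reserve entries at dispatch; return False to signal a stall."""
          if self.used + micro_ops > self.size:
              return False
          self.in_flight.append([name, micro_ops, False])
          self.used += micro_ops
          return True

      def mark_executed(self, name):
          """Record that an in-flight instruction has finished executing."""
          for entry in self.in_flight:
              if entry[0] == name:
                  entry[2] = True

      def retire(self):
          """Retire executed instructions from the head, in program order."""
          retired = []
          while self.in_flight and self.in_flight[0][2]:
              name, micro_ops, _ = self.in_flight.popleft()
              self.used -= micro_ops
              retired.append(name)
          return retired

With a four-entry buffer holding a one-micro-op instruction ``a`` followed by a
two-micro-op instruction ``b``, marking only ``b`` as executed retires nothing,
because ``a`` is older and still in flight.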

Instructions that are dispatched to the schedulers consume scheduler buffer
entries. :program:`llvm-mca` queries the scheduling model to determine the set
of buffered resources consumed by an instruction. Buffered resources are
treated like scheduler resources.

Instruction Issue
"""""""""""""""""
Each processor scheduler implements a buffer of instructions. An instruction
has to wait in the scheduler's buffer until input register operands become
available. Only at that point does the instruction become eligible for
execution, and it may then be issued (potentially out of order).
Instruction latencies are computed by :program:`llvm-mca` with the help of the
scheduling model.

:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
schedulers. The scheduler is responsible for tracking data dependencies, and
dynamically selecting which processor resources are consumed by instructions.
It delegates the management of processor resource units and resource groups to a
resource manager. The resource manager is responsible for selecting resource
units that are consumed by instructions. For example, if an instruction
consumes one cycle of a resource group, the resource manager selects one of the
available units from the group; by default, the resource manager uses a
round-robin selector to guarantee that resource usage is uniformly distributed
among all units of a group.
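A round-robin selector over the units of a resource group can be sketched as
follows. This is a minimal Python illustration under stated assumptions, not
the resource manager used by :program:`llvm-mca`: the class ``ResourceGroup``,
its fields and the unit names are hypothetical, and each unit is simply
tracked by the cycle at which it becomes free again.

.. code-block:: python

  class ResourceGroup:
      """Toy resource group with round-robin unit selection."""

      def __init__(self, unit_names):
          self.units = list(unit_names)
          self.busy_until = {name: 0 for name in self.units}
          self.next_index = 0  # where the next search starts

      def select(self, now, cycles=1):
          """Reserve a free unit for `cycles` cycles, or return None."""
          count = len(self.units)
          for offset in range(count):
              index = (self.next_index + offset) % count
              unit = self.units[index]
              if self.busy_until[unit] <= now:
                  self.busy_until[unit] = now + cycles
                  # Start the next search after the unit just used, so
                  # that usage is spread across the group.
                  self.next_index = (index + 1) % count
                  return unit
          return None  # every unit in the group is busy this cycle

With a group of two units, two requests in the same cycle are served by
different units, and a third request in that cycle returns ``None`` until a
unit becomes free again.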

:program:`llvm-mca`'s scheduler implements three instruction queues:

* WaitQueue: a queue of instructions whose operands are not ready.
* ReadyQueue: a queue of instructions ready to execute.
* IssuedQueue: a queue of instructions executing.

Depending on the operand availability, instructions that are dispatched to the
scheduler are either placed into the WaitQueue or into the ReadyQueue.

Every cycle, the scheduler checks if instructions can be moved from the
WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
issued to the underlying pipelines. The algorithm prioritizes older instructions
over younger instructions.

Write-Back and Retire Stage
"""""""""""""""""""""""""""
Issued instructions are moved from the ReadyQueue to the IssuedQueue. There,
instructions wait until they reach the write-back stage. At that point, they
get removed from the queue and the retire control unit is notified.
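The movement of instructions between the three queues can be sketched as
follows. This Python model is a rough illustration, not the scheduler
implemented by :program:`llvm-mca`: ``Instr`` and ``ToyScheduler`` are
hypothetical names, a fixed number of issue slots per cycle stands in for the
availability of pipeline resources, and operand readiness is reduced to a
simple counter.

.. code-block:: python

  from dataclasses import dataclass


  @dataclass
  class Instr:
      index: int             # position in program order; lower means older
      pending_operands: int  # input operands that are not yet available
      latency: int = 1       # cycles spent executing once issued


  class ToyScheduler:
      def __init__(self, issue_slots):
          self.issue_slots = issue_slots  # simplification: issues per cycle
          self.wait_queue = []
          self.ready_queue = []
          self.issued_queue = []          # pairs of [instr, cycles_left]

      def dispatch(self, instr):
          """Place a dispatched instruction according to operand readiness."""
          if instr.pending_operands == 0:
              self.ready_queue.append(instr)
          else:
              self.wait_queue.append(instr)

      def cycle(self):
          """Advance one cycle; return (issued, written_back) index lists."""
          # Executing instructions make progress; finished ones leave.
          for slot in self.issued_queue:
              slot[1] -= 1
          done = [s[0].index for s in self.issued_queue if s[1] == 0]
          self.issued_queue = [s for s in self.issued_queue if s[1] > 0]
          # WaitQueue -> ReadyQueue for instructions whose operands are ready.
          self.ready_queue += [i for i in self.wait_queue
                               if i.pending_operands == 0]
          self.wait_queue = [i for i in self.wait_queue
                             if i.pending_operands > 0]
          # ReadyQueue -> IssuedQueue, oldest instructions first.
          self.ready_queue.sort(key=lambda i: i.index)
          issued = self.ready_queue[:self.issue_slots]
          self.ready_queue = self.ready_queue[self.issue_slots:]
          self.issued_queue += [[i, i.latency] for i in issued]
          return [i.index for i in issued], done

In this sketch, an instruction issued in one cycle with a latency of one is
reported as written back by the next call to ``cycle()``.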

When an instruction is executed, the retire control unit flags it as
"ready to retire."

Instructions are retired in program order. The register file is notified of
the retirement so that it can free the physical registers that were allocated
for the instruction during the register renaming stage.

Load/Store Unit and Memory Consistency Model
""""""""""""""""""""""""""""""""""""""""""""
To model the out-of-order execution of memory operations, :program:`llvm-mca`
uses a simulated load/store unit (LSUnit) that models the speculative
execution of loads and stores.

Each load (or store) consumes an entry in the load (or store) queue. Users can
specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
load and store queues respectively. The queues are unbounded by default.

The LSUnit implements a relaxed consistency model for memory loads and stores.
The rules are:

1. A younger load is allowed to pass an older load only if there are no
   intervening stores or barriers between the two loads.
2. A younger load is allowed to pass an older store provided that the load does
   not alias with the store.
3. A younger store is not allowed to pass an older store.
4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias with
store operations (``-noalias=true``). Under this assumption, younger loads are
always allowed to pass older stores. Essentially, the LSUnit does not attempt
to run any alias analysis to predict when loads and stores do not alias with
each other.

Note that, in the case of write-combining memory, rule 3 could be relaxed to
allow reordering of non-aliasing store operations. That being said, at the
moment, there is no way to further relax the memory model (``-noalias`` is the
only option). Essentially, there is no option to specify a different memory
type (e.g., write-back, write-combining, write-through, etc.) and consequently
to weaken, or strengthen, the memory model.

Other limitations are:

* The LSUnit does not know when store-to-load forwarding may occur.
* The LSUnit does not know anything about cache hierarchy and memory types.
* The LSUnit does not know how to identify serializing operations and memory
  fences.

The LSUnit does not attempt to predict if a load or store hits or misses the L1
cache. It only knows if an instruction "MayLoad" and/or "MayStore." For
loads, the scheduling model provides an "optimistic" load-to-use latency (which
usually matches the load-to-use latency for when there is a hit in the L1D).

:program:`llvm-mca` does not know about serializing operations or
memory-barrier-like instructions. The LSUnit conservatively assumes that an
instruction which has both "MayLoad" and unmodeled side effects behaves like a
"soft" load barrier. That is, it serializes loads without forcing a flush of
the load queue. Similarly, instructions that "MayStore" and have unmodeled side
effects are treated like store barriers. A full memory barrier is a "MayLoad"
and "MayStore" instruction with unmodeled side effects. This is inaccurate, but
it is the best that we can do at the moment with the current information
available in LLVM.

A load/store barrier consumes one entry of the load/store queue. A load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a load
barrier. Also, a younger store cannot pass a store barrier. A younger load
has to wait for the memory/load barrier to execute. A load/store barrier is
"executed" when it becomes the oldest entry in the load/store queue(s). That
also means, by construction, all of the older loads/stores have been executed.

In conclusion, the full set of load/store consistency rules is:

#. A store may not pass a previous store.
#. A store may not pass a previous load (regardless of ``-noalias``).
#. A store has to wait until an older store barrier is fully executed.
#. A load may pass a previous load.
#. A load may not pass a previous store unless ``-noalias`` is set.
#. A load has to wait until an older load barrier is fully executed.
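The six rules above can be expressed as a pairwise check between a younger and
an older memory operation. The following Python sketch is an illustration only
and is not the LSUnit implementation: ``MemOp`` and ``may_pass`` are
hypothetical names, and a barrier is modeled, as described in the text, as an
operation that "MayLoad" or "MayStore" and has unmodeled side effects. The
check looks at one pair of operations at a time and ignores queue capacity.

.. code-block:: python

  from dataclasses import dataclass


  @dataclass(frozen=True)
  class MemOp:
      may_load: bool = False
      may_store: bool = False
      has_side_effects: bool = False  # unmodeled side effects

      @property
      def is_load_barrier(self):
          return self.may_load and self.has_side_effects

      @property
      def is_store_barrier(self):
          return self.may_store and self.has_side_effects


  def may_pass(younger, older, noalias=True):
      """Return True if `younger` may execute before `older`."""
      if younger.may_store:
          # A store never passes an older load, store, or store barrier.
          if older.may_load or older.may_store:
              return False
      if younger.may_load:
          # A load waits for an older load barrier.
          if older.is_load_barrier:
              return False
          # A load passes an older store only under the noalias assumption.
          if older.may_store and not noalias:
              return False
      return True

For instance, ``may_pass(MemOp(may_load=True), MemOp(may_store=True))`` is true
with the default ``noalias=True`` and false when ``noalias=False`` is passed,
while a younger store never passes an older load or store.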