llvm-mca - LLVM Machine Code Analyzer
=====================================

.. program:: llvm-mca

SYNOPSIS
--------

:program:`llvm-mca` [*options*] [input]

DESCRIPTION
-----------

:program:`llvm-mca` is a performance analysis tool that uses information
available in LLVM (e.g. scheduling models) to statically measure the performance
of machine code on a specific CPU.

Performance is measured in terms of throughput as well as processor resource
consumption. The tool currently works for processors with an out-of-order
backend, for which there is a scheduling model available in LLVM.

The main goal of this tool is not just to predict the performance of the code
when run on the target, but also to help with diagnosing potential performance
issues.

Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
Per Cycle (IPC), as well as hardware resource pressure. The analysis and
reporting style were inspired by the IACA tool from Intel.

For example, you can compile code with clang, output assembly, and pipe it
directly into :program:`llvm-mca` for analysis:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

Or for Intel syntax:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

Scheduling models are not just used to compute instruction latencies and
throughput, but also to understand what processor resources are available
and how to simulate them.

By design, the quality of the analysis conducted by :program:`llvm-mca` is
inevitably affected by the quality of the scheduling models in LLVM.

If you see that the performance report is not accurate for a processor,
please `file a bug <https://bugs.llvm.org/enter_bug.cgi?product=libraries>`_
against the appropriate backend.

OPTIONS
-------

If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard
input. Otherwise, it will read from the specified filename.

If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output
to standard output if the input is from standard input. If the :option:`-o`
option specifies "``-``", then the output will also be sent to standard output.


.. option:: -help

 Print a summary of command line options.

.. option:: -o <filename>

 Use ``<filename>`` as the output filename. See the summary above for more
 details.

.. option:: -mtriple=<target triple>

 Specify a target triple string.

.. option:: -march=<arch>

 Specify the architecture for which to analyze the code. It defaults to the
 host default target.

.. option:: -mcpu=<cpuname>

 Specify the processor for which to analyze the code. By default, the cpu name
 is autodetected from the host.

.. option:: -output-asm-variant=<variant id>

 Specify the output assembly variant for the report generated by the tool.
 On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly
 format for the code printed by the tool in the analysis report, and a value
 of 1 selects the Intel format.

.. option:: -print-imm-hex

 Prefer hex format for numeric literals in the output assembly printed as part
 of the report.

.. option:: -dispatch=<width>

 Specify a different dispatch width for the processor. The dispatch width
 defaults to field 'IssueWidth' in the processor scheduling model. If width is
 zero, then the default dispatch width is used.

.. option:: -register-file-size=<size>

 Specify the size of the register file. When specified, this flag limits how
 many physical registers are available for register renaming purposes. A value
 of zero for this flag means "unlimited number of physical registers".

.. option:: -iterations=<number of iterations>

 Specify the number of iterations to run. If this flag is set to 0, then the
 tool sets the number of iterations to a default value (i.e. 100).

.. option:: -noalias=<bool>

 If set, the tool assumes that loads and stores don't alias. This is the
 default behavior.

.. option:: -lqueue=<load queue size>

 Specify the size of the load queue in the load/store unit emulated by the tool.
 By default, the tool assumes an unbounded number of entries in the load queue.
 A value of zero for this flag is ignored, and the default load queue size is
 used instead.

.. option:: -squeue=<store queue size>

 Specify the size of the store queue in the load/store unit emulated by the
 tool. By default, the tool assumes an unbounded number of entries in the store
 queue. A value of zero for this flag is ignored, and the default store queue
 size is used instead.

.. option:: -timeline

 Enable the timeline view.

.. option:: -timeline-max-iterations=<iterations>

 Limit the number of iterations to print in the timeline view. By default, the
 timeline view prints information for up to 10 iterations.

.. option:: -timeline-max-cycles=<cycles>

 Limit the number of cycles in the timeline view. By default, the number of
 cycles is set to 80.

.. option:: -resource-pressure

 Enable the resource pressure view. This is enabled by default.

.. option:: -register-file-stats

 Enable register file usage statistics.

.. option:: -dispatch-stats

 Enable extra dispatch statistics. This view collects and analyzes instruction
 dispatch events, as well as static/dynamic dispatch stall events. This view
 is disabled by default.

.. option:: -scheduler-stats

 Enable extra scheduler statistics. This view collects and analyzes instruction
 issue events. This view is disabled by default.

.. option:: -retire-stats

 Enable extra retire control unit statistics. This view is disabled by default.

.. option:: -instruction-info

 Enable the instruction info view. This is enabled by default.

.. option:: -show-encoding

 Enable the printing of instruction encodings within the instruction info view.

.. option:: -all-stats

 Print all hardware statistics. This enables extra statistics related to the
 dispatch logic, the hardware schedulers, the register file(s), and the retire
 control unit. This option is disabled by default.

.. option:: -all-views

 Enable all the views.

.. option:: -instruction-tables

 Prints resource pressure information based on the static information
 available from the processor model. This differs from the resource pressure
 view because it doesn't require that the code is simulated. It instead prints
 the theoretical uniform distribution of resource pressure for every
 instruction in sequence.

.. option:: -bottleneck-analysis

 Print information about bottlenecks that affect the throughput. This analysis
 can be expensive, and it is disabled by default. Bottlenecks are highlighted
 in the summary view.


EXIT STATUS
-----------

:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
---------------------------------------------

:program:`llvm-mca` allows for the optional usage of special code comments to
mark regions of the assembly code to be analyzed. A comment starting with
substring ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment
starting with substring ``LLVM-MCA-END`` marks the end of a code region. For
example:

.. code-block:: none

  # LLVM-MCA-BEGIN
    ...
  # LLVM-MCA-END

If no user-defined region is specified, then :program:`llvm-mca` assumes a
default region which contains every instruction in the input file. Every region
is analyzed in isolation, and the final performance report is the union of all
the reports generated for every code region.

Code regions can have names. For example:

.. code-block:: none

  # LLVM-MCA-BEGIN A simple example
    add %eax, %eax
  # LLVM-MCA-END

The code from the example above defines a region named "A simple example" with a
single instruction in it. Note how the region name doesn't have to be repeated
in the ``LLVM-MCA-END`` directive. In the absence of overlapping regions,
an anonymous ``LLVM-MCA-END`` directive always ends the currently active
user-defined region.

Example of nesting regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
    add %eax, %edx
  # LLVM-MCA-BEGIN bar
    sub %eax, %edx
  # LLVM-MCA-END bar
  # LLVM-MCA-END foo

Example of overlapping regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
    add %eax, %edx
  # LLVM-MCA-BEGIN bar
    sub %eax, %edx
  # LLVM-MCA-END foo
    add %eax, %edx
  # LLVM-MCA-END bar

Note that multiple anonymous regions cannot overlap. Also, overlapping regions
cannot have the same name.

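The marker rules above can be sketched in a few lines of Python. This is only
an illustration of the documented behavior, not the parser used by
:program:`llvm-mca`. The ``collect_regions`` helper and its error messages are
hypothetical, the sketch assumes ``#`` as the comment character, it treats every
non-marker line as an instruction, and it closes any region that is still open
at the end of the input.

.. code-block:: python

  import re

  # Hypothetical helper; it mirrors the documented rules, not the tool's code.
  MARKER = re.compile(r"#\s*LLVM-MCA-(BEGIN|END)\b[ \t]*(.*)")


  def collect_regions(asm_text):
      """Group instruction lines by the marker regions that enclose them."""
      active = {}      # regions that are currently open
      finished = {}    # regions that have been closed
      everything = []  # body of the default region (used when no marker exists)
      saw_marker = False
      for raw in asm_text.splitlines():
          line = raw.strip()
          if not line:
              continue
          match = MARKER.match(line)
          if match is None:
              # An instruction belongs to every region that is open right now.
              everything.append(line)
              for body in active.values():
                  body.append(line)
              continue
          saw_marker = True
          kind, name = match.group(1), match.group(2).strip()
          if kind == "BEGIN":
              if name in active:
                  raise ValueError("overlapping regions need distinct names")
              active[name] = []
          else:
              if not name and len(active) == 1:
                  # An anonymous END closes the only active region.
                  name = next(iter(active))
              if name not in active:
                  raise ValueError("END marker does not match an open region")
              finished[name] = active.pop(name)
      finished.update(active)  # sketch choice: close regions left open
      return finished if saw_marker else {"": everything}


  overlapping = """
  # LLVM-MCA-BEGIN foo
    add %eax, %edx
  # LLVM-MCA-BEGIN bar
    sub %eax, %edx
  # LLVM-MCA-END foo
    add %eax, %edx
  # LLVM-MCA-END bar
  """
  regions = collect_regions(overlapping)

Applied to the overlapping example, the sketch yields two regions, ``foo`` and
``bar``, that both contain the ``sub`` instruction. Input without any marker
falls back to a single unnamed region holding every instruction.
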
There is no support for marking regions from high-level source code, like C or
C++. As a workaround, inline assembly directives may be used:

.. code-block:: c++

  int foo(int a, int b) {
    __asm volatile("# LLVM-MCA-BEGIN foo");
    a += 42;
    __asm volatile("# LLVM-MCA-END");
    a *= b;
    return a;
  }

However, this interferes with optimizations like loop vectorization and may have
an impact on the code generated. This is because the ``__asm`` statements are
seen as real code having important side effects, which limits how the code
around them can be transformed. If users want to make use of inline assembly
to emit markers, then the recommendation is to always verify that the output
assembly is equivalent to the assembly generated in the absence of markers.
The `Clang options to emit optimization reports <https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports>`_
can also help in detecting missed optimizations.

HOW LLVM-MCA WORKS
------------------

:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
into a sequence of MCInst with the help of the existing LLVM target assembly
parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
to generate a performance report.

The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects
a number of execution related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics.

Here is an example of a performance report generated by the tool for a
dot-product of two packed float vectors of four elements. The analysis is
conducted for target x86, cpu btver2. The report below can be produced with
the following command, using the example located at
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

.. code-block:: none

  Iterations:        300
  Instructions:      900
  Total Cycles:      610
  Total uOps:        900

  Dispatch Width:    2
  uOps Per Cycle:    1.48
  IPC:               1.48
  Block RThroughput: 2.0


  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      2     1.00                        vmulps  %xmm0, %xmm1, %xmm2
   1      3     1.00                        vhaddps %xmm2, %xmm2, %xmm3
   1      3     1.00                        vhaddps %xmm3, %xmm3, %xmm4


  Resources:
  [0]   - JALU0
  [1]   - JALU1
  [2]   - JDiv
  [3]   - JFPA
  [4]   - JFPM
  [5]   - JFPU0
  [6]   - JFPU1
  [7]   - JLAGU
  [8]   - JMul
  [9]   - JSAGU
  [10]  - JSTC
  [11]  - JVALU0
  [12]  - JVALU1
  [13]  - JVIMUL


  Resource pressure per iteration:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
   -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

  Resource pressure by instruction:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
   -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps  %xmm0, %xmm1, %xmm2
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps %xmm2, %xmm2, %xmm3
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps %xmm3, %xmm3, %xmm4

According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 simulated instructions. The total number of simulated micro
opcodes (uOps) is also 900.

The report is structured in three main sections. The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. Important performance indicators are
**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).

Field *Dispatch Width* is the maximum number of micro opcodes that are
dispatched to the out-of-order backend every simulated cycle.

IPC is computed by dividing the total number of simulated instructions by the
total number of cycles.

Field *Block RThroughput* is the reciprocal of the block throughput. Block
throughput is a theoretical quantity computed as the maximum number of blocks
(i.e. iterations) that can be executed per simulated clock cycle in the absence
of loop-carried dependencies. Block throughput is bounded from above by the
dispatch rate and by the availability of hardware resources.

In the absence of loop-carried data dependencies, the observed IPC tends to a
theoretical maximum which can be computed by dividing the number of instructions
of a single iteration by the *Block RThroughput*.

Field *uOps Per Cycle* is computed by dividing the total number of simulated
micro opcodes by the total number of cycles. A delta between *Dispatch Width*
and this field is an indicator of a performance issue. In the absence of
loop-carried data dependencies, the observed *uOps Per Cycle* should tend to a
theoretical maximum throughput which can be computed by dividing the number of
uOps of a single iteration by the *Block RThroughput*.

Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
and *uOps Per Cycle* are limited by the amount of hardware parallelism. The
availability of hardware resources affects the resource pressure distribution,
and it limits the number of instructions that can be executed in parallel every
cycle. A delta between *Dispatch Width* and the theoretical maximum uOps per
cycle (computed by dividing the number of uOps of a single iteration by the
*Block RThroughput*) is an indicator of a performance bottleneck caused by the
lack of hardware resources.
In general, the lower the Block RThroughput, the better.

In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
resources, and the *Resource pressure view* can help to identify the problematic
resource usage.

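The arithmetic behind these summary fields can be checked directly from the
numbers in the report. The following Python sketch only restates the formulas
described above; the variable names are illustrative and are not part of the
tool.

.. code-block:: python

  # Values copied from the summary section of the dot-product report.
  iterations = 300
  instructions = 900
  total_uops = 900
  total_cycles = 610
  dispatch_width = 2
  block_rthroughput = 2.0

  ipc = instructions / total_cycles            # observed IPC
  uops_per_cycle = total_uops / total_cycles   # observed uOps Per Cycle

  # Theoretical maximum: uOps of one iteration divided by Block RThroughput.
  uops_per_iteration = total_uops / iterations
  max_uops_per_cycle = uops_per_iteration / block_rthroughput

  # Gap between the dispatch width and what the resources can sustain.
  resource_gap = dispatch_width - max_uops_per_cycle

  print(f"IPC: {ipc:.2f}, uOps Per Cycle: {uops_per_cycle:.2f}")
  print(f"max uOps per cycle: {max_uops_per_cycle:.2f}, gap: {resource_gap:.2f}")

Both observed values round to 1.48, slightly below the theoretical maximum of
1.50, and the gap of 0.50 with respect to the dispatch width is the delta
discussed above.
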
The second section of the report is the *instruction info view*. It shows the
latency and reciprocal throughput of every instruction in the sequence. It also
reports extra information related to the number of micro opcodes, and opcode
properties (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
is computed as the maximum number of instructions of the same type that can be
executed per clock cycle in the absence of operand dependencies. In this
example, the reciprocal throughput of a vector float multiply is 1
cycle/instruction. That is because the FP multiplier JFPM is only available
from pipeline JFPU1.

Instruction encodings are displayed within the instruction info view when flag
``-show-encoding`` is specified.

Below is an example of ``-show-encoding`` output for the dot-product kernel:

.. code-block:: none

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)
  [7]: Encoding Size

  [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:     Instructions:
   1      2     1.00                         4     c5 f0 59 d0    vmulps  %xmm0, %xmm1, %xmm2
   1      4     1.00                         4     c5 eb 7c da    vhaddps %xmm2, %xmm2, %xmm3
   1      4     1.00                         4     c5 e3 7c e3    vhaddps %xmm3, %xmm3, %xmm4

The *Encoding Size* column shows the size in bytes of instructions. The
*Encodings* column shows the actual instruction encodings (byte sequences in
hex).

The third section is the *Resource pressure view*. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
structured in two tables. The first table reports the number of resource cycles
spent on average every iteration. The second table correlates the resource
cycles to the machine instruction in the sequence. For example, every iteration
of the instruction vmulps always executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
per iteration. Note that on AMD Jaguar, vector floating-point multiply can
only be issued to pipeline JFPU1, while horizontal floating-point additions can
only be issued to pipeline JFPU0.

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided. Ideally,
pressure should be uniformly distributed between multiple resources.

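One way to relate the resource pressure table to *Block RThroughput* is to take
the larger of two lower bounds on the cycles needed per iteration: the uOps of
one iteration divided by the dispatch width, and the cycles consumed on the
busiest resource. The Python sketch below is a simplification under stated
assumptions, not the tool's implementation: the function name is hypothetical,
each listed resource is assumed to have a single unit, and the per-iteration
pressure from the report is assumed to match the static resource usage.

.. code-block:: python

  def estimate_block_rthroughput(uops_per_iteration, dispatch_width, pressure):
      """Return the larger of the dispatch bound and the resource bound."""
      dispatch_bound = uops_per_iteration / dispatch_width
      resource_bound = max(pressure.values(), default=0.0)
      return max(dispatch_bound, resource_bound)


  # Non-zero entries of "Resource pressure per iteration" in the report above.
  pressure = {"JFPA": 2.00, "JFPM": 1.00, "JFPU0": 2.00, "JFPU1": 1.00}

  estimate = estimate_block_rthroughput(3, 2, pressure)
  print(estimate)

For the dot-product kernel the dispatch bound is 1.5 and the resource bound is
2.0 (JFPA and JFPU0 are each busy for two cycles per iteration), so the estimate
agrees with the reported Block RThroughput of 2.0.
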
Timeline View
^^^^^^^^^^^^^

The timeline view produces a detailed report of each instruction's state
transitions through an instruction pipeline. This view is enabled by the
command line option ``-timeline``. As instructions transition through the
various stages of the pipeline, their states are depicted in the view report.
These states are represented by the following characters:

* D : Instruction dispatched.
* e : Instruction executing.
* E : Instruction executed.
* R : Instruction retired.
* = : Instruction already dispatched, waiting to be executed.
* \- : Instruction executed, waiting to be retired.

Below is the timeline view for a subset of the dot-product example located in
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
:program:`llvm-mca` using the following command:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

.. code-block:: none

  Timeline view:
                      012345
  Index     0123456789

  [0,0]     DeeER.    .    .   vmulps  %xmm0, %xmm1, %xmm2
  [0,1]     D==eeeER  .    .   vhaddps %xmm2, %xmm2, %xmm3
  [0,2]     .D====eeeER    .   vhaddps %xmm3, %xmm3, %xmm4
  [1,0]     .DeeE-----R    .   vmulps  %xmm0, %xmm1, %xmm2
  [1,1]     . D=eeeE---R   .   vhaddps %xmm2, %xmm2, %xmm3
  [1,2]     . D====eeeER   .   vhaddps %xmm3, %xmm3, %xmm4
  [2,0]     .  DeeE-----R  .   vmulps  %xmm0, %xmm1, %xmm2
  [2,1]     .  D====eeeER  .   vhaddps %xmm2, %xmm2, %xmm3
  [2,2]     .   D======eeeER   vhaddps %xmm3, %xmm3, %xmm4


  Average Wait times (based on the timeline view):
  [0]: Executions
  [1]: Average time spent waiting in a scheduler's queue
  [2]: Average time spent waiting in a scheduler's queue while ready
  [3]: Average time elapsed from WB until retire stage

        [0]    [1]    [2]    [3]
  0.     3     1.0    1.0    3.3       vmulps  %xmm0, %xmm1, %xmm2
  1.     3     3.3    0.7    1.0       vhaddps %xmm2, %xmm2, %xmm3
  2.     3     5.7    0.0    0.0       vhaddps %xmm3, %xmm3, %xmm4

The timeline view is interesting because it shows instruction state changes
during execution. It also gives an idea of how the tool processes instructions
executed on the target, and how their timing information might be calculated.

The timeline view is structured in two tables. The first table shows
instructions changing state over time (measured in cycles); the second table
(named *Average Wait times*) reports useful timing statistics, which should
help diagnose performance bottlenecks caused by long data dependencies and
sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices, where
the first index identifies an iteration, and the second index is the
instruction index (i.e., where it appears in the code sequence). Since this
example was generated using 3 iterations (``-iterations=3``), the iteration
indices range from 0 to 2 inclusive.

Excluding the first and last column, the remaining columns are in cycles.
Cycles are numbered sequentially starting from 0.

From the example output above, we know the following:

* Instruction [1,0] was dispatched at cycle 1.
* Instruction [1,0] started executing at cycle 2.
* Instruction [1,0] reached the write back stage at cycle 4.
* Instruction [1,0] was retired at cycle 10.

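The four events listed above can be read off a timeline row mechanically,
because the position of each character is the cycle number. The Python sketch
below illustrates this. The ``decode_row`` helper is hypothetical, it takes only
the cycle columns of a row, and it assumes that the row contains all of the
``D``, ``e``, ``E`` and ``R`` characters.

.. code-block:: python

  def decode_row(cycles):
      """Map the cycle columns of one timeline row to event cycle numbers."""
      return {
          "dispatched": cycles.index("D"),
          "started_executing": cycles.index("e"),  # first 'e' in the row
          "write_back": cycles.index("E"),
          "retired": cycles.index("R"),
      }


  # Cycle columns of instruction [1,0] (vmulps from iteration #1).
  events = decode_row(".DeeE-----R    .")
  print(events)

For instruction [1,0] this gives cycles 1, 2, 4 and 10, matching the list
above.
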
| 553 | Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the |
| 554 | scheduler's queue for the operands to become available. By the time vmulps is |
| 555 | dispatched, operands are already available, and pipeline JFPU1 is ready to |
| 556 | serve another instruction. So the instruction can be immediately issued on the |
| 557 | JFPU1 pipeline. That is demonstrated by the fact that the instruction only |
| 558 | spent 1cy in the scheduler's queue. |
| 559 | |
| 560 | There is a gap of 5 cycles between the write-back stage and the retire event. |
| 561 | That is because instructions must retire in program order, so [1,0] has to wait |
| 562 | for [0,2] to be retired first (i.e., it has to wait until cycle 10). |
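
The reading of a timeline row described above can be sketched in Python. The
row string below is not copied from the tool output; it is rebuilt from the
cycle numbers listed for instruction [1,0]. ``parse_timeline_row`` is a
hypothetical helper, and the sketch assumes the usual timeline legend (``D``
dispatched, ``e`` executing, ``E`` executed, ``R`` retired, ``-`` executed and
waiting to retire).

.. code-block:: python

  def parse_timeline_row(row):
      """Map a timeline row to the cycle of each pipeline event.

      Each character position is one cycle, starting from cycle 0.
      """
      events = {
          "dispatched": row.index("D"),
          "executed": row.index("E"),  # write-back stage
          "retired": row.index("R"),
      }
      # The first 'e' marks the start of execution. An instruction without
      # any 'e' starts executing on the cycle marked 'E'.
      first_e = row.find("e")
      events["issued"] = first_e if first_e != -1 else events["executed"]
      return events


  # Hypothetical row for instruction [1,0], rebuilt from the cycles above.
  row = ".DeeE-----R"
  events = parse_timeline_row(row)
  cycles_waiting_to_retire = row.count("-")
  print(events, cycles_waiting_to_retire)

The five ``-`` characters correspond to the 5-cycle gap between write-back and
retirement.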

In the example, all instructions are in a RAW (Read After Write) dependency
chain. Register %xmm2 written by vmulps is immediately used by the first
vhaddps, and register %xmm3 written by the first vhaddps is used by the second
vhaddps. Long data dependencies negatively impact the ILP (Instruction Level
Parallelism).

In the dot-product example, there are anti-dependencies introduced by
instructions from different iterations. However, those dependencies can be
removed at the register renaming stage (at the cost of allocating register
aliases, and therefore consuming physical registers).

Table *Average Wait times* helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data dependencies
which may limit the ILP. Note that :program:`llvm-mca`, by default, assumes at
least 1cy between the dispatch event and the issue event.

When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the *ready* state is expected
to be very small when compared with the total number of cycles spent in the
scheduler's queue. The difference between the two counters is a good indicator
of how large an impact data dependencies had on the execution of the
instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.
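
The rule of thumb above can be written as a rough Python sketch. The function
name, the 3-cycle cut-off (taken from the upper end of the "1-3cy" range), and
the comparison used to decide between the two causes are all assumptions made
for illustration; :program:`llvm-mca` does not report such a classification.

.. code-block:: python

  def classify_wait(cycles_in_queue, cycles_ready, long_wait=3):
      """Guess what kept an instruction in the scheduler's queue.

      cycles_in_queue: average cycles spent in the scheduler's queue.
      cycles_ready: average cycles spent in the ready state.
      long_wait: hypothetical cut-off (in cycles) for a "long" wait.
      """
      if cycles_in_queue <= long_wait:
          return "no significant wait"
      # Cycles not spent in the ready state were spent waiting for operands.
      waiting_on_operands = cycles_in_queue - cycles_ready
      if waiting_on_operands > cycles_ready:
          return "data dependencies"
      return "resource pressure"


  print(classify_wait(10, 1))  # mostly waiting for operands
  print(classify_wait(10, 9))  # ready for most of the time, but not issued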

Bottleneck Analysis
^^^^^^^^^^^^^^^^^^^
The ``-bottleneck-analysis`` command line option enables the analysis of
performance bottlenecks.

This analysis is potentially expensive. It attempts to correlate increases in
backend pressure (caused by pipeline resource pressure and data dependencies) to
dynamic dispatch stalls.

Below is an example of ``-bottleneck-analysis`` output generated by
:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.

.. code-block:: none

  Cycles with backend pressure increase [ 48.07% ]
  Throughput Bottlenecks:
    Resource Pressure       [ 47.77% ]
    - JFPA  [ 47.77% ]
    - JFPU0  [ 47.77% ]
    Data Dependencies:      [ 0.30% ]
    - Register Dependencies [ 0.30% ]
    - Memory Dependencies   [ 0.00% ]

  Critical sequence based on the simulation:

                Instruction                         Dependency Information
   +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
   |
   |    < loop carried >
   |
   |      0.    vmulps  %xmm0, %xmm1, %xmm2
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
   +----> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
   |
   |    < loop carried >
   |
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]


According to the analysis, throughput is limited by resource pressure and not by
data dependencies. The analysis observed increases in backend pressure during
48.07% of the simulated run. Almost all of those pressure increase events were
caused by contention on processor resources JFPA/JFPU0.
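
As a quick check of the "almost all" claim, the percentages from the report can
be combined as follows. This is only arithmetic on the numbers printed above;
the variable names are chosen here for illustration.

.. code-block:: python

  # Percentages copied from the report above.
  pressure_cycles = 48.07    # cycles with a backend pressure increase
  resource_pressure = 47.77  # attributed to pipeline resources (JFPA/JFPU0)
  data_dependencies = 0.30   # attributed to register and memory dependencies

  # The two causes account for all of the pressure increase cycles.
  assert abs(resource_pressure + data_dependencies - pressure_cycles) < 0.005

  resource_share = resource_pressure / pressure_cycles
  print(f"{resource_share:.1%} of pressure increases come from resources")

Under these numbers, resource contention accounts for roughly 99.4% of the
cycles in which backend pressure increased.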

The *critical sequence* is the most expensive sequence of instructions according
to the simulation. It is annotated to provide extra information about critical
register dependencies and resource interferences between instructions.

Instructions from the critical sequence are expected to significantly impact
performance. By construction, the accuracy of this analysis is strongly
dependent on the simulation and (as always) on the quality of the processor
model in LLVM.


Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance
counters for the dispatch logic, the reorder buffer, the retire control unit,
and the register file.

Below is an example of ``-all-stats`` output generated by :program:`llvm-mca`
for 300 iterations of the dot-product example discussed in the previous
sections.

.. code-block:: none

  Dynamic Dispatch Stall Cycles:
  RAT     - Register unavailable:                      0
  RCU     - Retire tokens unavailable:                 0
  SCHEDQ  - Scheduler full:                            272  (44.6%)
  LQ      - Load queue full:                           0
  SQ      - Store queue full:                          0
  GROUP   - Static restrictions on the dispatch group: 0


  Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
  [# dispatched], [# cycles]
   0,              24  (3.9%)
   1,              272  (44.6%)
   2,              314  (51.5%)


  Schedulers - number of cycles where we saw N micro opcodes issued:
  [# issued], [# cycles]
   0,          7  (1.1%)
   1,          306  (50.2%)
   2,          297  (48.7%)

  Scheduler's queue usage:
  [1] Resource name.
  [2] Average number of used buffer entries.
  [3] Maximum number of used buffer entries.
  [4] Total number of buffer entries.

   [1]            [2]        [3]        [4]
  JALU01           0          0          20
  JFPU01           17         18         18
  JLSAGU           0          0          12


  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)

  Total ROB Entries:                64
  Max Used ROB Entries:             35  ( 54.7% )
  Average Used ROB Entries per cy:  32  ( 50.0% )


  Register File statistics:
  Total number of mappings created:    900
  Max number of mappings used:         35

  *  Register File #1 -- JFpuPRF:
     Number of physical registers:     72
     Total number of mappings created: 900
     Max number of mappings used:      35

  *  Register File #2 -- JIntegerPRF:
     Number of physical registers:     64
     Total number of mappings created: 0
     Max number of mappings used:      0

If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch
logic is unable to dispatch a full group because the scheduler's queue is full.

Looking at the *Dispatch Logic* table, we see that the pipeline was only able to
dispatch two micro opcodes 51.5% of the time. The dispatch group was limited to
one micro opcode 44.6% of the cycles, which corresponds to 272 cycles. The
dispatch statistics are displayed by using either the command option
``-all-stats`` or ``-dispatch-stats``.
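
The percentages in the *Dispatch Logic* table can be recomputed from the raw
cycle counts. The sketch below is plain arithmetic on the numbers shown in the
report; ``summarize`` is a hypothetical helper, not something the tool
provides.

.. code-block:: python

  def summarize(histogram):
      """Return (total cycles, total micro opcodes, average per cycle).

      histogram maps N (micro opcodes dispatched in a cycle) to the number
      of cycles in which exactly N micro opcodes were dispatched.
      """
      total_cycles = sum(histogram.values())
      total_uops = sum(n * cycles for n, cycles in histogram.items())
      return total_cycles, total_uops, total_uops / total_cycles


  # Values copied from the Dispatch Logic table above.
  dispatch = {0: 24, 1: 272, 2: 314}
  total_cycles, total_uops, per_cycle = summarize(dispatch)

  for n, cycles in dispatch.items():
      print(f"{n}, {cycles} ({cycles / total_cycles:.1%})")
  print(total_cycles, total_uops, round(per_cycle, 2))

The counts add up to 610 simulated cycles and 900 dispatched micro opcodes,
which is consistent with 300 iterations of a three-instruction kernel if each
instruction is a single micro opcode. The same helper applies to the
*Schedulers* and *Retire Control Unit* histograms.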

The next table, *Schedulers*, is a histogram of the number of micro opcodes
issued per cycle. In this case, of the 610 simulated cycles, a single micro
opcode was issued in 306 cycles (50.2%) and there were 7 cycles where no micro
opcodes were issued.

The *Scheduler's queue usage* table shows the average and maximum number of
buffer entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
three schedulers:

* JALU01 - A scheduler for ALU instructions.
* JFPU01 - A scheduler for floating point operations.
* JLSAGU - A scheduler for address generation.

The dot-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds). That explains why only the floating
point scheduler appears to be used.

A full scheduler queue is either caused by data dependency chains or by a
sub-optimal usage of hardware resources. Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources. Schedulers with a small queue are less resilient
to bottlenecks caused by the presence of long data dependencies. The scheduler
statistics are displayed by using the command option ``-all-stats`` or
``-scheduler-stats``.

The next table, *Retire Control Unit*, is a histogram of the number of
instructions retired per cycle. In this case, of the 610 simulated cycles, two
instructions were retired during the same cycle 399 times (65.4%) and there
were 109 cycles where no instructions were retired. The retire statistics are
displayed by using the command option ``-all-stats`` or ``-retire-stats``.

The last table presented is *Register File statistics*. Each physical register
file (PRF) used by the pipeline is presented in this table. In the case of AMD
Jaguar, there are two register files, one for floating-point registers (JFpuPRF)
and one for integer registers (JIntegerPRF). The table shows that of the 900
instructions processed, there were 900 mappings created. Since this dot-product
example utilized only floating point registers, the JFpuPRF was responsible for
creating the 900 mappings. However, we see that the pipeline only used a
maximum of 35 of 72 available register slots at any given time. We can conclude
that the floating point PRF was the only register file used for the example, and
that it was never resource constrained. The register file statistics are
displayed by using the command option ``-all-stats`` or
``-register-file-stats``.

In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.

Instruction Flow
^^^^^^^^^^^^^^^^
This section describes the instruction flow through the default pipeline of
:program:`llvm-mca`, as well as the functional units involved in the process.

The default pipeline implements the following sequence of stages used to
process instructions.

* Dispatch (Instruction is dispatched to the schedulers).
* Issue (Instruction is issued to the processor pipelines).
* Write Back (Instruction is executed, and results are written back).
* Retire (Instruction is retired; writes are architecturally committed).

The default pipeline only models the out-of-order portion of a processor.
Therefore, the instruction fetch and decode stages are not modeled. Performance
bottlenecks in the frontend are not diagnosed. :program:`llvm-mca` assumes that
instructions have all been decoded and placed into a queue before the simulation
starts. Also, :program:`llvm-mca` does not model branch prediction.

Instruction Dispatch
""""""""""""""""""""
During the dispatch stage, instructions are picked in program order from a
queue of already decoded instructions, and dispatched in groups to the
simulated hardware schedulers.

The size of a dispatch group depends on the availability of the simulated
hardware resources. The processor dispatch width defaults to the value
of the ``IssueWidth`` in LLVM's scheduling model.

An instruction can be dispatched if:

* The size of the dispatch group is smaller than the processor's dispatch width.
* There are enough entries in the reorder buffer.
* There are enough physical registers to do register renaming.
* The schedulers are not full.
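
The four conditions above can be sketched as a single predicate. The class and
function below are hypothetical and only mirror the list; they are not
:program:`llvm-mca` data structures. The sketch assumes that an instruction
needs one reorder buffer entry per micro opcode, one physical register per
register it writes, and one scheduler entry.

.. code-block:: python

  from dataclasses import dataclass


  @dataclass
  class DispatchState:
      """Hypothetical snapshot of the resources checked at dispatch."""
      dispatch_width: int           # maximum size of a dispatch group
      group_size: int               # size of the current dispatch group
      free_rob_entries: int
      free_physical_registers: int
      free_scheduler_entries: int


  def can_dispatch(state, num_uops, num_register_writes):
      """Return True if every dispatch condition listed above holds."""
      return (
          state.group_size < state.dispatch_width
          and state.free_rob_entries >= num_uops
          and state.free_physical_registers >= num_register_writes
          and state.free_scheduler_entries >= 1
      )


  # Sizes taken from the btver2 statistics shown earlier in this document.
  idle = DispatchState(2, 0, 64, 72, 18)
  scheduler_full = DispatchState(2, 0, 64, 72, 0)
  print(can_dispatch(idle, 1, 1), can_dispatch(scheduler_full, 1, 1))

The second state corresponds to the ``SCHEDQ`` stall reported by the dispatch
statistics: everything else is available, but the scheduler has no free entry.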

Scheduling models can optionally specify which register files are available on
the processor. :program:`llvm-mca` uses that information to initialize register
file descriptors. Users can limit the number of physical registers that are
globally available for register renaming by using the command option
``-register-file-size``. A value of zero for this option means *unbounded*. By
knowing how many registers are available for renaming, the tool can predict
dispatch stalls caused by the lack of physical registers.

The number of reorder buffer entries consumed by an instruction depends on the
number of micro-opcodes specified for that instruction by the target scheduling
model. The reorder buffer is responsible for tracking the progress of
instructions that are "in-flight", and retiring them in program order. The
number of entries in the reorder buffer defaults to the value specified by field
``MicroOpBufferSize`` in the target scheduling model.

Instructions that are dispatched to the schedulers consume scheduler buffer
entries. :program:`llvm-mca` queries the scheduling model to determine the set
of buffered resources consumed by an instruction. Buffered resources are
treated like scheduler resources.

Instruction Issue
"""""""""""""""""
Each processor scheduler implements a buffer of instructions. An instruction
has to wait in the scheduler's buffer until input register operands become
available. Only at that point does the instruction become eligible for
execution, and it may then be issued (potentially out-of-order).
Instruction latencies are computed by :program:`llvm-mca` with the help of the
scheduling model.

:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
schedulers. The scheduler is responsible for tracking data dependencies, and
dynamically selecting which processor resources are consumed by instructions.
It delegates the management of processor resource units and resource groups to a
resource manager. The resource manager is responsible for selecting resource
units that are consumed by instructions. For example, if an instruction
consumes 1cy of a resource group, the resource manager selects one of the
available units from the group; by default, the resource manager uses a
round-robin selector to guarantee that resource usage is uniformly distributed
between all units of a group.
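
A round-robin selector of this kind can be sketched as follows. This is a
simplified illustration and not the resource manager used by
:program:`llvm-mca`; the class name and its fields are hypothetical, and the
example assumes a group made of the two units JFPU0 and JFPU1.

.. code-block:: python

  class RoundRobinGroup:
      """Hypothetical resource group that hands out its units in turn."""

      def __init__(self, units):
          self.units = list(units)
          self.busy = set()
          self.next_index = 0

      def select(self):
          """Return the next available unit, or None if all are busy."""
          for offset in range(len(self.units)):
              index = (self.next_index + offset) % len(self.units)
              unit = self.units[index]
              if unit not in self.busy:
                  # Resume the search after the unit that was just picked.
                  self.next_index = (index + 1) % len(self.units)
                  return unit
          return None


  group = RoundRobinGroup(["JFPU0", "JFPU1"])
  picks = [group.select() for _ in range(4)]
  print(picks)

When every unit is available, consecutive selections alternate between the
units, so usage is spread evenly. A busy unit is skipped in favor of the next
available one.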

:program:`llvm-mca`'s scheduler internally groups instructions into three sets:

* WaitSet: a set of instructions whose operands are not ready.
* ReadySet: a set of instructions ready to execute.
* IssuedSet: a set of instructions executing.

Depending on operand availability, instructions that are dispatched to the
scheduler are either placed into the WaitSet or into the ReadySet.

Every cycle, the scheduler checks if instructions can be moved from the WaitSet
to the ReadySet, and if instructions from the ReadySet can be issued to the
underlying pipelines. The algorithm prioritizes older instructions over younger
instructions.
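
The movement of instructions between the three sets can be illustrated with a
small Python sketch. It is a deliberately simplified model, not the scheduler
implemented by :program:`llvm-mca`: latencies are ignored, and a hypothetical
``issue_width`` parameter stands in for the availability of the underlying
pipelines.

.. code-block:: python

  class SimpleScheduler:
      """Hypothetical three-set scheduler, not llvm-mca's implementation.

      An instruction is an (index, deps) pair, where deps lists the indices
      of the instructions that produce its operands.
      """

      def __init__(self, issue_width):
          self.issue_width = issue_width
          self.wait_set = []      # operands not ready
          self.ready_set = []     # ready to execute
          self.issued_set = []    # executing
          self.completed = set()  # indices whose results are available

      def dispatch(self, index, deps):
          """Place a dispatched instruction into the WaitSet or ReadySet."""
          ready = set(deps) <= self.completed
          target = self.ready_set if ready else self.wait_set
          target.append((index, deps))

      def write_back(self, index):
          """Make the result of an executing instruction available."""
          self.issued_set = [i for i in self.issued_set if i[0] != index]
          self.completed.add(index)

      def cycle(self):
          """Promote instructions whose operands are ready, then issue."""
          promoted = [i for i in self.wait_set if set(i[1]) <= self.completed]
          self.wait_set = [i for i in self.wait_set if i not in promoted]
          self.ready_set.extend(promoted)
          # Older instructions (lower index) are issued first.
          self.ready_set.sort(key=lambda instr: instr[0])
          issued = self.ready_set[:self.issue_width]
          self.ready_set = self.ready_set[self.issue_width:]
          self.issued_set.extend(issued)
          return [index for index, _ in issued]


  # The dot-product chain: vmulps (0) feeds vhaddps (1), which feeds vhaddps (2).
  scheduler = SimpleScheduler(issue_width=2)
  scheduler.dispatch(0, ())
  scheduler.dispatch(1, (0,))
  scheduler.dispatch(2, (1,))
  first = scheduler.cycle()
  scheduler.write_back(0)
  second = scheduler.cycle()
  print(first, second)

Even with an issue width of two, only one instruction is issued per call to
``cycle`` here, because each instruction waits for the result of the previous
one.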

Write-Back and Retire Stage
"""""""""""""""""""""""""""
Issued instructions are moved from the ReadySet to the IssuedSet. There,
instructions wait until they reach the write-back stage. At that point, they
get removed from the queue and the retire control unit is notified.

When an instruction is executed, the retire control unit flags it as
"ready to retire."

Instructions are retired in program order. The register file is notified of the
retirement so that it can free the physical registers that were allocated for
the instruction during the register renaming stage.
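
In-order retirement can be sketched with a queue that only releases its oldest
entry. The function below is a hypothetical illustration; the retire width of
two is an assumption based on the retire statistics shown earlier, and the
release of physical registers is not modeled.

.. code-block:: python

  from collections import deque


  def retire(rob, executed, retire_width):
      """Retire up to retire_width instructions from the head of the ROB.

      rob: deque of instruction ids in program order (oldest first).
      executed: set of ids flagged as "ready to retire".
      """
      retired = []
      while rob and len(retired) < retire_width and rob[0] in executed:
          retired.append(rob.popleft())
      return retired


  # Instruction (1, 0) has executed, but the older (0, 2) has not.
  rob = deque([(0, 2), (1, 0)])
  executed = {(1, 0)}
  blocked = retire(rob, executed, retire_width=2)
  executed.add((0, 2))
  unblocked = retire(rob, executed, retire_width=2)
  print(blocked, unblocked)

This mirrors the timeline example, where instruction [1,0] finishes executing
early but cannot retire until [0,2] is ready to retire.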

Load/Store Unit and Memory Consistency Model
""""""""""""""""""""""""""""""""""""""""""""
To simulate the out-of-order execution of memory operations, :program:`llvm-mca`
uses a load/store unit (LSUnit) that models the speculative execution of loads
and stores.

Each load (or store) consumes an entry in the load (or store) queue. Users can
specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
load and store queues respectively. The queues are unbounded by default.

The LSUnit implements a relaxed consistency model for memory loads and stores.
The rules are:

1. A younger load is allowed to pass an older load only if there are no
   intervening stores or barriers between the two loads.
2. A younger load is allowed to pass an older store provided that the load does
   not alias with the store.
3. A younger store is not allowed to pass an older store.
4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias
(``-noalias=true``) store operations. Under this assumption, younger loads are
always allowed to pass older stores. Essentially, the LSUnit does not attempt
to run any alias analysis to predict when loads and stores do not alias with
each other.

Note that, in the case of write-combining memory, rule 3 could be relaxed to
allow reordering of non-aliasing store operations. That being said, at the
moment, there is no way to further relax the memory model (``-noalias`` is the
only option). Essentially, there is no option to specify a different memory
type (e.g., write-back, write-combining, write-through, etc.) and consequently
to weaken, or strengthen, the memory model.

Other limitations are:

* The LSUnit does not know when store-to-load forwarding may occur.
* The LSUnit does not know anything about cache hierarchy and memory types.
* The LSUnit does not know how to identify serializing operations and memory
  fences.

The LSUnit does not attempt to predict if a load or store hits or misses the L1
cache. It only knows if an instruction "MayLoad" and/or "MayStore." For
loads, the scheduling model provides an "optimistic" load-to-use latency (which
usually matches the load-to-use latency for when there is a hit in the L1D).

:program:`llvm-mca` does not know about serializing operations or
memory-barrier-like instructions. The LSUnit conservatively assumes that an
instruction which has both "MayLoad" and unmodeled side effects behaves like a
"soft" load-barrier. That means it serializes loads without forcing a flush of
the load queue. Similarly, instructions that "MayStore" and have unmodeled side
effects are treated like store barriers. A full memory barrier is a "MayLoad"
and "MayStore" instruction with unmodeled side effects. This is inaccurate, but
it is the best that we can do at the moment with the current information
available in LLVM.

A load/store barrier consumes one entry of the load/store queue. A load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a load
barrier. Also, a younger store cannot pass a store barrier. A younger load
has to wait for the memory/load barrier to execute. A load/store barrier is
"executed" when it becomes the oldest entry in the load/store queue(s). That
also means, by construction, all of the older loads/stores have been executed.

In conclusion, the full set of load/store consistency rules is:

#. A store may not pass a previous store.
#. A store may not pass a previous load (regardless of ``-noalias``).
#. A store has to wait until an older store barrier is fully executed.
#. A load may pass a previous load.
#. A load may not pass a previous store unless ``-noalias`` is set.
#. A load has to wait until an older load barrier is fully executed.
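
The six rules can be condensed into a small predicate, shown here as a sketch.
``may_pass`` and its string labels are hypothetical names chosen for
illustration and are not part of the LSUnit. Combinations that the list does
not mention (for example, a load and an older store barrier) are allowed by
this sketch, which is an assumption.

.. code-block:: python

  def may_pass(younger, older, noalias=True):
      """Return True if the younger operation may execute before the older.

      younger is "load" or "store"; older is "load", "store",
      "load_barrier" or "store_barrier".
      """
      if younger == "store":
          # Rules 1-3: a store never passes an older store, load or
          # store barrier.
          return older not in ("store", "load", "store_barrier")
      if younger == "load":
          if older == "store":
              return noalias              # rule 5
          # Rules 4 and 6; other combinations are allowed by assumption.
          return older != "load_barrier"
      raise ValueError(f"unsupported operation: {younger}")


  print(may_pass("load", "store"))                 # loads assumed not to alias
  print(may_pass("load", "store", noalias=False))  # possible aliasing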