|  | llvm-mca - LLVM Machine Code Analyzer | 
|  | ===================================== | 
|  |  | 
|  | .. program:: llvm-mca | 
|  |  | 
|  | SYNOPSIS | 
|  | -------- | 
|  |  | 
|  | :program:`llvm-mca` [*options*] [input] | 
|  |  | 
|  | DESCRIPTION | 
|  | ----------- | 
|  |  | 
|  | :program:`llvm-mca` is a performance analysis tool that uses information | 
|  | available in LLVM (e.g. scheduling models) to statically measure the performance | 
|  | of machine code on a specific CPU. | 
|  |  | 
|  | Performance is measured in terms of throughput as well as processor resource | 
|  | consumption. The tool currently works for processors with an out-of-order | 
|  | backend, for which there is a scheduling model available in LLVM. | 
|  |  | 
|  | The main goal of this tool is not just to predict the performance of the code | 
|  | when run on the target, but also to help with diagnosing potential performance | 
|  | issues. | 
|  |  | 
|  | Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions | 
|  | Per Cycle (IPC), as well as hardware resource pressure. The analysis and | 
|  | reporting style were inspired by the IACA tool from Intel. | 
|  |  | 
|  | For example, you can compile code with clang, output assembly, and pipe it | 
|  | directly into :program:`llvm-mca` for analysis: | 
|  |  | 
|  | .. code-block:: bash | 
|  |  | 
|  | $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2 | 
|  |  | 
|  | Or for Intel syntax: | 
|  |  | 
|  | .. code-block:: bash | 
|  |  | 
|  | $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2 | 
|  |  | 
|  | Scheduling models are not just used to compute instruction latencies and | 
|  | throughput, but also to understand what processor resources are available | 
|  | and how to simulate them. | 
|  |  | 
|  | By design, the quality of the analysis conducted by :program:`llvm-mca` is | 
|  | inevitably affected by the quality of the scheduling models in LLVM. | 
|  |  | 
|  | If you see that the performance report is not accurate for a processor, | 
|  | please `file a bug <https://bugs.llvm.org/enter_bug.cgi?product=libraries>`_ | 
|  | against the appropriate backend. | 
|  |  | 
|  | OPTIONS | 
|  | ------- | 
|  |  | 
|  | If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard | 
|  | input. Otherwise, it will read from the specified filename. | 
|  |  | 
|  | If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output | 
|  | to standard output if the input is from standard input.  If the :option:`-o` | 
|  | option specifies "``-``", then the output will also be sent to standard output. | 
|  |  | 
|  |  | 
|  | .. option:: -help | 
|  |  | 
|  | Print a summary of command line options. | 
|  |  | 
|  | .. option:: -o <filename> | 
|  |  | 
|  | Use ``<filename>`` as the output filename. See the summary above for more | 
|  | details. | 
|  |  | 
|  | .. option:: -mtriple=<target triple> | 
|  |  | 
|  | Specify a target triple string. | 
|  |  | 
|  | .. option:: -march=<arch> | 
|  |  | 
|  | Specify the architecture for which to analyze the code. It defaults to the | 
|  | host default target. | 
|  |  | 
|  | .. option:: -mcpu=<cpuname> | 
|  |  | 
|  | Specify the processor for which to analyze the code.  By default, the cpu name | 
|  | is autodetected from the host. | 
|  |  | 
|  | .. option:: -output-asm-variant=<variant id> | 
|  |  | 
|  | Specify the output assembly variant for the report generated by the tool. | 
|  | On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly | 
|  | format for the code printed out by the tool in the analysis report; a value | 
|  | of 1 selects the Intel format. | 
|  |  | 
|  | .. option:: -dispatch=<width> | 
|  |  | 
|  | Specify a different dispatch width for the processor. The dispatch width | 
|  | defaults to field 'IssueWidth' in the processor scheduling model.  If width is | 
|  | zero, then the default dispatch width is used. | 
|  |  | 
|  | .. option:: -register-file-size=<size> | 
|  |  | 
|  | Specify the size of the register file. When specified, this flag limits how | 
|  | many physical registers are available for register renaming purposes. A value | 
|  | of zero for this flag means "unlimited number of physical registers". | 
|  |  | 
|  | .. option:: -iterations=<number of iterations> | 
|  |  | 
|  | Specify the number of iterations to run. If this flag is set to 0, then the | 
|  | tool sets the number of iterations to a default value (i.e. 100). | 
|  |  | 
|  | .. option:: -noalias=<bool> | 
|  |  | 
|  | If set, the tool assumes that loads and stores don't alias. This is the | 
|  | default behavior. | 
|  |  | 
|  | .. option:: -lqueue=<load queue size> | 
|  |  | 
|  | Specify the size of the load queue in the load/store unit emulated by the tool. | 
|  | By default, the tool assumes an unbounded number of entries in the load queue. | 
|  | A value of zero for this flag is ignored, and the default load queue size is | 
|  | used instead. | 
|  |  | 
|  | .. option:: -squeue=<store queue size> | 
|  |  | 
|  | Specify the size of the store queue in the load/store unit emulated by the | 
|  | tool. By default, the tool assumes an unbounded number of entries in the store | 
|  | queue. A value of zero for this flag is ignored, and the default store queue | 
|  | size is used instead. | 
|  |  | 
|  | .. option:: -timeline | 
|  |  | 
|  | Enable the timeline view. | 
|  |  | 
|  | .. option:: -timeline-max-iterations=<iterations> | 
|  |  | 
|  | Limit the number of iterations to print in the timeline view. By default, the | 
|  | timeline view prints information for up to 10 iterations. | 
|  |  | 
|  | .. option:: -timeline-max-cycles=<cycles> | 
|  |  | 
|  | Limit the number of cycles in the timeline view. By default, the number of | 
|  | cycles is set to 80. | 
|  |  | 
|  | .. option:: -resource-pressure | 
|  |  | 
|  | Enable the resource pressure view. This is enabled by default. | 
|  |  | 
|  | .. option:: -register-file-stats | 
|  |  | 
|  | Enable register file usage statistics. | 
|  |  | 
|  | .. option:: -dispatch-stats | 
|  |  | 
|  | Enable extra dispatch statistics. This view collects and analyzes instruction | 
|  | dispatch events, as well as static/dynamic dispatch stall events. This view | 
|  | is disabled by default. | 
|  |  | 
|  | .. option:: -scheduler-stats | 
|  |  | 
|  | Enable extra scheduler statistics. This view collects and analyzes instruction | 
|  | issue events. This view is disabled by default. | 
|  |  | 
|  | .. option:: -retire-stats | 
|  |  | 
|  | Enable extra retire control unit statistics. This view is disabled by default. | 
|  |  | 
|  | .. option:: -instruction-info | 
|  |  | 
|  | Enable the instruction info view. This is enabled by default. | 
|  |  | 
|  | .. option:: -all-stats | 
|  |  | 
|  | Print all hardware statistics. This enables extra statistics related to the | 
|  | dispatch logic, the hardware schedulers, the register file(s), and the retire | 
|  | control unit. This option is disabled by default. | 
|  |  | 
|  | .. option:: -all-views | 
|  |  | 
|  | Enable all the views. | 
|  |  | 
|  | .. option:: -instruction-tables | 
|  |  | 
|  | Prints resource pressure information based on the static information | 
|  | available from the processor model. This differs from the resource pressure | 
|  | view because it doesn't require that the code is simulated. It instead prints | 
|  | the theoretical uniform distribution of resource pressure for every | 
|  | instruction in sequence. | 
|  |  | 
|  | .. option:: -bottleneck-analysis | 
|  |  | 
|  | Print information about bottlenecks that affect the throughput. This analysis | 
|  | can be expensive, and it is disabled by default.  Bottlenecks are highlighted | 
|  | in the summary view. | 
|  |  | 
|  |  | 
|  | EXIT STATUS | 
|  | ----------- | 
|  |  | 
|  | :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed | 
|  | to standard error, and the tool returns 1. | 
|  |  | 
|  | USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS | 
|  | --------------------------------------------- | 
|  | :program:`llvm-mca` allows for the optional usage of special code comments to | 
|  | mark regions of the assembly code to be analyzed.  A comment starting with | 
|  | substring ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment | 
|  | starting with substring ``LLVM-MCA-END`` marks the end of a code region.  For | 
|  | example: | 
|  |  | 
|  | .. code-block:: none | 
|  |  | 
|  | # LLVM-MCA-BEGIN | 
|  | ... | 
|  | # LLVM-MCA-END | 
|  |  | 
|  | If no user-defined region is specified, then :program:`llvm-mca` assumes a | 
|  | default region which contains every instruction in the input file.  Every region | 
|  | is analyzed in isolation, and the final performance report is the union of all | 
|  | the reports generated for every code region. | 
|  |  | 
|  | Code regions can have names. For example: | 
|  |  | 
|  | .. code-block:: none | 
|  |  | 
|  | # LLVM-MCA-BEGIN A simple example | 
|  | add %eax, %eax | 
|  | # LLVM-MCA-END | 
|  |  | 
|  | The code from the example above defines a region named "A simple example" with a | 
|  | single instruction in it. Note how the region name doesn't have to be repeated | 
|  | in the ``LLVM-MCA-END`` directive. In the absence of overlapping regions, | 
|  | an anonymous ``LLVM-MCA-END`` directive always ends the currently active | 
|  | user-defined region. | 
|  |  | 
|  | Example of nesting regions: | 
|  |  | 
|  | .. code-block:: none | 
|  |  | 
|  | # LLVM-MCA-BEGIN foo | 
|  | add %eax, %edx | 
|  | # LLVM-MCA-BEGIN bar | 
|  | sub %eax, %edx | 
|  | # LLVM-MCA-END bar | 
|  | # LLVM-MCA-END foo | 
|  |  | 
|  | Example of overlapping regions: | 
|  |  | 
|  | .. code-block:: none | 
|  |  | 
|  | # LLVM-MCA-BEGIN foo | 
|  | add %eax, %edx | 
|  | # LLVM-MCA-BEGIN bar | 
|  | sub %eax, %edx | 
|  | # LLVM-MCA-END foo | 
|  | add %eax, %edx | 
|  | # LLVM-MCA-END bar | 
|  |  | 
|  | Note that multiple anonymous regions cannot overlap. Also, overlapping regions | 
|  | cannot have the same name. | 
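
The marker rules above can be sketched in a few lines of Python. This is an illustration only, not the parser that llvm-mca uses: ``split_regions`` and ``MARKER`` are hypothetical names, only ``#`` comments are recognized, and two behaviors are assumptions that the text does not specify (an anonymous ``LLVM-MCA-END`` closes the most recently opened region, and a region left open runs to the end of the input).

```python
import re

# Hypothetical helper, not llvm-mca's actual parser. It only handles
# '#'-style comments and the marker rules described in the text.
MARKER = re.compile(r"#\s*LLVM-MCA-(BEGIN|END)\b\s*(.*)")


def split_regions(asm_text):
    """Return {region name: [instructions]}; '' names an anonymous region."""
    active = {}       # open regions, in the order they were opened
    closed = {}       # regions that have already seen their END marker
    everything = []   # every instruction, used for the default region
    saw_marker = False
    for raw in asm_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        marker = MARKER.match(line)
        if marker is None:
            if line.startswith("#"):
                continue  # ordinary comment, not a marker
            everything.append(line)
            # An instruction belongs to every region that is currently open.
            for body in active.values():
                body.append(line)
            continue
        saw_marker = True
        kind, name = marker.group(1), marker.group(2).strip()
        if kind == "BEGIN":
            if name in active:
                raise ValueError(f"region {name!r} overlaps itself")
            active[name] = []
        else:
            if not name and "" not in active and active:
                # Assumption: an anonymous END closes the newest open region.
                name = next(reversed(active))
            if name not in active:
                raise ValueError(f"LLVM-MCA-END without BEGIN: {name!r}")
            closed[name] = active.pop(name)
    if not saw_marker:
        # No markers at all: one default region with every instruction.
        return {"": everything}
    closed.update(active)  # assumption: unterminated regions run to the end
    return closed


overlapping = """
# LLVM-MCA-BEGIN foo
add %eax, %edx
# LLVM-MCA-BEGIN bar
sub %eax, %edx
# LLVM-MCA-END foo
add %eax, %edx
# LLVM-MCA-END bar
"""
regions = split_regions(overlapping)
print(regions["foo"])  # ['add %eax, %edx', 'sub %eax, %edx']
print(regions["bar"])  # ['sub %eax, %edx', 'add %eax, %edx']
```

Each region gets its own instruction list, which mirrors the statement that every region is analyzed in isolation.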
|  |  | 
|  | There is no support for marking regions from high-level source code, like C or | 
|  | C++. As a workaround, inline assembly directives may be used: | 
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  | int foo(int a, int b) { | 
|  | __asm volatile("# LLVM-MCA-BEGIN foo"); | 
|  | a += 42; | 
|  | __asm volatile("# LLVM-MCA-END"); | 
|  | a *= b; | 
|  | return a; | 
|  | } | 
|  |  | 
|  | However, this interferes with optimizations like loop vectorization and may have | 
|  | an impact on the code generated. This is because the ``__asm`` statements are | 
|  | seen as real code having important side effects, which limits how the code | 
|  | around them can be transformed. If users want to make use of inline assembly | 
|  | to emit markers, then the recommendation is to always verify that the output | 
|  | assembly is equivalent to the assembly generated in the absence of markers. | 
|  | The `Clang options to emit optimization reports <https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports>`_ | 
|  | can also help in detecting missed optimizations. | 
|  |  | 
|  | HOW LLVM-MCA WORKS | 
|  | ------------------ | 
|  |  | 
|  | :program:`llvm-mca` takes assembly code as input. The assembly code is parsed | 
|  | into a sequence of MCInst with the help of the existing LLVM target assembly | 
|  | parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module | 
|  | to generate a performance report. | 
|  |  | 
|  | The Pipeline module simulates the execution of the machine code sequence in a | 
|  | loop of iterations (default is 100). During this process, the pipeline collects | 
|  | a number of execution related statistics. At the end of this process, the | 
|  | pipeline generates and prints a report from the collected statistics. | 
|  |  | 
|  | Here is an example of a performance report generated by the tool for a | 
|  | dot-product of two packed float vectors of four elements. The analysis is | 
|  | conducted for target x86, cpu btver2.  The report can be reproduced with the | 
|  | command below, using the example located at | 
|  | ``test/tools/llvm-mca/X86/BtVer2/dot-product.s``: | 
|  |  | 
|  | .. code-block:: bash | 
|  |  | 
|  | $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s | 
|  |  | 
|  | .. code-block:: none | 
|  |  | 
|  | Iterations:        300 | 
|  | Instructions:      900 | 
|  | Total Cycles:      610 | 
|  | Total uOps:        900 | 
|  |  | 
|  | Dispatch Width:    2 | 
|  | uOps Per Cycle:    1.48 | 
|  | IPC:               1.48 | 
|  | Block RThroughput: 2.0 | 
|  |  | 
|  |  | 
|  | Instruction Info: | 
|  | [1]: #uOps | 
|  | [2]: Latency | 
|  | [3]: RThroughput | 
|  | [4]: MayLoad | 
|  | [5]: MayStore | 
|  | [6]: HasSideEffects (U) | 
|  |  | 
|  | [1]    [2]    [3]    [4]    [5]    [6]    Instructions: | 
|  | 1      2     1.00                        vmulps	%xmm0, %xmm1, %xmm2 | 
|  | 1      3     1.00                        vhaddps	%xmm2, %xmm2, %xmm3 | 
|  | 1      3     1.00                        vhaddps	%xmm3, %xmm3, %xmm4 | 
|  |  | 
|  |  | 
|  | Resources: | 
|  | [0]   - JALU0 | 
|  | [1]   - JALU1 | 
|  | [2]   - JDiv | 
|  | [3]   - JFPA | 
|  | [4]   - JFPM | 
|  | [5]   - JFPU0 | 
|  | [6]   - JFPU1 | 
|  | [7]   - JLAGU | 
|  | [8]   - JMul | 
|  | [9]   - JSAGU | 
|  | [10]  - JSTC | 
|  | [11]  - JVALU0 | 
|  | [12]  - JVALU1 | 
|  | [13]  - JVIMUL | 
|  |  | 
|  |  | 
|  | Resource pressure per iteration: | 
|  | [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13] | 
|  | -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      - | 
|  |  | 
|  | Resource pressure by instruction: | 
|  | [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions: | 
|  | -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps	%xmm0, %xmm1, %xmm2 | 
|  | -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps	%xmm2, %xmm2, %xmm3 | 
|  | -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps	%xmm3, %xmm3, %xmm4 | 
|  |  | 
|  | According to this report, the dot-product kernel has been executed 300 times, | 
|  | for a total of 900 simulated instructions. The total number of simulated micro | 
|  | opcodes (uOps) is also 900. | 
|  |  | 
|  | The report is structured in three main sections.  The first section collects a | 
|  | few performance numbers; the goal of this section is to give a very quick | 
|  | overview of the performance throughput. Important performance indicators are | 
|  | **IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal | 
|  | Throughput). | 
|  |  | 
|  | IPC is computed by dividing the total number of simulated instructions by the total | 
|  | number of cycles. In the absence of loop-carried data dependencies, the | 
|  | observed IPC tends to a theoretical maximum which can be computed by dividing | 
|  | the number of instructions of a single iteration by the *Block RThroughput*. | 
|  |  | 
|  | Field 'uOps Per Cycle' is computed by dividing the total number of simulated micro | 
|  | opcodes by the total number of cycles. A delta between Dispatch Width and this | 
|  | field is an indicator of a performance issue. In the absence of loop-carried | 
|  | data dependencies, the observed 'uOps Per Cycle' should tend to a theoretical | 
|  | maximum throughput which can be computed by dividing the number of uOps of a | 
|  | single iteration by the *Block RThroughput*. | 
|  |  | 
|  | Field *uOps Per Cycle* is bounded from above by the dispatch width. That is | 
|  | because the dispatch width limits the maximum size of a dispatch group. Both IPC | 
|  | and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The | 
|  | availability of hardware resources affects the resource pressure distribution, | 
|  | and it limits the number of instructions that can be executed in parallel every | 
|  | cycle.  A delta between Dispatch Width and the theoretical maximum uOps per | 
|  | Cycle (computed by dividing the number of uOps of a single iteration by the | 
|  | *Block RThroughput*) is an indicator of a performance bottleneck caused by the | 
|  | lack of hardware resources. | 
|  | In general, the lower the Block RThroughput, the better. | 
|  |  | 
|  | In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there | 
|  | are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to | 
|  | approach 1.50 when the number of iterations tends to infinity. The delta between | 
|  | the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is | 
|  | an indicator of a performance bottleneck caused by the lack of hardware | 
|  | resources, and the *Resource pressure view* can help to identify the problematic | 
|  | resource usage. | 
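
The arithmetic behind these indicators can be checked directly from the summary section of the report. The snippet below is only a worked example; the variable names are chosen for illustration and the input figures are copied from the report above.

```python
# Figures copied from the summary section of the dot-product report.
iterations = 300
instructions = 900
total_uops = 900
total_cycles = 610
dispatch_width = 2
block_rthroughput = 2.0

# Observed throughput.
ipc = instructions / total_cycles
uops_per_cycle = total_uops / total_cycles

# Theoretical maximum in the absence of loop-carried dependencies.
uops_per_iteration = total_uops / iterations
max_uops_per_cycle = uops_per_iteration / block_rthroughput

# Delta between the dispatch width and the theoretical maximum.
delta = dispatch_width - max_uops_per_cycle

print(f"IPC:                    {ipc:.2f}")                 # 1.48
print(f"uOps Per Cycle:         {uops_per_cycle:.2f}")      # 1.48
print(f"Max uOps Per Cycle:     {max_uops_per_cycle:.2f}")  # 1.50
print(f"Delta to dispatch width: {delta:.2f}")              # 0.50
```

The observed 1.48 is already close to the 1.50 ceiling after 300 iterations, while the 0.50 delta is the part of the dispatch width that the kernel cannot use.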
|  |  | 
|  | The second section of the report shows the latency and reciprocal | 
|  | throughput of every instruction in the sequence. That section also reports | 
|  | extra information related to the number of micro opcodes, and opcode properties | 
|  | (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects'). | 
|  |  | 
|  | The third section is the *Resource pressure view*.  This view reports | 
|  | the average number of resource cycles consumed every iteration by instructions | 
|  | for every processor resource unit available on the target.  Information is | 
|  | structured in two tables. The first table reports the number of resource cycles | 
|  | spent on average every iteration. The second table correlates the resource | 
|  | cycles to the machine instruction in the sequence. For example, in every | 
|  | iteration the vmulps instruction executes on resource unit [6] | 
|  | (JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle | 
|  | per iteration.  Note that on AMD Jaguar, vector floating-point multiply can | 
|  | only be issued to pipeline JFPU1, while horizontal floating-point additions can | 
|  | only be issued to pipeline JFPU0. | 
|  |  | 
|  | The resource pressure view helps with identifying bottlenecks caused by high | 
|  | usage of specific hardware resources.  Situations with resource pressure mainly | 
|  | concentrated on a few resources should, in general, be avoided.  Ideally, | 
|  | pressure should be uniformly distributed between multiple resources. | 
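
As a small worked example of how the two tables relate, the per-iteration row can be recovered by summing the per-instruction rows. The data below is copied from the non-empty cells of the report above; the variable names are illustrative only.

```python
from collections import defaultdict

# Non-empty cells of the "Resource pressure by instruction" table.
per_instruction = [
    ("vmulps",  {"JFPM": 1.00, "JFPU1": 1.00}),
    ("vhaddps", {"JFPA": 1.00, "JFPU0": 1.00}),
    ("vhaddps", {"JFPA": 1.00, "JFPU0": 1.00}),
]

# Summing the rows gives the "Resource pressure per iteration" table.
per_iteration = defaultdict(float)
for _mnemonic, cycles in per_instruction:
    for resource, value in cycles.items():
        per_iteration[resource] += value

# The most heavily used resources are the first candidates for a bottleneck.
peak = max(per_iteration.values())
busiest = sorted(r for r, v in per_iteration.items() if v == peak)

print(dict(per_iteration))  # {'JFPM': 1.0, 'JFPU1': 1.0, 'JFPA': 2.0, 'JFPU0': 2.0}
print(busiest, peak)        # ['JFPA', 'JFPU0'] 2.0
```

Here the two horizontal adds concentrate pressure on JFPA and JFPU0, which is the kind of uneven distribution the paragraph above warns about.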
|  |  | 
|  | Timeline View | 
|  | ^^^^^^^^^^^^^ | 
|  | The timeline view produces a detailed report of each instruction's state | 
|  | transitions through an instruction pipeline.  This view is enabled by the | 
|  | command line option ``-timeline``.  As instructions transition through the | 
|  | various stages of the pipeline, their states are depicted in the view report. | 
|  | These states are represented by the following characters: | 
|  |  | 
|  | * D : Instruction dispatched. | 
|  | * e : Instruction executing. | 
|  | * E : Instruction executed. | 
|  | * R : Instruction retired. | 
|  | * = : Instruction already dispatched, waiting to be executed. | 
|  | * \- : Instruction executed, waiting to be retired. | 
|  |  | 
|  | Below is the timeline view for a subset of the dot-product example located in | 
|  | ``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by | 
|  | :program:`llvm-mca` using the following command: | 
|  |  | 
|  | .. code-block:: bash | 
|  |  | 
|  | $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s | 
|  |  | 
|  | .. code-block:: none | 
|  |  | 
|  | Timeline view: | 
|  |                     012345 | 
|  | Index     0123456789 | 
|  |  | 
|  | [0,0]     DeeER.    .    .   vmulps	%xmm0, %xmm1, %xmm2 | 
|  | [0,1]     D==eeeER  .    .   vhaddps	%xmm2, %xmm2, %xmm3 | 
|  | [0,2]     .D====eeeER    .   vhaddps	%xmm3, %xmm3, %xmm4 | 
|  | [1,0]     .DeeE-----R    .   vmulps	%xmm0, %xmm1, %xmm2 | 
|  | [1,1]     . D=eeeE---R   .   vhaddps	%xmm2, %xmm2, %xmm3 | 
|  | [1,2]     . D====eeeER   .   vhaddps	%xmm3, %xmm3, %xmm4 | 
|  | [2,0]     .  DeeE-----R  .   vmulps	%xmm0, %xmm1, %xmm2 | 
|  | [2,1]     .  D====eeeER  .   vhaddps	%xmm2, %xmm2, %xmm3 | 
|  | [2,2]     .   D======eeeER   vhaddps	%xmm3, %xmm3, %xmm4 | 
|  |  | 
|  |  | 
|  | Average Wait times (based on the timeline view): | 
|  | [0]: Executions | 
|  | [1]: Average time spent waiting in a scheduler's queue | 
|  | [2]: Average time spent waiting in a scheduler's queue while ready | 
|  | [3]: Average time elapsed from WB until retire stage | 
|  |  | 
|  |       [0]    [1]    [2]    [3] | 
|  | 0.     3     1.0    1.0    3.3       vmulps	%xmm0, %xmm1, %xmm2 | 
|  | 1.     3     3.3    0.7    1.0       vhaddps	%xmm2, %xmm2, %xmm3 | 
|  | 2.     3     5.7    0.0    0.0       vhaddps	%xmm3, %xmm3, %xmm4 | 
|  |  | 
|  | The timeline view is interesting because it shows instruction state changes | 
|  | during execution.  It also gives an idea of how the tool processes instructions | 
|  | executed on the target, and how their timing information might be calculated. | 
|  |  | 
|  | The timeline view is structured in two tables.  The first table shows | 
|  | instructions changing state over time (measured in cycles); the second table | 
|  | (named *Average Wait times*) reports useful timing statistics, which should | 
|  | help diagnose performance bottlenecks caused by long data dependencies and | 
|  | sub-optimal usage of hardware resources. | 
|  |  | 
|  | An instruction in the timeline view is identified by a pair of indices, where | 
|  | the first index identifies an iteration, and the second index is the | 
|  | instruction index (i.e., where it appears in the code sequence).  Since this | 
|  | example was generated using 3 iterations (``-iterations=3``), the iteration | 
|  | indices range from 0 to 2 inclusive. | 
|  |  | 
|  | Excluding the first and last column, the remaining columns are in cycles. | 
|  | Cycles are numbered sequentially starting from 0. | 
|  |  | 
|  | From the example output above, we know the following: | 
|  |  | 
|  | * Instruction [1,0] was dispatched at cycle 1. | 
|  | * Instruction [1,0] started executing at cycle 2. | 
|  | * Instruction [1,0] reached the write back stage at cycle 4. | 
|  | * Instruction [1,0] was retired at cycle 10. | 
|  |  | 
|  | Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the | 
|  | scheduler's queue for the operands to become available. By the time vmulps is | 
|  | dispatched, operands are already available, and pipeline JFPU1 is ready to | 
|  | serve another instruction.  So the instruction can be immediately issued on the | 
|  | JFPU1 pipeline. That is demonstrated by the fact that the instruction only | 
|  | spent 1cy in the scheduler's queue. | 
|  |  | 
|  | There is a gap of 5 cycles between the write-back stage and the retire event. | 
|  | That is because instructions must retire in program order, so [1,0] has to wait | 
|  | for [0,2] to be retired first (i.e., it has to wait until cycle 10). | 
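
The reading of row [1,0] given above can be reproduced mechanically, because each character of a timeline row corresponds to one cycle. The helper below is a hypothetical sketch written for this example; it is not part of llvm-mca.

```python
def decode_timeline_row(cells):
    """Map a timeline row (one character per cycle) to its event cycles."""
    # The first 'e' (or 'E' for a single-cycle instruction) marks the issue.
    first_exec = min(i for i, c in enumerate(cells) if c in "eE")
    return {
        "dispatched": cells.index("D"),
        "issued": first_exec,
        "write_back": cells.index("E"),
        "retired": cells.index("R"),
    }


row = ".DeeE-----R    ."  # timeline cells of instruction [1,0]
events = decode_timeline_row(row)
queue_cycles = events["issued"] - events["dispatched"]
retire_gap = row.count("-")  # cycles spent executed but not yet retired

print(events)        # {'dispatched': 1, 'issued': 2, 'write_back': 4, 'retired': 10}
print(queue_cycles)  # 1
print(retire_gap)    # 5
```

The single queue cycle and the 5-cycle retire gap match the figures quoted in the text.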
|  |  | 
|  | In the example, all instructions are in a RAW (Read After Write) dependency | 
|  | chain.  Register %xmm2 written by vmulps is immediately used by the first | 
|  | vhaddps, and register %xmm3 written by the first vhaddps is used by the second | 
|  | vhaddps.  Long data dependencies negatively impact the ILP (Instruction Level | 
|  | Parallelism). | 
|  |  | 
|  | In the dot-product example, there are anti-dependencies introduced by | 
|  | instructions from different iterations.  However, those dependencies can be | 
|  | removed at the register renaming stage (at the cost of allocating register aliases, | 
|  | and therefore consuming physical registers). | 
|  |  | 
|  | Table *Average Wait times* helps diagnose performance issues that are caused by | 
|  | the presence of long latency instructions and potentially long data dependencies | 
|  | which may limit the ILP.  Note that :program:`llvm-mca`, by default, assumes at | 
|  | least 1cy between the dispatch event and the issue event. | 
|  |  | 
|  | When the performance is limited by data dependencies and/or long latency | 
|  | instructions, the number of cycles spent while in the *ready* state is expected | 
|  | to be very small when compared with the total number of cycles spent in the | 
|  | scheduler's queue.  The difference between the two counters is a good indicator | 
|  | of how large of an impact data dependencies had on the execution of the | 
|  | instructions.  When performance is mostly limited by the lack of hardware | 
|  | resources, the delta between the two counters is small.  However, the number of | 
|  | cycles spent in the queue tends to be larger (i.e., more than 1-3cy), | 
|  | especially when compared to other low latency instructions. | 
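
Columns [1] and [3] of the *Average Wait times* table can be recomputed from the timeline rows, as sketched below. The row strings are copied from the 3-iteration example above and the helper names are illustrative. Column [2] (time spent waiting while ready) is not visible in the timeline characters, so it is not reproduced here.

```python
from statistics import mean

# Timeline cells from the 3-iteration example, grouped by instruction index.
timeline = {
    0: ["DeeER.    .    .", ".DeeE-----R    .", ".  DeeE-----R  ."],
    1: ["D==eeeER  .    .", ". D=eeeE---R   .", ".  D====eeeER  ."],
    2: [".D====eeeER    .", ". D====eeeER   .", ".   D======eeeER"],
}


def queue_wait(cells):
    """Cycles between the dispatch event and the first execution cycle."""
    first_exec = min(i for i, c in enumerate(cells) if c in "eE")
    return first_exec - cells.index("D")


def retire_wait(cells):
    """Cycles spent executed but waiting to retire ('-' characters)."""
    return cells.count("-")


averages = {
    index: (round(mean(map(queue_wait, rows)), 1),
            round(mean(map(retire_wait, rows)), 1))
    for index, rows in timeline.items()
}
for index, (in_queue, to_retire) in averages.items():
    print(index, in_queue, to_retire)
# 0 1.0 3.3
# 1 3.3 1.0
# 2 5.7 0.0
```

The growing queue time from the first to the last instruction reflects the RAW dependency chain: each instruction waits for the result of the one before it.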
|  |  | 
|  | Extra Statistics to Further Diagnose Performance Issues | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  | The ``-all-stats`` command line option enables extra statistics and performance | 
|  | counters for the dispatch logic, the reorder buffer, the retire control unit, | 
|  | and the register file. | 
|  |  | 
|  | Below is an example of ``-all-stats`` output generated by :program:`llvm-mca` | 
|  | for 300 iterations of the dot-product example discussed in the previous | 
|  | sections. | 
|  |  | 
|  | .. code-block:: none | 
|  |  | 
|  | Dynamic Dispatch Stall Cycles: | 
|  | RAT     - Register unavailable:                      0 | 
|  | RCU     - Retire tokens unavailable:                 0 | 
|  | SCHEDQ  - Scheduler full:                            272  (44.6%) | 
|  | LQ      - Load queue full:                           0 | 
|  | SQ      - Store queue full:                          0 | 
|  | GROUP   - Static restrictions on the dispatch group: 0 | 
|  |  | 
|  |  | 
|  | Dispatch Logic - number of cycles where we saw N micro opcodes dispatched: | 
|  | [# dispatched], [# cycles] | 
|  | 0,              24  (3.9%) | 
|  | 1,              272  (44.6%) | 
|  | 2,              314  (51.5%) | 
|  |  | 
|  |  | 
|  | Schedulers - number of cycles where we saw N micro opcodes issued: | 
|  | [# issued], [# cycles] | 
|  | 0,          7  (1.1%) | 
|  | 1,          306  (50.2%) | 
|  | 2,          297  (48.7%) | 
|  |  | 
|  | Scheduler's queue usage: | 
|  | [1] Resource name. | 
|  | [2] Average number of used buffer entries. | 
|  | [3] Maximum number of used buffer entries. | 
|  | [4] Total number of buffer entries. | 
|  |  | 
|  | [1]            [2]        [3]        [4] | 
|  | JALU01           0          0          20 | 
|  | JFPU01           17         18         18 | 
|  | JLSAGU           0          0          12 | 
|  |  | 
|  |  | 
|  | Retire Control Unit - number of cycles where we saw N instructions retired: | 
|  | [# retired], [# cycles] | 
|  | 0,           109  (17.9%) | 
|  | 1,           102  (16.7%) | 
|  | 2,           399  (65.4%) | 
|  |  | 
|  | Total ROB Entries:                64 | 
|  | Max Used ROB Entries:             35  ( 54.7% ) | 
|  | Average Used ROB Entries per cy:  32  ( 50.0% ) | 
|  |  | 
|  |  | 
|  | Register File statistics: | 
|  | Total number of mappings created:    900 | 
|  | Max number of mappings used:         35 | 
|  |  | 
|  | *  Register File #1 -- JFpuPRF: | 
|  | Number of physical registers:     72 | 
|  | Total number of mappings created: 900 | 
|  | Max number of mappings used:      35 | 
|  |  | 
|  | *  Register File #2 -- JIntegerPRF: | 
|  | Number of physical registers:     64 | 
|  | Total number of mappings created: 0 | 
|  | Max number of mappings used:      0 | 
|  |  | 
|  | If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for | 
|  | SCHEDQ reports 272 cycles.  This counter is incremented every time the dispatch | 
|  | logic is unable to dispatch a full group because the scheduler's queue is full. | 
|  |  | 
|  | Looking at the *Dispatch Logic* table, we see that the pipeline was only able to | 
|  | dispatch two micro opcodes 51.5% of the time.  The dispatch group was limited to | 
|  | one micro opcode 44.6% of the cycles, which corresponds to 272 cycles.  The | 
|  | dispatch statistics are displayed by either using the command option | 
|  | ``-all-stats`` or ``-dispatch-stats``. | 
|  |  | 
|  | The next table, *Schedulers*, presents a histogram of the number of micro | 
|  | opcodes issued per cycle. In this case, of the 610 simulated cycles, a single | 
|  | micro opcode was issued in 306 cycles (50.2%), and there were 7 cycles where | 
|  | no micro opcodes were issued. | 
|  |  | 
|  | The *Scheduler's queue usage* table shows the average and maximum number of | 
|  | buffer entries (i.e., scheduler queue entries) used at runtime.  Resource JFPU01 | 
|  | reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements | 
|  | three schedulers: | 
|  |  | 
|  | * JALU01 - A scheduler for ALU instructions. | 
|  | * JFPU01 - A scheduler for floating point operations. | 
|  | * JLSAGU - A scheduler for address generation. | 
|  |  | 
|  | The dot-product is a kernel of three floating point instructions (a vector | 
|  | multiply followed by two horizontal adds).  That explains why only the floating | 
|  | point scheduler appears to be used. | 
|  |  | 
|  | A full scheduler queue is either caused by data dependency chains or by a | 
|  | sub-optimal usage of hardware resources.  Sometimes, resource pressure can be | 
|  | mitigated by rewriting the kernel using different instructions that consume | 
|  | different scheduler resources.  Schedulers with a small queue are less resilient | 
|  | to bottlenecks caused by the presence of long data dependencies.  The scheduler | 
|  | statistics are displayed by using the command option ``-all-stats`` or | 
|  | ``-scheduler-stats``. | 
|  |  | 
|  | The next table, *Retire Control Unit*, presents a histogram of the number of | 
|  | instructions retired per cycle.  In this case, of the 610 simulated cycles, two | 
|  | instructions were retired in the same cycle 399 times (65.4%), and there were | 
|  | 109 cycles where no instructions | 
|  | were retired.  The retire statistics are displayed by using the command option | 
|  | ``-all-stats`` or ``-retire-stats``. | 
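
The three histograms in the ``-all-stats`` output (dispatched, issued, retired) can be cross-checked against the summary view. The sketch below uses the counts from the example output; the dictionary layout and names are chosen for illustration.

```python
# {events per cycle: number of cycles}, copied from the -all-stats example.
histograms = {
    "dispatched": {0: 24, 1: 272, 2: 314},
    "issued": {0: 7, 1: 306, 2: 297},
    "retired": {0: 109, 1: 102, 2: 399},
}

summary = {}
for name, hist in histograms.items():
    cycles = sum(hist.values())                         # simulated cycles
    events = sum(n * count for n, count in hist.items())  # uOps or instructions
    shares = {n: round(100 * count / cycles, 1) for n, count in hist.items()}
    summary[name] = (cycles, events, shares)
    print(name, cycles, events, shares)
# dispatched 610 900 {0: 3.9, 1: 44.6, 2: 51.5}
# issued 610 900 {0: 1.1, 1: 50.2, 2: 48.7}
# retired 610 900 {0: 17.9, 1: 16.7, 2: 65.4}
```

Every histogram covers the same 610 cycles and accounts for all 900 micro opcodes (or instructions), so the tables are different views of one simulation.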
|  |  | 
|  | The last table presented is *Register File statistics*.  Each physical register | 
|  | file (PRF) used by the pipeline is presented in this table.  In the case of AMD | 
|  | Jaguar, there are two register files, one for floating-point registers (JFpuPRF) | 
|  | and one for integer registers (JIntegerPRF).  The table shows that of the 900 | 
|  | instructions processed, there were 900 mappings created.  Since this dot-product | 
|  | example utilized only floating point registers, the JFpuPRF was responsible for | 
|  | creating the 900 mappings.  However, we see that the pipeline only used a | 
|  | maximum of 35 of 72 available register slots at any given time. We can conclude | 
|  | that the floating point PRF was the only register file used for the example, and | 
|  | that it was never resource constrained.  The register file statistics are | 
|  | displayed by using the command option ``-all-stats`` or | 
|  | ``-register-file-stats``. | 
|  |  | 
|  | In this example, we can conclude that the IPC is mostly limited by data | 
|  | dependencies, and not by resource pressure. | 
|  |  | 
|  | Instruction Flow | 
|  | ^^^^^^^^^^^^^^^^ | 
|  | This section describes the instruction flow through the default pipeline of | 
|  | :program:`llvm-mca`, as well as the functional units involved in the process. | 
|  |  | 
|  | The default pipeline implements the following sequence of stages used to | 
|  | process instructions. | 
|  |  | 
|  | * Dispatch (Instruction is dispatched to the schedulers). | 
|  | * Issue (Instruction is issued to the processor pipelines). | 
|  | * Write Back (Instruction is executed, and results are written back). | 
|  | * Retire (Instruction is retired; writes are architecturally committed). | 
|  |  | 
|  | The default pipeline only models the out-of-order portion of a processor. | 
|  | Therefore, the instruction fetch and decode stages are not modeled. Performance | 
|  | bottlenecks in the frontend are not diagnosed. :program:`llvm-mca` assumes that | 
|  | instructions have all been decoded and placed into a queue before the simulation | 
|  | starts.  Also, :program:`llvm-mca` does not model branch prediction. | 
|  |  | 
|  | Instruction Dispatch | 
|  | """""""""""""""""""" | 
|  | During the dispatch stage, instructions are picked in program order from a | 
|  | queue of already decoded instructions, and dispatched in groups to the | 
|  | simulated hardware schedulers. | 
|  |  | 
|  | The size of a dispatch group depends on the availability of the simulated | 
|  | hardware resources.  The processor dispatch width defaults to the value | 
|  | of the ``IssueWidth`` in LLVM's scheduling model. | 
|  |  | 
|  | An instruction can be dispatched if: | 
|  |  | 
|  | * The size of the dispatch group is smaller than the processor's dispatch width. | 
|  | * There are enough entries in the reorder buffer. | 
|  | * There are enough physical registers to do register renaming. | 
|  | * The schedulers are not full. | 
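
The four conditions above can be read as a single predicate. The following Python sketch is illustrative only: ``DispatchState``, ``can_dispatch``, and all field names are hypothetical and do not mirror :program:`llvm-mca`'s internal classes. It assumes that reorder buffer usage is counted in micro-opcodes, that each register write needs one physical register, and that an instruction needs at least one free scheduler entry.

```python
from dataclasses import dataclass, replace


@dataclass
class DispatchState:
    """Hypothetical per-cycle snapshot of the simulated hardware resources."""
    dispatch_width: int          # maximum size of a dispatch group
    group_size: int              # size of the group built so far this cycle
    free_rob_entries: int        # unused reorder buffer entries
    free_physical_regs: int      # unused physical registers for renaming
    unbounded_regs: bool         # True when the register file size is zero (unbounded)
    free_scheduler_entries: int  # unused scheduler buffer entries


def can_dispatch(state: DispatchState, micro_ops: int, reg_writes: int) -> bool:
    """Return True only if all four dispatch conditions hold for one instruction."""
    if state.group_size >= state.dispatch_width:
        return False  # the dispatch group is already full
    if state.free_rob_entries < micro_ops:
        return False  # not enough reorder buffer entries
    if not state.unbounded_regs and state.free_physical_regs < reg_writes:
        return False  # not enough physical registers for renaming
    if state.free_scheduler_entries < 1:
        return False  # the schedulers are full
    return True


state = DispatchState(dispatch_width=2, group_size=0, free_rob_entries=4,
                      free_physical_regs=1, unbounded_regs=False,
                      free_scheduler_entries=3)
# Same state, but with an unbounded register file.
unbounded = replace(state, unbounded_regs=True)

print(can_dispatch(state, micro_ops=1, reg_writes=2))
print(can_dispatch(unbounded, micro_ops=1, reg_writes=2))
```

In this sketch the first call fails only because one physical register is free while two are needed; removing the register limit lets the same instruction dispatch.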
|  |  | 
|  | Scheduling models can optionally specify which register files are available on | 
|  | the processor. :program:`llvm-mca` uses that information to initialize register | 
|  | file descriptors.  Users can limit the number of physical registers that are | 
|  | globally available for register renaming by using the command option | 
|  | ``-register-file-size``.  A value of zero for this option means *unbounded*. By | 
|  | knowing how many registers are available for renaming, the tool can predict | 
|  | dispatch stalls caused by the lack of physical registers. | 
|  |  | 
|  | The number of reorder buffer entries consumed by an instruction depends on the | 
|  | number of micro-opcodes specified for that instruction by the target scheduling | 
|  | model.  The reorder buffer is responsible for tracking the progress of | 
|  | instructions that are "in-flight", and retiring them in program order.  The | 
|  | number of entries in the reorder buffer defaults to the value specified by field | 
|  | ``MicroOpBufferSize`` in the target scheduling model. | 
|  |  | 
|  | Instructions that are dispatched to the schedulers consume scheduler buffer | 
|  | entries. :program:`llvm-mca` queries the scheduling model to determine the set | 
|  | of buffered resources consumed by an instruction.  Buffered resources are | 
|  | treated like scheduler resources. | 
|  |  | 
|  | Instruction Issue | 
|  | """"""""""""""""" | 
|  | Each processor scheduler implements a buffer of instructions.  An instruction | 
|  | has to wait in the scheduler's buffer until input register operands become | 
|  | available.  Only at that point does the instruction become eligible for | 
|  | execution; it may then be issued (potentially out of order). | 
|  | Instruction latencies are computed by :program:`llvm-mca` with the help of the | 
|  | scheduling model. | 
|  |  | 
|  | :program:`llvm-mca`'s scheduler is designed to simulate multiple processor | 
|  | schedulers.  The scheduler is responsible for tracking data dependencies, and | 
|  | dynamically selecting which processor resources are consumed by instructions. | 
|  | It delegates the management of processor resource units and resource groups to a | 
|  | resource manager.  The resource manager is responsible for selecting resource | 
|  | units that are consumed by instructions.  For example, if an instruction | 
|  | consumes 1cy of a resource group, the resource manager selects one of the | 
|  | available units from the group; by default, the resource manager uses a | 
|  | round-robin selector to guarantee that resource usage is uniformly distributed | 
|  | between all units of a group. | 
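
A round-robin selector of the kind described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the resource manager's actual code: the class ``RoundRobinGroup``, its methods, and the unit names ``ALU0`` and ``ALU1`` are hypothetical, and a unit is modeled as busy until a given cycle number.

```python
class RoundRobinGroup:
    """Hypothetical resource group that hands out its units in round-robin order."""

    def __init__(self, unit_names):
        self.units = list(unit_names)
        self.busy_until = {unit: 0 for unit in self.units}  # cycle at which a unit frees up
        self.next_index = 0                                 # where the next search starts

    def select(self, now, cycles):
        """Reserve the next available unit for `cycles` cycles; return None if all are busy."""
        count = len(self.units)
        for offset in range(count):
            index = (self.next_index + offset) % count
            unit = self.units[index]
            if self.busy_until[unit] <= now:
                self.busy_until[unit] = now + cycles
                # Start the next search after the unit just used.
                self.next_index = (index + 1) % count
                return unit
        return None


group = RoundRobinGroup(["ALU0", "ALU1"])
picks = [group.select(now=0, cycles=1) for _ in range(3)]
picks.append(group.select(now=1, cycles=1))
print(picks)
```

Starting each search after the most recently used unit is what spreads consumption evenly across the units of the group.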
|  |  | 
|  | :program:`llvm-mca`'s scheduler internally groups instructions into three sets: | 
|  |  | 
|  | * WaitSet: a set of instructions whose operands are not ready. | 
|  | * ReadySet: a set of instructions ready to execute. | 
|  | * IssuedSet: a set of instructions executing. | 
|  |  | 
|  | Depending on operand availability, instructions that are dispatched to the | 
|  | scheduler are either placed into the WaitSet or into the ReadySet. | 
|  |  | 
|  | Every cycle, the scheduler checks if instructions can be moved from the WaitSet | 
|  | to the ReadySet, and if instructions from the ReadySet can be issued to the | 
|  | underlying pipelines. The algorithm prioritizes older instructions over younger | 
|  | instructions. | 
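
The movement between the three sets can be sketched as follows. This Python model is a simplification under stated assumptions: ``Instr``, ``Scheduler``, and ``issue_width`` are hypothetical names, pipeline resource availability is reduced to a fixed number of issue slots per cycle, and an instruction's result becomes visible to its consumers once its latency has elapsed.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    index: int                              # program order; a lower index is older
    deps: set = field(default_factory=set)  # indices of producers that have not finished
    latency: int = 1                        # cycles spent executing
    remaining: int = 0                      # cycles left once issued


class Scheduler:
    def __init__(self, issue_width):
        self.issue_width = issue_width      # assumed stand-in for pipeline availability
        self.wait_set, self.ready_set, self.issued_set = [], [], []

    def dispatch(self, instr):
        """Place a dispatched instruction into the ReadySet or the WaitSet."""
        (self.wait_set if instr.deps else self.ready_set).append(instr)

    def cycle(self):
        """Advance one cycle; return (indices issued, indices finished)."""
        # Executing instructions make progress; finished ones leave the IssuedSet.
        for instr in self.issued_set:
            instr.remaining -= 1
        finished = [i for i in self.issued_set if i.remaining == 0]
        self.issued_set = [i for i in self.issued_set if i.remaining > 0]
        # Wake up waiting instructions whose operands are now available.
        done = {i.index for i in finished}
        for instr in self.wait_set:
            instr.deps -= done
        self.ready_set += [i for i in self.wait_set if not i.deps]
        self.wait_set = [i for i in self.wait_set if i.deps]
        # Issue ready instructions, oldest first.
        self.ready_set.sort(key=lambda i: i.index)
        to_issue = self.ready_set[:self.issue_width]
        self.ready_set = self.ready_set[self.issue_width:]
        for instr in to_issue:
            instr.remaining = instr.latency
        self.issued_set += to_issue
        return [i.index for i in to_issue], [i.index for i in finished]


sched = Scheduler(issue_width=1)
sched.dispatch(Instr(0, latency=2))
sched.dispatch(Instr(1, deps={0}))   # waits for instruction 0
sched.dispatch(Instr(2))
trace = [sched.cycle() for _ in range(4)]
print(trace)
```

In the trace, instruction 2 issues ahead of the older instruction 1 because instruction 1 is still in the WaitSet, which is the out-of-order behavior described above.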
|  |  | 
|  | Write-Back and Retire Stage | 
|  | """"""""""""""""""""""""""" | 
|  | Issued instructions are moved from the ReadySet to the IssuedSet.  There, | 
|  | instructions wait until they reach the write-back stage.  At that point, they | 
|  | get removed from the queue and the retire control unit is notified. | 
|  |  | 
|  | When an instruction has executed, the retire control unit flags it as | 
|  | "ready to retire." | 
|  |  | 
|  | Instructions are retired in program order.  The register file is notified of the | 
|  | retirement so that it can free the physical registers that were allocated for | 
|  | the instruction during the register renaming stage. | 
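
In-order retirement can be sketched with a queue kept in program order. The Python below is a minimal model, assuming no limit on the number of instructions retired per cycle. ``RetireControlUnit`` here is a hypothetical class written for this illustration, with the register file reduced to a counter of free physical registers.

```python
from collections import deque


class RetireControlUnit:
    """Hypothetical reorder buffer that retires instructions in program order."""

    def __init__(self, free_registers):
        self.rob = deque()                  # in-flight instructions, oldest first
        self.free_registers = free_registers

    def dispatch(self, index, regs_allocated):
        """Track a dispatched instruction and the registers allocated by renaming."""
        self.free_registers -= regs_allocated
        self.rob.append({"index": index, "regs": regs_allocated, "executed": False})

    def on_executed(self, index):
        """Flag an instruction as ready to retire."""
        for entry in self.rob:
            if entry["index"] == index:
                entry["executed"] = True

    def retire(self):
        """Retire from the head of the buffer while the oldest entry has executed."""
        retired = []
        while self.rob and self.rob[0]["executed"]:
            entry = self.rob.popleft()
            self.free_registers += entry["regs"]  # release the renamed registers
            retired.append(entry["index"])
        return retired


rcu = RetireControlUnit(free_registers=4)
rcu.dispatch(0, regs_allocated=1)
rcu.dispatch(1, regs_allocated=2)
rcu.on_executed(1)
blocked = rcu.retire()      # instruction 1 cannot retire before instruction 0
rcu.on_executed(0)
in_order = rcu.retire()
print(blocked, in_order, rcu.free_registers)
```

The younger instruction finishes first but stays in the buffer until the older one has executed, and its physical registers are released only at that point.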
|  |  | 
|  | Load/Store Unit and Memory Consistency Model | 
|  | """""""""""""""""""""""""""""""""""""""""""" | 
|  | To simulate an out-of-order execution of memory operations, :program:`llvm-mca` | 
|  | utilizes a simulated load/store unit (LSUnit) to simulate the speculative | 
|  | execution of loads and stores. | 
|  |  | 
|  | Each load (or store) consumes an entry in the load (or store) queue. Users can | 
|  | specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the | 
|  | load and store queues respectively. The queues are unbounded by default. | 
|  |  | 
|  | The LSUnit implements a relaxed consistency model for memory loads and stores. | 
|  | The rules are: | 
|  |  | 
|  | 1. A younger load is allowed to pass an older load only if there are no | 
|  | intervening stores or barriers between the two loads. | 
|  | 2. A younger load is allowed to pass an older store provided that the load does | 
|  | not alias with the store. | 
|  | 3. A younger store is not allowed to pass an older store. | 
|  | 4. A younger store is not allowed to pass an older load. | 
|  |  | 
|  | By default, the LSUnit optimistically assumes that loads do not alias store | 
|  | operations (``-noalias=true``).  Under this assumption, younger loads are | 
|  | always allowed to pass older stores.  Essentially, the LSUnit does not attempt | 
|  | to run any alias analysis to predict when loads and stores do not alias with | 
|  | each other. | 
|  |  | 
|  | Note that, in the case of write-combining memory, rule 3 could be relaxed to | 
|  | allow reordering of non-aliasing store operations.  That being said, at the | 
|  | moment, there is no way to further relax the memory model (``-noalias`` is the | 
|  | only option).  Essentially, there is no option to specify a different memory | 
|  | type (e.g., write-back, write-combining, write-through, etc.) and consequently | 
|  | to weaken, or strengthen, the memory model. | 
|  |  | 
|  | Other limitations are: | 
|  |  | 
|  | * The LSUnit does not know when store-to-load forwarding may occur. | 
|  | * The LSUnit does not know anything about cache hierarchy and memory types. | 
|  | * The LSUnit does not know how to identify serializing operations and memory | 
|  | fences. | 
|  |  | 
|  | The LSUnit does not attempt to predict if a load or store hits or misses the L1 | 
|  | cache.  It only knows if an instruction "MayLoad" and/or "MayStore."  For | 
|  | loads, the scheduling model provides an "optimistic" load-to-use latency (which | 
|  | usually matches the load-to-use latency for when there is a hit in the L1D). | 
|  |  | 
|  | :program:`llvm-mca` does not know about serializing operations or | 
|  | memory-barrier-like instructions.  The LSUnit conservatively assumes that an | 
|  | instruction which has both "MayLoad" and unmodeled side effects behaves like a | 
|  | "soft" load-barrier.  That means it serializes loads without forcing a flush of the | 
|  | load queue.  Similarly, instructions that "MayStore" and have unmodeled side | 
|  | effects are treated like store barriers.  A full memory barrier is a "MayLoad" | 
|  | and "MayStore" instruction with unmodeled side effects.  This is inaccurate, but | 
|  | it is the best that we can do at the moment with the current information | 
|  | available in LLVM. | 
|  |  | 
|  | A load/store barrier consumes one entry of the load/store queue.  A load/store | 
|  | barrier enforces ordering of loads/stores.  A younger load cannot pass a load | 
|  | barrier.  Also, a younger store cannot pass a store barrier.  A younger load | 
|  | has to wait for the memory/load barrier to execute.  A load/store barrier is | 
|  | "executed" when it becomes the oldest entry in the load/store queue(s). That | 
|  | also means, by construction, all of the older loads/stores have been executed. | 
|  |  | 
|  | In conclusion, the full set of load/store consistency rules is: | 
|  |  | 
|  | #. A store may not pass a previous store. | 
|  | #. A store may not pass a previous load (regardless of ``-noalias``). | 
|  | #. A store has to wait until an older store barrier is fully executed. | 
|  | #. A load may pass a previous load. | 
|  | #. A load may not pass a previous store unless ``-noalias`` is set. | 
|  | #. A load has to wait until an older load barrier is fully executed. |
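
The six rules above can be expressed as one ordering check. The Python sketch below is an illustration under stated assumptions, not the LSUnit implementation: ``may_issue`` and the string labels are hypothetical, ``older_ops`` lists the kinds of the older memory operations that have not executed yet, and a full memory barrier would be modeled by listing both ``load_barrier`` and ``store_barrier``.

```python
def may_issue(kind, older_ops, noalias=True):
    """Return True if a memory operation may execute ahead of its pending older operations.

    kind      -- "load" or "store"
    older_ops -- kinds of older, still pending operations:
                 "load", "store", "load_barrier" or "store_barrier"
    noalias   -- models the -noalias option (loads are assumed not to alias stores)
    """
    if kind == "store":
        # Rules 1-3: a store waits for older stores, older loads and store barriers.
        blockers = ("store", "load", "store_barrier")
        return not any(op in blockers for op in older_ops)
    if kind == "load":
        # Rule 6: a load waits for an older load barrier.
        if "load_barrier" in older_ops:
            return False
        # Rule 5: a load passes an older store only when no aliasing is assumed.
        if "store" in older_ops and not noalias:
            return False
        # Rule 4: older loads never block a load.
        return True
    raise ValueError(f"unknown memory operation kind: {kind}")


print(may_issue("load", ["store"]))
print(may_issue("load", ["store"], noalias=False))
print(may_issue("store", ["load"]))
```

With the default ``noalias=True`` the first call allows the load to pass the pending store; the other two calls are rejected by rules 5 and 2 respectively.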