|  | ============================= | 
|  | User Guide for AMDGPU Backend | 
|  | ============================= | 
|  |  | 
|  | .. contents:: | 
|  | :local: | 
|  |  | 
|  | Introduction | 
|  | ============ | 
|  |  | 
|  | The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the | 
|  | R600 family up until the current GCN families. It lives in the | 
|  | ``lib/Target/AMDGPU`` directory. | 
|  |  | 
|  | LLVM | 
|  | ==== | 
|  |  | 
|  | .. _amdgpu-target-triples: | 
|  |  | 
|  | Target Triples | 
|  | -------------- | 
|  |  | 
|  | Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to | 
|  | specify the target triple: | 
|  |  | 
|  | .. table:: AMDGPU Architectures | 
|  | :name: amdgpu-architecture-table | 
|  |  | 
|  | ============ ============================================================== | 
|  | Architecture Description | 
|  | ============ ============================================================== | 
|  | ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders. | 
|  | ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders. | 
|  | ============ ============================================================== | 
|  |  | 
|  | .. table:: AMDGPU Vendors | 
|  | :name: amdgpu-vendor-table | 
|  |  | 
|  | ============ ============================================================== | 
|  | Vendor       Description | 
|  | ============ ============================================================== | 
|  | ``amd``      Can be used for all AMD GPU usage. | 
|  | ``mesa3d``   Can be used if the OS is ``mesa3d``. | 
|  | ============ ============================================================== | 
|  |  | 
|  | .. table:: AMDGPU Operating Systems | 
|  | :name: amdgpu-os-table | 
|  |  | 
|  | ============== ============================================================ | 
|  | OS             Description | 
|  | ============== ============================================================ | 
|  | *<empty>*      Defaults to the *unknown* OS. | 
|  | ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes | 
|  | such as AMD's ROCm [AMD-ROCm]_. | 
|  | ``amdpal``     Graphic shaders and compute kernels executed on AMD PAL | 
|  | runtime. | 
|  | ``mesa3d``     Graphic shaders and compute kernels executed on Mesa 3D | 
|  | runtime. | 
|  | ============== ============================================================ | 
|  |  | 
|  | .. table:: AMDGPU Environments | 
|  | :name: amdgpu-environment-table | 
|  |  | 
|  | ============ ============================================================== | 
|  | Environment  Description | 
|  | ============ ============================================================== | 
|  | *<empty>*    Default. | 
|  | ============ ============================================================== | 
|  |  | 
|  | .. _amdgpu-processors: | 
|  |  | 
|  | Processors | 
|  | ---------- | 
|  |  | 
|  | Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The | 
|  | names from both the *Processor* and *Alternative Processor* can be used. | 
|  |  | 
|  | .. table:: AMDGPU Processors | 
|  | :name: amdgpu-processor-table | 
|  |  | 
|  | =========== =============== ============ ===== ========== ======= ====================== | 
|  | Processor   Alternative     Target       dGPU/ Target     ROCm    Example | 
|  | Processor       Triple       APU   Features   Support Products | 
|  | Architecture       Supported | 
|  | [Default] | 
|  | =========== =============== ============ ===== ========== ======= ====================== | 
|  | **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_ | 
|  | ---------------------------------------------------------------------------------------- | 
|  | ``r600``                    ``r600``     dGPU | 
|  | ``r630``                    ``r600``     dGPU | 
|  | ``rs880``                   ``r600``     dGPU | 
|  | ``rv670``                   ``r600``     dGPU | 
|  | **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_ | 
|  | ---------------------------------------------------------------------------------------- | 
|  | ``rv710``                   ``r600``     dGPU | 
|  | ``rv730``                   ``r600``     dGPU | 
|  | ``rv770``                   ``r600``     dGPU | 
|  | **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_ | 
|  | ---------------------------------------------------------------------------------------- | 
|  | ``cedar``                   ``r600``     dGPU | 
|  | ``cypress``                 ``r600``     dGPU | 
|  | ``juniper``                 ``r600``     dGPU | 
|  | ``redwood``                 ``r600``     dGPU | 
|  | ``sumo``                    ``r600``     dGPU | 
|  | **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_ | 
|  | ---------------------------------------------------------------------------------------- | 
|  | ``barts``                   ``r600``     dGPU | 
|  | ``caicos``                  ``r600``     dGPU | 
|  | ``cayman``                  ``r600``     dGPU | 
|  | ``turks``                   ``r600``     dGPU | 
|  | **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_ | 
|  | ---------------------------------------------------------------------------------------- | 
|  | ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU | 
|  | ``gfx601``  - ``hainan``    ``amdgcn``   dGPU | 
|  | - ``oland`` | 
|  | - ``pitcairn`` | 
|  | - ``verde`` | 
|  | **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_ | 
|  | ---------------------------------------------------------------------------------------- | 
|  | ``gfx700``  - ``kaveri``    ``amdgcn``   APU                      - A6-7000 | 
|  | - A6 Pro-7050B | 
|  | - A8-7100 | 
|  | - A8 Pro-7150B | 
|  | - A10-7300 | 
|  | - A10 Pro-7350B | 
|  | - FX-7500 | 
|  | - A8-7200P | 
|  | - A10-7400P | 
|  | - FX-7600P | 
|  | ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU             ROCm    - FirePro W8100 | 
|  | - FirePro W9100 | 
|  | - FirePro S9150 | 
|  | - FirePro S9170 | 
|  | ``gfx702``                  ``amdgcn``   dGPU             ROCm    - Radeon R9 290 | 
|  | - Radeon R9 290x | 
|  | - Radeon R390 | 
|  | - Radeon R390x | 
|  | ``gfx703``  - ``kabini``    ``amdgcn``   APU                      - E1-2100 | 
|  | - ``mullins``                                         - E1-2200 | 
|  | - E1-2500 | 
|  | - E2-3000 | 
|  | - E2-3800 | 
|  | - A4-5000 | 
|  | - A4-5100 | 
|  | - A6-5200 | 
|  | - A4 Pro-3340B | 
|  | ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                     - Radeon HD 7790 | 
|  | - Radeon HD 8770 | 
|  | - R7 260 | 
|  | - R7 260X | 
|  | **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_ | 
|  | ---------------------------------------------------------------------------------------- | 
|  | ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack            - A6-8500P | 
|  | [on]             - Pro A6-8500B | 
|  | - A8-8600P | 
|  | - Pro A8-8600B | 
|  | - FX-8800P | 
|  | - Pro A12-8800B | 
|  | \                           ``amdgcn``   APU   - xnack    ROCm    - A10-8700P | 
|  | [on]             - Pro A10-8700B | 
|  | - A10-8780P | 
|  | \                           ``amdgcn``   APU   - xnack            - A10-9600P | 
|  | [on]             - A10-9630P | 
|  | - A12-9700P | 
|  | - A12-9730P | 
|  | - FX-9800P | 
|  | - FX-9830P | 
|  | \                           ``amdgcn``   APU   - xnack            - E2-9010 | 
|  | [on]             - A6-9210 | 
|  | - A9-9410 | 
|  | ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack    ROCm    - FirePro S7150 | 
|  | - ``tonga``                          [off]            - FirePro S7100 | 
|  | - FirePro W7100 | 
|  | - Radeon R285 | 
|  | - Radeon R9 380 | 
|  | - Radeon R9 385 | 
|  | - Mobile FirePro | 
|  | M7170 | 
|  | ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack    ROCm    - Radeon R9 Nano | 
|  | [off]            - Radeon R9 Fury | 
|  | - Radeon R9 FuryX | 
|  | - Radeon Pro Duo | 
|  | - FirePro S9300x2 | 
|  | - Radeon Instinct MI8 | 
|  | \           - ``polaris10`` ``amdgcn``   dGPU  - xnack    ROCm    - Radeon RX 470 | 
|  | [off]            - Radeon RX 480 | 
|  | - Radeon Instinct MI6 | 
|  | \           - ``polaris11`` ``amdgcn``   dGPU  - xnack    ROCm    - Radeon RX 460 | 
|  | [off] | 
|  | ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack | 
|  | [on] | 
|  | **GCN GFX9** [AMD-GCN-GFX9]_ | 
|  | ---------------------------------------------------------------------------------------- | 
|  | ``gfx900``                  ``amdgcn``   dGPU  - xnack    ROCm    - Radeon Vega | 
|  | [off]              Frontier Edition | 
|  | - Radeon RX Vega 56 | 
|  | - Radeon RX Vega 64 | 
|  | - Radeon RX Vega 64 | 
|  | Liquid | 
|  | - Radeon Instinct MI25 | 
|  | ``gfx902``                  ``amdgcn``   APU   - xnack            - Ryzen 3 2200G | 
|  | [on]             - Ryzen 5 2400G | 
|  | ``gfx904``                  ``amdgcn``   dGPU  - xnack            *TBA* | 
|  | [off] | 
|  | .. TODO | 
|  | Add product | 
|  | names. | 
|  | ``gfx906``                  ``amdgcn``   dGPU  - xnack            - Radeon Instinct MI50 | 
|  | [off]            - Radeon Instinct MI60 | 
|  | sram-ecc | 
|  | [on] | 
|  | ``gfx909``                  ``amdgcn``   APU   - xnack            *TBA* (Raven Ridge 2) | 
|  | [on] | 
|  | .. TODO | 
|  | Add product | 
|  | names. | 
|  | =========== =============== ============ ===== ========== ======= ====================== | 
|  |  | 
|  | .. _amdgpu-target-features: | 
|  |  | 
|  | Target Features | 
|  | --------------- | 
|  |  | 
|  | Target features control how code is generated to support certain | 
|  | processor specific features. Not all target features are supported by | 
|  | all processors. The runtime must ensure that the features supported by | 
|  | the device used to execute the code match the features enabled when | 
|  | generating the code. A mismatch of features may result in incorrect | 
|  | execution, or a reduction in performance. | 
|  |  | 
|  | The target features supported by each processor, and the default value | 
|  | used if not specified explicitly, is listed in | 
|  | :ref:`amdgpu-processor-table`. | 
|  |  | 
|  | Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU | 
|  | target features. | 
|  |  | 
|  | For example: | 
|  |  | 
|  | ``-mxnack`` | 
|  | Enable the ``xnack`` feature. | 
|  | ``-mno-xnack`` | 
|  | Disable the ``xnack`` feature. | 
|  |  | 
|  | .. table:: AMDGPU Target Features | 
|  | :name: amdgpu-target-feature-table | 
|  |  | 
|  | =============== ================================================== | 
|  | Target Feature  Description | 
|  | =============== ================================================== | 
|  | -m[no-]xnack    Enable/disable generating code that has | 
|  | memory clauses that are compatible with | 
|  | having XNACK replay enabled. | 
|  |  | 
|  | This is used for demand paging and page | 
|  | migration. If XNACK replay is enabled in | 
|  | the device, then if a page fault occurs | 
|  | the code may execute incorrectly if the | 
|  | ``xnack`` feature is not enabled. Executing | 
|  | code that has the feature enabled on a | 
|  | device that does not have XNACK replay | 
|  | enabled will execute correctly, but may | 
|  | be less performant than code with the | 
|  | feature disabled. | 
|  | -m[no-]sram-ecc Enable/disable generating code that assumes SRAM | 
|  | ECC is enabled/disabled. | 
|  | =============== ================================================== | 
|  |  | 
|  | .. _amdgpu-address-spaces: | 
|  |  | 
|  | Address Spaces | 
|  | -------------- | 
|  |  | 
|  | The AMDGPU backend uses the following address space mappings. | 
|  |  | 
|  | The memory space names used in the table, aside from the region memory space, is | 
|  | from the OpenCL standard. | 
|  |  | 
|  | LLVM Address Space number is used throughout LLVM (for example, in LLVM IR). | 
|  |  | 
|  | .. table:: Address Space Mapping | 
|  | :name: amdgpu-address-space-mapping-table | 
|  |  | 
|  | ================== ================= | 
|  | LLVM Address Space Memory Space | 
|  | ================== ================= | 
|  | 0                  Generic (Flat) | 
|  | 1                  Global | 
|  | 2                  Region (GDS) | 
|  | 3                  Local (group/LDS) | 
|  | 4                  Constant | 
|  | 5                  Private (Scratch) | 
|  | 6                  Constant 32-bit | 
|  | ================== ================= | 
|  |  | 
|  | .. _amdgpu-memory-scopes: | 
|  |  | 
|  | Memory Scopes | 
|  | ------------- | 
|  |  | 
|  | This section provides LLVM memory synchronization scopes supported by the AMDGPU | 
|  | backend memory model when the target triple OS is ``amdhsa`` (see | 
|  | :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`). | 
|  |  | 
|  | The memory model supported is based on the HSA memory model [HSA]_ which is | 
|  | based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before | 
|  | relation is transitive over the synchonizes-with relation independent of scope, | 
|  | and synchonizes-with allows the memory scope instances to be inclusive (see | 
|  | table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`). | 
|  |  | 
|  | This is different to the OpenCL [OpenCL]_ memory model which does not have scope | 
|  | inclusion and requires the memory scopes to exactly match. However, this | 
|  | is conservatively correct for OpenCL. | 
|  |  | 
|  | .. table:: AMDHSA LLVM Sync Scopes | 
|  | :name: amdgpu-amdhsa-llvm-sync-scopes-table | 
|  |  | 
|  | ================ ========================================================== | 
|  | LLVM Sync Scope  Description | 
|  | ================ ========================================================== | 
|  | *none*           The default: ``system``. | 
|  |  | 
|  | Synchronizes with, and participates in modification and | 
|  | seq_cst total orderings with, other operations (except | 
|  | image operations) for all address spaces (except private, | 
|  | or generic that accesses private) provided the other | 
|  | operation's sync scope is: | 
|  |  | 
|  | - ``system``. | 
|  | - ``agent`` and executed by a thread on the same agent. | 
|  | - ``workgroup`` and executed by a thread in the same | 
|  | workgroup. | 
|  | - ``wavefront`` and executed by a thread in the same | 
|  | wavefront. | 
|  |  | 
|  | ``agent``        Synchronizes with, and participates in modification and | 
|  | seq_cst total orderings with, other operations (except | 
|  | image operations) for all address spaces (except private, | 
|  | or generic that accesses private) provided the other | 
|  | operation's sync scope is: | 
|  |  | 
|  | - ``system`` or ``agent`` and executed by a thread on the | 
|  | same agent. | 
|  | - ``workgroup`` and executed by a thread in the same | 
|  | workgroup. | 
|  | - ``wavefront`` and executed by a thread in the same | 
|  | wavefront. | 
|  |  | 
|  | ``workgroup``    Synchronizes with, and participates in modification and | 
|  | seq_cst total orderings with, other operations (except | 
|  | image operations) for all address spaces (except private, | 
|  | or generic that accesses private) provided the other | 
|  | operation's sync scope is: | 
|  |  | 
|  | - ``system``, ``agent`` or ``workgroup`` and executed by a | 
|  | thread in the same workgroup. | 
|  | - ``wavefront`` and executed by a thread in the same | 
|  | wavefront. | 
|  |  | 
|  | ``wavefront``    Synchronizes with, and participates in modification and | 
|  | seq_cst total orderings with, other operations (except | 
|  | image operations) for all address spaces (except private, | 
|  | or generic that accesses private) provided the other | 
|  | operation's sync scope is: | 
|  |  | 
|  | - ``system``, ``agent``, ``workgroup`` or ``wavefront`` | 
|  | and executed by a thread in the same wavefront. | 
|  |  | 
|  | ``singlethread`` Only synchronizes with, and participates in modification | 
|  | and seq_cst total orderings with, other operations (except | 
|  | image operations) running in the same thread for all | 
|  | address spaces (for example, in signal handlers). | 
|  | ================ ========================================================== | 
|  |  | 
|  | AMDGPU Intrinsics | 
|  | ----------------- | 
|  |  | 
|  | The AMDGPU backend implements the following LLVM IR intrinsics. | 
|  |  | 
|  | *This section is WIP.* | 
|  |  | 
|  | .. TODO | 
|  | List AMDGPU intrinsics | 
|  |  | 
|  | AMDGPU Attributes | 
|  | ----------------- | 
|  |  | 
|  | The AMDGPU backend supports the following LLVM IR attributes. | 
|  |  | 
|  | .. table:: AMDGPU LLVM IR Attributes | 
|  | :name: amdgpu-llvm-ir-attributes-table | 
|  |  | 
|  | ======================================= ========================================================== | 
|  | LLVM Attribute                          Description | 
|  | ======================================= ========================================================== | 
|  | "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that | 
|  | will be specified when the kernel is dispatched. Generated | 
|  | by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_. | 
|  | "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel | 
|  | argument block size for the implicit arguments. This | 
|  | varies by OS and language (for OpenCL see | 
|  | :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`). | 
|  | "amdgpu-max-work-group-size"="n"        Specify the maximum work-group size that will be specifed | 
|  | when the kernel is dispatched. | 
|  | "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by | 
|  | the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_. | 
|  | "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the | 
|  | ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_. | 
|  | "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per | 
|  | execution unit. Generated by the ``amdgpu_waves_per_eu`` | 
|  | CLANG attribute [CLANG-ATTR]_. | 
|  | ======================================= ========================================================== | 
|  |  | 
|  | Code Object | 
|  | =========== | 
|  |  | 
|  | The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that | 
|  | can be linked by ``lld`` to produce a standard ELF shared code object which can | 
|  | be loaded and executed on an AMDGPU target. | 
|  |  | 
|  | Header | 
|  | ------ | 
|  |  | 
|  | The AMDGPU backend uses the following ELF header: | 
|  |  | 
|  | .. table:: AMDGPU ELF Header | 
|  | :name: amdgpu-elf-header-table | 
|  |  | 
|  | ========================== =============================== | 
|  | Field                      Value | 
|  | ========================== =============================== | 
|  | ``e_ident[EI_CLASS]``      ``ELFCLASS64`` | 
|  | ``e_ident[EI_DATA]``       ``ELFDATA2LSB`` | 
|  | ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE`` | 
|  | - ``ELFOSABI_AMDGPU_HSA`` | 
|  | - ``ELFOSABI_AMDGPU_PAL`` | 
|  | - ``ELFOSABI_AMDGPU_MESA3D`` | 
|  | ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA`` | 
|  | - ``ELFABIVERSION_AMDGPU_PAL`` | 
|  | - ``ELFABIVERSION_AMDGPU_MESA3D`` | 
|  | ``e_type``                 - ``ET_REL`` | 
|  | - ``ET_DYN`` | 
|  | ``e_machine``              ``EM_AMDGPU`` | 
|  | ``e_entry``                0 | 
|  | ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table` | 
|  | ========================== =============================== | 
|  |  | 
|  | .. | 
|  |  | 
|  | .. table:: AMDGPU ELF Header Enumeration Values | 
|  | :name: amdgpu-elf-header-enumeration-values-table | 
|  |  | 
|  | =============================== ===== | 
|  | Name                            Value | 
|  | =============================== ===== | 
|  | ``EM_AMDGPU``                   224 | 
|  | ``ELFOSABI_NONE``               0 | 
|  | ``ELFOSABI_AMDGPU_HSA``         64 | 
|  | ``ELFOSABI_AMDGPU_PAL``         65 | 
|  | ``ELFOSABI_AMDGPU_MESA3D``      66 | 
|  | ``ELFABIVERSION_AMDGPU_HSA``    1 | 
|  | ``ELFABIVERSION_AMDGPU_PAL``    0 | 
|  | ``ELFABIVERSION_AMDGPU_MESA3D`` 0 | 
|  | =============================== ===== | 
|  |  | 
|  | ``e_ident[EI_CLASS]`` | 
|  | The ELF class is: | 
|  |  | 
|  | * ``ELFCLASS32`` for ``r600`` architecture. | 
|  |  | 
|  | * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64 | 
|  | bit applications. | 
|  |  | 
|  | ``e_ident[EI_DATA]`` | 
|  | All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering. | 
|  |  | 
|  | ``e_ident[EI_OSABI]`` | 
|  | One of the following AMD GPU architecture specific OS ABIs | 
|  | (see :ref:`amdgpu-os-table`): | 
|  |  | 
|  | * ``ELFOSABI_NONE`` for *unknown* OS. | 
|  |  | 
|  | * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS. | 
|  |  | 
|  | * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS. | 
|  |  | 
|  | * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS. | 
|  |  | 
|  | ``e_ident[EI_ABIVERSION]`` | 
|  | The ABI version of the AMD GPU architecture specific OS ABI to which the code | 
|  | object conforms: | 
|  |  | 
|  | * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA | 
|  | runtime ABI. | 
|  |  | 
|  | * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL | 
|  | runtime ABI. | 
|  |  | 
|  | * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA | 
|  | 3D runtime ABI. | 
|  |  | 
|  | ``e_type`` | 
|  | Can be one of the following values: | 
|  |  | 
|  |  | 
|  | ``ET_REL`` | 
|  | The type produced by the AMD GPU backend compiler as it is relocatable code | 
|  | object. | 
|  |  | 
|  | ``ET_DYN`` | 
|  | The type produced by the linker as it is a shared code object. | 
|  |  | 
|  | The AMD HSA runtime loader requires a ``ET_DYN`` code object. | 
|  |  | 
|  | ``e_machine`` | 
|  | The value ``EM_AMDGPU`` is used for the machine for all processors supported | 
|  | by the ``r600`` and ``amdgcn`` architectures (see | 
|  | :ref:`amdgpu-processor-table`). The specific processor is specified in the | 
|  | ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see | 
|  | :ref:`amdgpu-elf-header-e_flags-table`). | 
|  |  | 
|  | ``e_entry`` | 
|  | The entry point is 0 as the entry points for individual kernels must be | 
|  | selected in order to invoke them through AQL packets. | 
|  |  | 
|  | ``e_flags`` | 
|  | The AMDGPU backend uses the following ELF header flags: | 
|  |  | 
|  | .. table:: AMDGPU ELF Header ``e_flags`` | 
|  | :name: amdgpu-elf-header-e_flags-table | 
|  |  | 
|  | ================================= ========== ============================= | 
|  | Name                              Value      Description | 
|  | ================================= ========== ============================= | 
|  | **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`. | 
|  | -------------------------------------------- ----------------------------- | 
|  | ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection | 
|  | mask for | 
|  | ``EF_AMDGPU_MACH_xxx`` values | 
|  | defined in | 
|  | :ref:`amdgpu-ef-amdgpu-mach-table`. | 
|  | ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack`` | 
|  | target feature is | 
|  | enabled for all code | 
|  | contained in the code object. | 
|  | If the processor | 
|  | does not support the | 
|  | ``xnack`` target | 
|  | feature then must | 
|  | be 0. | 
|  | See | 
|  | :ref:`amdgpu-target-features`. | 
|  | ``EF_AMDGPU_SRAM_ECC``            0x00000200 Indicates if the ``sram-ecc`` | 
|  | target feature is | 
|  | enabled for all code | 
|  | contained in the code object. | 
|  | If the processor | 
|  | does not support the | 
|  | ``sram-ecc`` target | 
|  | feature then must | 
|  | be 0. | 
|  | See | 
|  | :ref:`amdgpu-target-features`. | 
|  | ================================= ========== ============================= | 
|  |  | 
|  | .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values | 
|  | :name: amdgpu-ef-amdgpu-mach-table | 
|  |  | 
|  | ================================= ========== ============================= | 
|  | Name                              Value      Description (see | 
|  | :ref:`amdgpu-processor-table`) | 
|  | ================================= ========== ============================= | 
|  | ``EF_AMDGPU_MACH_NONE``           0x000      *not specified* | 
|  | ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600`` | 
|  | ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630`` | 
|  | ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880`` | 
|  | ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670`` | 
|  | ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710`` | 
|  | ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730`` | 
|  | ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770`` | 
|  | ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar`` | 
|  | ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress`` | 
|  | ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper`` | 
|  | ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood`` | 
|  | ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo`` | 
|  | ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts`` | 
|  | ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos`` | 
|  | ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman`` | 
|  | ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks`` | 
|  | *reserved*                        0x011 -    Reserved for ``r600`` | 
|  | 0x01f      architecture processors. | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600`` | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601`` | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700`` | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701`` | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702`` | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703`` | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704`` | 
|  | *reserved*                        0x027      Reserved. | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801`` | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802`` | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803`` | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810`` | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900`` | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902`` | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX904``  0x02e      ``gfx904`` | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX906``  0x02f      ``gfx906`` | 
|  | *reserved*                        0x030      Reserved. | 
|  | ``EF_AMDGPU_MACH_AMDGCN_GFX909``  0x031      ``gfx909`` | 
|  | ================================= ========== ============================= | 
|  |  | 
|  | Sections | 
|  | -------- | 
|  |  | 
|  | An AMDGPU target ELF code object has the standard ELF sections which include: | 
|  |  | 
|  | .. table:: AMDGPU ELF Sections | 
|  | :name: amdgpu-elf-sections-table | 
|  |  | 
|  | ================== ================ ================================= | 
|  | Name               Type             Attributes | 
|  | ================== ================ ================================= | 
|  | ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE`` | 
|  | ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` | 
|  | ``.debug_``\ *\**  ``SHT_PROGBITS`` *none* | 
|  | ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC`` | 
|  | ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC`` | 
|  | ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC`` | 
|  | ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` | 
|  | ``.hash``          ``SHT_HASH``     ``SHF_ALLOC`` | 
|  | ``.note``          ``SHT_NOTE``     *none* | 
|  | ``.rela``\ *name*  ``SHT_RELA``     *none* | 
|  | ``.rela.dyn``      ``SHT_RELA``     *none* | 
|  | ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC`` | 
|  | ``.shstrtab``      ``SHT_STRTAB``   *none* | 
|  | ``.strtab``        ``SHT_STRTAB``   *none* | 
|  | ``.symtab``        ``SHT_SYMTAB``   *none* | 
|  | ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR`` | 
|  | ================== ================ ================================= | 
|  |  | 
|  | These sections have their standard meanings (see [ELF]_) and are only generated | 
|  | if needed. | 
|  |  | 
|  | ``.debug``\ *\** | 
|  | The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the | 
|  | DWARF produced by the AMDGPU backend. | 
|  |  | 
|  | ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash`` | 
|  | The standard sections used by a dynamic loader. | 
|  |  | 
|  | ``.note`` | 
|  | See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU | 
|  | backend. | 
|  |  | 
|  | ``.rela``\ *name*, ``.rela.dyn`` | 
|  | For relocatable code objects, *name* is the name of the section that the | 
|  | relocation records apply. For example, ``.rela.text`` is the section name for | 
|  | relocation records associated with the ``.text`` section. | 
|  |  | 
|  | For linked shared code objects, ``.rela.dyn`` contains all the relocation | 
|  | records from each of the relocatable code object's ``.rela``\ *name* sections. | 
|  |  | 
|  | See :ref:`amdgpu-relocation-records` for the relocation records supported by | 
|  | the AMDGPU backend. | 
|  |  | 
|  | ``.text`` | 
|  | The executable machine code for the kernels and functions they call. Generated | 
|  | as position independent code. See :ref:`amdgpu-code-conventions` for | 
|  | information on conventions used in the isa generation. | 
|  |  | 
|  | .. _amdgpu-note-records: | 
|  |  | 
|  | Note Records | 
|  | ------------ | 
|  |  | 
|  | As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must | 
|  | be generated after the ``name`` field to ensure the ``desc`` field is 4 byte | 
|  | aligned. In addition, minimal zero byte padding must be generated to ensure the | 
|  | ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the | 
|  | ``.note`` section must be at least 4 to indicate at least 8 byte alignment. | 
|  |  | 
|  | .. _amdgpu-note-records-v2: | 
|  |  | 
|  | Code Object V2 Note Records (-mattr=-code-object-v3) | 
|  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | The AMDGPU backend code object uses the following ELF note record in the | 
|  | ``.note`` section. | 
|  |  | 
|  | Additional note records can be present. | 
|  |  | 
|  | .. table:: AMDGPU Code Object V2 ELF Note Records | 
|  | :name: amdgpu-elf-note-records-table-v2 | 
|  |  | 
|  | ===== ============================== ====================================== | 
|  | Name  Type                           Description | 
|  | ===== ============================== ====================================== | 
|  | "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string> | 
|  | ===== ============================== ====================================== | 
|  |  | 
|  | .. | 
|  |  | 
|  | .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values | 
|  | :name: amdgpu-elf-note-record-enumeration-values-table-v2 | 
|  |  | 
|  | ============================== ===== | 
|  | Name                           Value | 
|  | ============================== ===== | 
|  | *reserved*                       0-9 | 
|  | ``NT_AMD_AMDGPU_HSA_METADATA``    10 | 
|  | *reserved*                        11 | 
|  | ============================== ===== | 
|  |  | 
|  | ``NT_AMD_AMDGPU_HSA_METADATA`` | 
|  | Specifies extensible metadata associated with the code objects executed on HSA | 
|  | [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when | 
|  | the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See | 
|  | :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code | 
|  | object metadata string. | 
|  |  | 
|  | .. _amdgpu-note-records-v3: | 
|  |  | 
|  | Code Object V3 Note Records (-mattr=+code-object-v3) | 
|  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | The AMDGPU backend code object uses the following ELF note record in the | 
|  | ``.note`` section. | 
|  |  | 
|  | Additional note records can be present. | 
|  |  | 
|  | .. table:: AMDGPU Code Object V3 ELF Note Records | 
|  | :name: amdgpu-elf-note-records-table-v3 | 
|  |  | 
|  | ======== ============================== ====================================== | 
|  | Name     Type                           Description | 
|  | ======== ============================== ====================================== | 
|  | "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_ | 
|  | binary format. | 
|  | ======== ============================== ====================================== | 
|  |  | 
|  | .. | 
|  |  | 
|  | .. table:: AMDGPU Code Object V3 ELF Note Record Enumeration Values | 
|  | :name: amdgpu-elf-note-record-enumeration-values-table-v3 | 
|  |  | 
|  | ============================== ===== | 
|  | Name                           Value | 
|  | ============================== ===== | 
|  | *reserved*                     0-31 | 
|  | ``NT_AMDGPU_METADATA``         32 | 
|  | ============================== ===== | 
|  |  | 
|  | ``NT_AMDGPU_METADATA`` | 
|  | Specifies extensible metadata associated with an AMDGPU code | 
|  | object. It is encoded as a map in the Message Pack [MsgPack]_ binary | 
|  | data format. See :ref:`amdgpu-amdhsa-code-object-metadata-v3` for the | 
|  | map keys defined for the ``amdhsa`` OS. | 
|  |  | 
|  | .. _amdgpu-symbols: | 
|  |  | 
|  | Symbols | 
|  | ------- | 
|  |  | 
|  | Symbols include the following: | 
|  |  | 
|  | .. table:: AMDGPU ELF Symbols | 
|  | :name: amdgpu-elf-symbols-table | 
|  |  | 
|  | ===================== ============== ============= ================== | 
|  | Name                  Type           Section       Description | 
|  | ===================== ============== ============= ================== | 
|  | *link-name*           ``STT_OBJECT`` - ``.data``   Global variable | 
|  | - ``.rodata`` | 
|  | - ``.bss`` | 
|  | *link-name*\ ``.kd``  ``STT_OBJECT`` - ``.rodata`` Kernel descriptor | 
|  | *link-name*           ``STT_FUNC``   - ``.text``   Kernel entry point | 
|  | ===================== ============== ============= ================== | 
|  |  | 
|  | Global variable | 
|  | Global variables both used and defined by the compilation unit. | 
|  |  | 
|  | If the symbol is defined in the compilation unit then it is allocated in the | 
|  | appropriate section according to if it has initialized data or is readonly. | 
|  |  | 
|  | If the symbol is external then its section is ``STN_UNDEF`` and the loader | 
|  | will resolve relocations using the definition provided by another code object | 
|  | or explicitly defined by the runtime. | 
|  |  | 
|  | All global symbols, whether defined in the compilation unit or external, are | 
|  | accessed by the machine code indirectly through a GOT table entry. This | 
|  | allows them to be preemptable. The GOT table is only supported when the target | 
|  | triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). | 
|  |  | 
|  | .. TODO | 
|  | Add description of linked shared object symbols. Seems undefined symbols | 
|  | are marked as STT_NOTYPE. | 
|  |  | 
|  | Kernel descriptor | 
|  | Every HSA kernel has an associated kernel descriptor. It is the address of the | 
|  | kernel descriptor that is used in the AQL dispatch packet used to invoke the | 
|  | kernel, not the kernel entry point. The layout of the HSA kernel descriptor is | 
|  | defined in :ref:`amdgpu-amdhsa-kernel-descriptor`. | 
|  |  | 
|  | Kernel entry point | 
|  | Every HSA kernel also has a symbol for its machine code entry point. | 
|  |  | 
|  | .. _amdgpu-relocation-records: | 
|  |  | 
|  | Relocation Records | 
|  | ------------------ | 
|  |  | 
|  | AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported | 
|  | relocatable fields are: | 
|  |  | 
|  | ``word32`` | 
|  | This specifies a 32-bit field occupying 4 bytes with arbitrary byte | 
|  | alignment. These values use the same byte order as other word values in the | 
|  | AMD GPU architecture. | 
|  |  | 
|  | ``word64`` | 
|  | This specifies a 64-bit field occupying 8 bytes with arbitrary byte | 
|  | alignment. These values use the same byte order as other word values in the | 
|  | AMD GPU architecture. | 
|  |  | 
|  | Following notations are used for specifying relocation calculations: | 
|  |  | 
|  | **A** | 
|  | Represents the addend used to compute the value of the relocatable field. | 
|  |  | 
|  | **G** | 
|  | Represents the offset into the global offset table at which the relocation | 
|  | entry's symbol will reside during execution. | 
|  |  | 
|  | **GOT** | 
|  | Represents the address of the global offset table. | 
|  |  | 
|  | **P** | 
|  | Represents the place (section offset for ``et_rel`` or address for ``et_dyn``) | 
|  | of the storage unit being relocated (computed using ``r_offset``). | 
|  |  | 
|  | **S** | 
|  | Represents the value of the symbol whose index resides in the relocation | 
|  | entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``. | 
|  |  | 
|  | **B** | 
|  | Represents the base address of a loaded executable or shared object which is | 
|  | the difference between the ELF address and the actual load address. Relocations | 
|  | using this are only valid in executable or shared objects. | 
|  |  | 
|  | The following relocation types are supported: | 
|  |  | 
|  | .. table:: AMDGPU ELF Relocation Records | 
|  | :name: amdgpu-elf-relocation-records-table | 
|  |  | 
|  | ========================== ======= =====  ==========  ============================== | 
|  | Relocation Type            Kind    Value  Field       Calculation | 
|  | ========================== ======= =====  ==========  ============================== | 
|  | ``R_AMDGPU_NONE``                  0      *none*      *none* | 
|  | ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF | 
|  | Dynamic | 
|  | ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32 | 
|  | Dynamic | 
|  | ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A | 
|  | Dynamic | 
|  | ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P | 
|  | ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P | 
|  | ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A | 
|  | Dynamic | 
|  | ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P | 
|  | ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF | 
|  | ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32 | 
|  | ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF | 
|  | ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32 | 
|  | *reserved*                         12 | 
|  | ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A | 
|  | ========================== ======= =====  ==========  ============================== | 
|  |  | 
|  | ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by | 
|  | the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``. | 
|  |  | 
|  | There is no current OS loader support for 32 bit programs and so | 
|  | ``R_AMDGPU_ABS32`` is not used. | 
|  |  | 
|  | .. _amdgpu-dwarf: | 
|  |  | 
|  | DWARF | 
|  | ----- | 
|  |  | 
|  | Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain | 
|  | information that maps the code object executable code and data to the source | 
|  | language constructs. It can be used by tools such as debuggers and profilers. | 
|  |  | 
|  | Address Space Mapping | 
|  | ~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | The following address space mapping is used: | 
|  |  | 
|  | .. table:: AMDGPU DWARF Address Space Mapping | 
|  | :name: amdgpu-dwarf-address-space-mapping-table | 
|  |  | 
|  | =================== ================= | 
|  | DWARF Address Space Memory Space | 
|  | =================== ================= | 
|  | 1                   Private (Scratch) | 
|  | 2                   Local (group/LDS) | 
|  | *omitted*           Global | 
|  | *omitted*           Constant | 
|  | *omitted*           Generic (Flat) | 
|  | *not supported*     Region (GDS) | 
|  | =================== ================= | 
|  |  | 
|  | See :ref:`amdgpu-address-spaces` for information on the memory space terminology | 
|  | used in the table. | 
|  |  | 
|  | An ``address_class`` attribute is generated on pointer type DIEs to specify the | 
|  | DWARF address space of the value of the pointer when it is in the *private* or | 
|  | *local* address space. Otherwise the attribute is omitted. | 
|  |  | 
|  | An ``XDEREF`` operation is generated in location list expressions for variables | 
|  | that are allocated in the *private* and *local* address space. Otherwise no | 
|  | ``XDREF`` is omitted. | 
|  |  | 
|  | Register Mapping | 
|  | ~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | *This section is WIP.* | 
|  |  | 
|  | .. TODO | 
|  | Define DWARF register enumeration. | 
|  |  | 
|  | If want to present a wavefront state then should expose vector registers as | 
|  | 64 wide (rather than per work-item view that LLVM uses). Either as separate | 
|  | registers, or a 64x4 byte single register. In either case use a new LANE op | 
|  | (akin to XDREF) to select the current lane usage in a location | 
|  | expression. This would also allow scalar register spilling to vector register | 
|  | lanes to be expressed (currently no debug information is being generated for | 
|  | spilling). If choose a wide single register approach then use LANE in | 
|  | conjunction with PIECE operation to select the dword part of the register for | 
|  | the current lane. If the separate register approach then use LANE to select | 
|  | the register. | 
|  |  | 
|  | Source Text | 
|  | ~~~~~~~~~~~ | 
|  |  | 
|  | Source text for online-compiled programs (e.g. those compiled by the OpenCL | 
|  | runtime) may be embedded into the DWARF v5 line table using the ``clang | 
|  | -gembed-source`` option, described in table :ref:`amdgpu-debug-options`. | 
|  |  | 
|  | For example: | 
|  |  | 
|  | ``-gembed-source`` | 
|  | Enable the embedded source DWARF v5 extension. | 
|  | ``-gno-embed-source`` | 
|  | Disable the embedded source DWARF v5 extension. | 
|  |  | 
|  | .. table:: AMDGPU Debug Options | 
|  | :name: amdgpu-debug-options | 
|  |  | 
|  | ==================== ================================================== | 
|  | Debug Flag           Description | 
|  | ==================== ================================================== | 
|  | -g[no-]embed-source  Enable/disable embedding source text in DWARF | 
|  | debug sections. Useful for environments where | 
|  | source cannot be written to disk, such as | 
|  | when performing online compilation. | 
|  | ==================== ================================================== | 
|  |  | 
|  | This option enables one extended content types in the DWARF v5 Line Number | 
|  | Program Header, which is used to encode embedded source. | 
|  |  | 
|  | .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types | 
|  | :name: amdgpu-dwarf-extended-content-types | 
|  |  | 
|  | ============================  ====================== | 
|  | Content Type                  Form | 
|  | ============================  ====================== | 
|  | ``DW_LNCT_LLVM_source``       ``DW_FORM_line_strp`` | 
|  | ============================  ====================== | 
|  |  | 
|  | The source field will contain the UTF-8 encoded, null-terminated source text | 
|  | with ``'\n'`` line endings. When the source field is present, consumers can use | 
|  | the embedded source instead of attempting to discover the source on disk. When | 
|  | the source field is absent, consumers can access the file to get the source | 
|  | text. | 
|  |  | 
|  | The above content type appears in the ``file_name_entry_format`` field of the | 
|  | line table prologue, and its corresponding value appear in the ``file_names`` | 
|  | field. The current encoding of the content type is documented in table | 
|  | :ref:`amdgpu-dwarf-extended-content-types-encoding` | 
|  |  | 
|  | .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding | 
|  | :name: amdgpu-dwarf-extended-content-types-encoding | 
|  |  | 
|  | ============================  ==================== | 
|  | Content Type                  Value | 
|  | ============================  ==================== | 
|  | ``DW_LNCT_LLVM_source``       0x2001 | 
|  | ============================  ==================== | 
|  |  | 
|  | .. _amdgpu-code-conventions: | 
|  |  | 
|  | Code Conventions | 
|  | ================ | 
|  |  | 
|  | This section provides code conventions used for each supported target triple OS | 
|  | (see :ref:`amdgpu-target-triples`). | 
|  |  | 
|  | AMDHSA | 
|  | ------ | 
|  |  | 
|  | This section provides code conventions used when the target triple OS is | 
|  | ``amdhsa`` (see :ref:`amdgpu-target-triples`). | 
|  |  | 
|  | .. _amdgpu-amdhsa-code-object-target-identification: | 
|  |  | 
|  | Code Object Target Identification | 
|  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | The AMDHSA OS uses the following syntax to specify the code object | 
|  | target as a single string: | 
|  |  | 
|  | ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>`` | 
|  |  | 
|  | Where: | 
|  |  | 
|  | - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>`` | 
|  | are the same as the *Target Triple* (see | 
|  | :ref:`amdgpu-target-triples`). | 
|  |  | 
|  | - ``<Processor>`` is the same as the *Processor* (see | 
|  | :ref:`amdgpu-processors`). | 
|  |  | 
|  | - ``<Target Features>`` is a list of the enabled *Target Features* | 
|  | (see :ref:`amdgpu-target-features`), each prefixed by a plus, that | 
|  | apply to *Processor*. The list must be in the same order as listed | 
|  | in the table :ref:`amdgpu-target-feature-table`. Note that *Target | 
|  | Features* must be included in the list if they are enabled even if | 
|  | that is the default for *Processor*. | 
|  |  | 
|  | For example: | 
|  |  | 
|  | ``"amdgcn-amd-amdhsa--gfx902+xnack"`` | 
|  |  | 
|  | .. _amdgpu-amdhsa-code-object-metadata: | 
|  |  | 
|  | Code Object Metadata | 
|  | ~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | The code object metadata specifies extensible metadata associated with the code | 
|  | objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm | 
|  | [AMD-ROCm]_. It is specified in a note record (see :ref:`amdgpu-note-records`) | 
|  | and is required when the target triple OS is ``amdhsa`` (see | 
|  | :ref:`amdgpu-target-triples`). It must contain the minimum information | 
|  | necessary to support the ROCM kernel queries. For example, the segment sizes | 
|  | needed in a dispatch packet. In addition, a high level language runtime may | 
|  | require other information to be included. For example, the AMD OpenCL runtime | 
|  | records kernel argument information. | 
|  |  | 
|  | .. _amdgpu-amdhsa-code-object-metadata-v2: | 
|  |  | 
|  | Code Object V2 Metadata (-mattr=-code-object-v3) | 
|  | ++++++++++++++++++++++++++++++++++++++++++++++++ | 
|  |  | 
|  | Code object V2 metadata is specified by the ``NT_AMD_AMDGPU_METADATA`` note | 
|  | record (see :ref:`amdgpu-note-records-v2`). | 
|  |  | 
|  | The metadata is specified as a YAML formatted string (see [YAML]_ and | 
|  | :doc:`YamlIO`). | 
|  |  | 
|  | .. TODO | 
|  | Is the string null terminated? It probably should not if YAML allows it to | 
|  | contain null characters, otherwise it should be. | 
|  |  | 
|  | The metadata is represented as a single YAML document comprised of the mapping | 
|  | defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v2` and | 
|  | referenced tables. | 
|  |  | 
|  | For boolean values, the string values of ``false`` and ``true`` are used for | 
|  | false and true respectively. | 
|  |  | 
|  | Additional information can be added to the mappings. To avoid conflicts, any | 
|  | non-AMD key names should be prefixed by "*vendor-name*.". | 
|  |  | 
|  | .. table:: AMDHSA Code Object V2 Metadata Map | 
|  | :name: amdgpu-amdhsa-code-object-metadata-map-table-v2 | 
|  |  | 
|  | ========== ============== ========= ======================================= | 
|  | String Key Value Type     Required? Description | 
|  | ========== ============== ========= ======================================= | 
|  | "Version"  sequence of    Required  - The first integer is the major | 
|  | 2 integers                 version. Currently 1. | 
|  | - The second integer is the minor | 
|  | version. Currently 0. | 
|  | "Printf"   sequence of              Each string is encoded information | 
|  | strings                  about a printf function call. The | 
|  | encoded information is organized as | 
|  | fields separated by colon (':'): | 
|  |  | 
|  | ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` | 
|  |  | 
|  | where: | 
|  |  | 
|  | ``ID`` | 
|  | A 32 bit integer as a unique id for | 
|  | each printf function call | 
|  |  | 
|  | ``N`` | 
|  | A 32 bit integer equal to the number | 
|  | of arguments of printf function call | 
|  | minus 1 | 
|  |  | 
|  | ``S[i]`` (where i = 0, 1, ... , N-1) | 
|  | 32 bit integers for the size in bytes | 
|  | of the i-th FormatString argument of | 
|  | the printf function call | 
|  |  | 
|  | FormatString | 
|  | The format string passed to the | 
|  | printf function call. | 
|  | "Kernels"  sequence of    Required  Sequence of the mappings for each | 
|  | mapping                  kernel in the code object. See | 
|  | :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2` | 
|  | for the definition of the mapping. | 
|  | ========== ============== ========= ======================================= | 
|  |  | 
|  | .. | 
|  |  | 
|  | .. table:: AMDHSA Code Object V2 Kernel Metadata Map | 
|  | :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2 | 
|  |  | 
|  | ================= ============== ========= ================================ | 
|  | String Key        Value Type     Required? Description | 
|  | ================= ============== ========= ================================ | 
|  | "Name"            string         Required  Source name of the kernel. | 
|  | "SymbolName"      string         Required  Name of the kernel | 
|  | descriptor ELF symbol. | 
|  | "Language"        string                   Source language of the kernel. | 
|  | Values include: | 
|  |  | 
|  | - "OpenCL C" | 
|  | - "OpenCL C++" | 
|  | - "HCC" | 
|  | - "OpenMP" | 
|  |  | 
|  | "LanguageVersion" sequence of              - The first integer is the major | 
|  | 2 integers                 version. | 
|  | - The second integer is the | 
|  | minor version. | 
|  | "Attrs"           mapping                  Mapping of kernel attributes. | 
|  | See | 
|  | :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2` | 
|  | for the mapping definition. | 
|  | "Args"            sequence of              Sequence of mappings of the | 
|  | mapping                  kernel arguments. See | 
|  | :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2` | 
|  | for the definition of the mapping. | 
|  | "CodeProps"       mapping                  Mapping of properties related to | 
|  | the kernel code. See | 
|  | :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2` | 
|  | for the mapping definition. | 
|  | ================= ============== ========= ================================ | 
|  |  | 
|  | .. | 
|  |  | 
|  | .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map | 
|  | :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2 | 
|  |  | 
|  | =================== ============== ========= ============================== | 
|  | String Key          Value Type     Required? Description | 
|  | =================== ============== ========= ============================== | 
|  | "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values | 
|  | 3 integers               must be >=1 and the dispatch | 
|  | work-group size X, Y, Z must | 
|  | correspond to the specified | 
|  | values. Defaults to 0, 0, 0. | 
|  |  | 
|  | Corresponds to the OpenCL | 
|  | ``reqd_work_group_size`` | 
|  | attribute. | 
|  | "WorkGroupSizeHint" sequence of              The dispatch work-group size | 
|  | 3 integers               X, Y, Z is likely to be the | 
|  | specified values. | 
|  |  | 
|  | Corresponds to the OpenCL | 
|  | ``work_group_size_hint`` | 
|  | attribute. | 
|  | "VecTypeHint"       string                   The name of a scalar or vector | 
|  | type. | 
|  |  | 
|  | Corresponds to the OpenCL | 
|  | ``vec_type_hint`` attribute. | 
|  |  | 
|  | "RuntimeHandle"     string                   The external symbol name | 
|  | associated with a kernel. | 
|  | OpenCL runtime allocates a | 
|  | global buffer for the symbol | 
|  | and saves the kernel's address | 
|  | to it, which is used for | 
|  | device side enqueueing. Only | 
|  | available for device side | 
|  | enqueued kernels. | 
|  | =================== ============== ========= ============================== | 
|  |  | 
|  | .. | 
|  |  | 
|  | .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map | 
|  | :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2 | 
|  |  | 
|  | ================= ============== ========= ================================ | 
|  | String Key        Value Type     Required? Description | 
|  | ================= ============== ========= ================================ | 
|  | "Name"            string                   Kernel argument name. | 
|  | "TypeName"        string                   Kernel argument type name. | 
|  | "Size"            integer        Required  Kernel argument size in bytes. | 
|  | "Align"           integer        Required  Kernel argument alignment in | 
|  | bytes. Must be a power of two. | 
|  | "ValueKind"       string         Required  Kernel argument kind that | 
|  | specifies how to set up the | 
|  | corresponding argument. | 
|  | Values include: | 
|  |  | 
|  | "ByValue" | 
|  | The argument is copied | 
|  | directly into the kernarg. | 
|  |  | 
|  | "GlobalBuffer" | 
|  | A global address space pointer | 
|  | to the buffer data is passed | 
|  | in the kernarg. | 
|  |  | 
|  | "DynamicSharedPointer" | 
|  | A group address space pointer | 
|  | to dynamically allocated LDS | 
|  | is passed in the kernarg. | 
|  |  | 
|  | "Sampler" | 
|  | A global address space | 
|  | pointer to a S# is passed in | 
|  | the kernarg. | 
|  |  | 
|  | "Image" | 
|  | A global address space | 
|  | pointer to a T# is passed in | 
|  | the kernarg. | 
|  |  | 
|  | "Pipe" | 
|  | A global address space pointer | 
|  | to an OpenCL pipe is passed in | 
|  | the kernarg. | 
|  |  | 
|  | "Queue" | 
|  | A global address space pointer | 
|  | to an OpenCL device enqueue | 
|  | queue is passed in the | 
|  | kernarg. | 
|  |  | 
|  | "HiddenGlobalOffsetX" | 
|  | The OpenCL grid dispatch | 
|  | global offset for the X | 
|  | dimension is passed in the | 
|  | kernarg. | 
|  |  | 
|  | "HiddenGlobalOffsetY" | 
|  | The OpenCL grid dispatch | 
|  | global offset for the Y | 
|  | dimension is passed in the | 
|  | kernarg. | 
|  |  | 
|  | "HiddenGlobalOffsetZ" | 
|  | The OpenCL grid dispatch | 
|  | global offset for the Z | 
|  | dimension is passed in the | 
|  | kernarg. | 
|  |  | 
|  | "HiddenNone" | 
|  | An argument that is not used | 
|  | by the kernel. Space needs to | 
|  | be left for it, but it does | 
|  | not need to be set up. | 
|  |  | 
|  | "HiddenPrintfBuffer" | 
|  | A global address space pointer | 
|  | to the runtime printf buffer | 
|  | is passed in kernarg. | 
|  |  | 
|  | "HiddenDefaultQueue" | 
|  | A global address space pointer | 
|  | to the OpenCL device enqueue | 
|  | queue that should be used by | 
|  | the kernel by default is | 
|  | passed in the kernarg. | 
|  |  | 
|  | "HiddenCompletionAction" | 
|  | A global address space pointer | 
|  | to help link enqueued kernels into | 
|  | the ancestor tree for determining | 
|  | when the parent kernel has finished. | 
|  |  | 
|  | "ValueType"       string         Required  Kernel argument value type. Only | 
|  | present if "ValueKind" is | 
|  | "ByValue". For vector data | 
|  | types, the value is for the | 
|  | element type. Values include: | 
|  |  | 
|  | - "Struct" | 
|  | - "I8" | 
|  | - "U8" | 
|  | - "I16" | 
|  | - "U16" | 
|  | - "F16" | 
|  | - "I32" | 
|  | - "U32" | 
|  | - "F32" | 
|  | - "I64" | 
|  | - "U64" | 
|  | - "F64" | 
|  |  | 
|  | .. TODO | 
|  | How can it be determined if a | 
|  | vector type, and what size | 
|  | vector? | 
|  | "PointeeAlign"    integer                  Alignment in bytes of pointee | 
|  | type for pointer type kernel | 
|  | argument. Must be a power | 
|  | of 2. Only present if | 
|  | "ValueKind" is | 
|  | "DynamicSharedPointer". | 
|  | "AddrSpaceQual"   string                   Kernel argument address space | 
|  | qualifier. Only present if | 
|  | "ValueKind" is "GlobalBuffer" or | 
|  | "DynamicSharedPointer". Values | 
|  | are: | 
|  |  | 
|  | - "Private" | 
|  | - "Global" | 
|  | - "Constant" | 
|  | - "Local" | 
|  | - "Generic" | 
|  | - "Region" | 
|  |  | 
|  | .. TODO | 
|  | Is GlobalBuffer only Global | 
|  | or Constant? Is | 
|  | DynamicSharedPointer always | 
|  | Local? Can HCC allow Generic? | 
|  | How can Private or Region | 
|  | ever happen? | 
|  | "AccQual"         string                   Kernel argument access | 
|  | qualifier. Only present if | 
|  | "ValueKind" is "Image" or | 
|  | "Pipe". Values | 
|  | are: | 
|  |  | 
|  | - "ReadOnly" | 
|  | - "WriteOnly" | 
|  | - "ReadWrite" | 
|  |  | 
|  | .. TODO | 
|  | Does this apply to | 
|  | GlobalBuffer? | 
|  | "ActualAccQual"   string                   The actual memory accesses | 
|  | performed by the kernel on the | 
|  | kernel argument. Only present if | 
|  | "ValueKind" is "GlobalBuffer", | 
|  | "Image", or "Pipe". This may be | 
|  | more restrictive than indicated | 
|  | by "AccQual" to reflect what the | 
|  | kernel actual does. If not | 
|  | present then the runtime must | 
|  | assume what is implied by | 
|  | "AccQual" and "IsConst". Values | 
|  | are: | 
|  |  | 
|  | - "ReadOnly" | 
|  | - "WriteOnly" | 
|  | - "ReadWrite" | 
|  |  | 
|  | "IsConst"         boolean                  Indicates if the kernel argument | 
|  | is const qualified. Only present | 
|  | if "ValueKind" is | 
|  | "GlobalBuffer". | 
|  |  | 
|  | "IsRestrict"      boolean                  Indicates if the kernel argument | 
|  | is restrict qualified. Only | 
|  | present if "ValueKind" is | 
|  | "GlobalBuffer". | 
|  |  | 
|  | "IsVolatile"      boolean                  Indicates if the kernel argument | 
|  | is volatile qualified. Only | 
|  | present if "ValueKind" is | 
|  | "GlobalBuffer". | 
|  |  | 
|  | "IsPipe"          boolean                  Indicates if the kernel argument | 
|  | is pipe qualified. Only present | 
|  | if "ValueKind" is "Pipe". | 
|  |  | 
|  | .. TODO | 
|  | Can GlobalBuffer be pipe | 
|  | qualified? | 
|  | ================= ============== ========= ================================ | 
|  |  | 
|  | .. | 
|  |  | 
|  | .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map | 
|  | :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2 | 
|  |  | 
|  | ============================ ============== ========= ===================== | 
|  | String Key                   Value Type     Required? Description | 
|  | ============================ ============== ========= ===================== | 
|  | "KernargSegmentSize"         integer        Required  The size in bytes of | 
|  | the kernarg segment | 
|  | that holds the values | 
|  | of the arguments to | 
|  | the kernel. | 
|  | "GroupSegmentFixedSize"      integer        Required  The amount of group | 
|  | segment memory | 
|  | required by a | 
|  | work-group in | 
|  | bytes. This does not | 
|  | include any | 
|  | dynamically allocated | 
|  | group segment memory | 
|  | that may be added | 
|  | when the kernel is | 
|  | dispatched. | 
|  | "PrivateSegmentFixedSize"    integer        Required  The amount of fixed | 
|  | private address space | 
|  | memory required for a | 
|  | work-item in | 
|  | bytes. If the kernel | 
|  | uses a dynamic call | 
|  | stack then additional | 
|  | space must be added | 
|  | to this value for the | 
|  | call stack. | 
|  | "KernargSegmentAlign"        integer        Required  The maximum byte | 
|  | alignment of | 
|  | arguments in the | 
|  | kernarg segment. Must | 
|  | be a power of 2. | 
|  | "WavefrontSize"              integer        Required  Wavefront size. Must | 
|  | be a power of 2. | 
|  | "NumSGPRs"                   integer        Required  Number of scalar | 
|  | registers used by a | 
|  | wavefront for | 
|  | GFX6-GFX9. This | 
|  | includes the special | 
|  | SGPRs for VCC, Flat | 
|  | Scratch (GFX7-GFX9) | 
|  | and XNACK (for | 
|  | GFX8-GFX9). It does | 
|  | not include the 16 | 
|  | SGPR added if a trap | 
|  | handler is | 
|  | enabled. It is not | 
|  | rounded up to the | 
|  | allocation | 
|  | granularity. | 
|  | "NumVGPRs"                   integer        Required  Number of vector | 
|  | registers used by | 
|  | each work-item for | 
|  | GFX6-GFX9 | 
|  | "MaxFlatWorkGroupSize"       integer        Required  Maximum flat | 
|  | work-group size | 
|  | supported by the | 
|  | kernel in work-items. | 
|  | Must be >=1 and | 
|  | consistent with | 
|  | ReqdWorkGroupSize if | 
|  | not 0, 0, 0. | 
|  | "NumSpilledSGPRs"            integer                  Number of stores from | 
|  | a scalar register to | 
|  | a register allocator | 
|  | created spill | 
|  | location. | 
|  | "NumSpilledVGPRs"            integer                  Number of stores from | 
|  | a vector register to | 
|  | a register allocator | 
|  | created spill | 
|  | location. | 
|  | ============================ ============== ========= ===================== | 
|  |  | 
|  | .. _amdgpu-amdhsa-code-object-metadata-v3: | 
|  |  | 
|  | Code Object V3 Metadata (-mattr=+code-object-v3) | 
|  | ++++++++++++++++++++++++++++++++++++++++++++++++ | 
|  |  | 
|  | Code object V3 metadata is specified by the ``NT_AMDGPU_METADATA`` note record | 
|  | (see :ref:`amdgpu-note-records-v3`). | 
|  |  | 
|  | The metadata is represented as Message Pack formatted binary data (see | 
|  | [MsgPack]_). The top level is a Message Pack map that includes the | 
|  | keys defined in table | 
|  | :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced | 
|  | tables. | 
|  |  | 
|  | Additional information can be added to the maps. To avoid conflicts, | 
|  | any key names should be prefixed by "*vendor-name*." where | 
|  | ``vendor-name`` can be the the name of the vendor and specific vendor | 
|  | tool that generates the information. The prefix is abbreviated to | 
|  | simply "." when it appears within a map that has been added by the | 
|  | same *vendor-name*. | 
|  |  | 
|  | .. table:: AMDHSA Code Object V3 Metadata Map | 
|  | :name: amdgpu-amdhsa-code-object-metadata-map-table-v3 | 
|  |  | 
|  | ================= ============== ========= ======================================= | 
|  | String Key        Value Type     Required? Description | 
|  | ================= ============== ========= ======================================= | 
|  | "amdhsa.version"  sequence of    Required  - The first integer is the major | 
|  | 2 integers                 version. Currently 1. | 
|  | - The second integer is the minor | 
|  | version. Currently 0. | 
|  | "amdhsa.printf"   sequence of              Each string is encoded information | 
|  | strings                  about a printf function call. The | 
|  | encoded information is organized as | 
|  | fields separated by colon (':'): | 
|  |  | 
|  | ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` | 
|  |  | 
|  | where: | 
|  |  | 
|  | ``ID`` | 
|  | A 32 bit integer as a unique id for | 
|  | each printf function call | 
|  |  | 
|  | ``N`` | 
|  | A 32 bit integer equal to the number | 
|  | of arguments of printf function call | 
|  | minus 1 | 
|  |  | 
|  | ``S[i]`` (where i = 0, 1, ... , N-1) | 
|  | 32 bit integers for the size in bytes | 
|  | of the i-th FormatString argument of | 
|  | the printf function call | 
|  |  | 
|  | FormatString | 
|  | The format string passed to the | 
|  | printf function call. | 
|  | "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each | 
|  | map                      kernel in the code object. See | 
|  | :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3` | 
|  | for the definition of the keys included | 
|  | in that map. | 
|  | ================= ============== ========= ======================================= | 
|  |  | 
|  | .. | 
|  |  | 
|  | .. table:: AMDHSA Code Object V3 Kernel Metadata Map | 
|  | :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3 | 
|  |  | 
|  | =================================== ============== ========= ================================ | 
|  | String Key                          Value Type     Required? Description | 
|  | =================================== ============== ========= ================================ | 
|  | ".name"                             string         Required  Source name of the kernel. | 
|  | ".symbol"                           string         Required  Name of the kernel | 
|  | descriptor ELF symbol. | 
|  | ".language"                         string                   Source language of the kernel. | 
|  | Values include: | 
|  |  | 
|  | - "OpenCL C" | 
|  | - "OpenCL C++" | 
|  | - "HCC" | 
|  | - "HIP" | 
|  | - "OpenMP" | 
|  | - "Assembler" | 
|  |  | 
|  | ".language_version"                 sequence of              - The first integer is the major | 
|  | 2 integers                 version. | 
|  | - The second integer is the | 
|  | minor version. | 
|  | ".args"                             sequence of              Sequence of maps of the | 
|  | map                      kernel arguments. See | 
|  | :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3` | 
|  | for the definition of the keys | 
|  | included in that map. | 
|  | ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values | 
|  | 3 integers               must be >=1 and the dispatch | 
|  | work-group size X, Y, Z must | 
|  | correspond to the specified | 
|  | values. Defaults to 0, 0, 0. | 
|  |  | 
|  | Corresponds to the OpenCL | 
|  | ``reqd_work_group_size`` | 
|  | attribute. | 
|  | ".workgroup_size_hint"              sequence of              The dispatch work-group size | 
|  | 3 integers               X, Y, Z is likely to be the | 
|  | specified values. | 
|  |  | 
|  | Corresponds to the OpenCL | 
|  | ``work_group_size_hint`` | 
|  | attribute. | 
|  | ".vec_type_hint"                    string                   The name of a scalar or vector | 
|  | type. | 
|  |  | 
|  | Corresponds to the OpenCL | 
|  | ``vec_type_hint`` attribute. | 
|  |  | 
|  | ".device_enqueue_symbol"            string                   The external symbol name | 
|  | associated with a kernel. | 
|  | OpenCL runtime allocates a | 
|  | global buffer for the symbol | 
|  | and saves the kernel's address | 
|  | to it, which is used for | 
|  | device side enqueueing. Only | 
|  | available for device side | 
|  | enqueued kernels. | 
|  | ".kernarg_segment_size"             integer        Required  The size in bytes of | 
|  | the kernarg segment | 
|  | that holds the values | 
|  | of the arguments to | 
|  | the kernel. | 
|  | ".group_segment_fixed_size"         integer        Required  The amount of group | 
|  | segment memory | 
|  | required by a | 
|  | work-group in | 
|  | bytes. This does not | 
|  | include any | 
|  | dynamically allocated | 
|  | group segment memory | 
|  | that may be added | 
|  | when the kernel is | 
|  | dispatched. | 
|  | ".private_segment_fixed_size"       integer        Required  The amount of fixed | 
|  | private address space | 
|  | memory required for a | 
|  | work-item in | 
|  | bytes. If the kernel | 
|  | uses a dynamic call | 
|  | stack then additional | 
|  | space must be added | 
|  | to this value for the | 
|  | call stack. | 
|  | ".kernarg_segment_align"            integer        Required  The maximum byte | 
|  | alignment of | 
|  | arguments in the | 
|  | kernarg segment. Must | 
|  | be a power of 2. | 
|  | ".wavefront_size"                   integer        Required  Wavefront size. Must | 
|  | be a power of 2. | 
|  | ".sgpr_count"                       integer        Required  Number of scalar | 
|  | registers required by a | 
|  | wavefront for | 
|  | GFX6-GFX9. A register | 
|  | is required if it is | 
|  | used explicitly, or | 
|  | if a higher numbered | 
|  | register is used | 
|  | explicitly. This | 
|  | includes the special | 
|  | SGPRs for VCC, Flat | 
|  | Scratch (GFX7-GFX9) | 
|  | and XNACK (for | 
|  | GFX8-GFX9). It does | 
|  | not include the 16 | 
|  | SGPR added if a trap | 
|  | handler is | 
|  | enabled. It is not | 
|  | rounded up to the | 
|  | allocation | 
|  | granularity. | 
|  | ".vgpr_count"                       integer        Required  Number of vector | 
|  | registers required by | 
|  | each work-item for | 
|  | GFX6-GFX9. A register | 
|  | is required if it is | 
|  | used explicitly, or | 
|  | if a higher numbered | 
|  | register is used | 
|  | explicitly. | 
|  | ".max_flat_workgroup_size"          integer        Required  Maximum flat | 
|  | work-group size | 
|  | supported by the | 
|  | kernel in work-items. | 
|  | Must be >=1 and | 
|  | consistent with | 
|  | ReqdWorkGroupSize if | 
|  | not 0, 0, 0. | 
|  | ".sgpr_spill_count"                 integer                  Number of stores from | 
|  | a scalar register to | 
|  | a register allocator | 
|  | created spill | 
|  | location. | 
|  | ".vgpr_spill_count"                 integer                  Number of stores from | 
|  | a vector register to | 
|  | a register allocator | 
|  | created spill | 
|  | location. | 
|  | =================================== ============== ========= ================================ | 
|  |  | 
|  | .. | 
|  |  | 
|  | .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map | 
|  | :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3 | 
|  |  | 
|  | ====================== ============== ========= ================================ | 
|  | String Key             Value Type     Required? Description | 
|  | ====================== ============== ========= ================================ | 
|  | ".name"                string                   Kernel argument name. | 
|  | ".type_name"           string                   Kernel argument type name. | 
|  | ".size"                integer        Required  Kernel argument size in bytes. | 
|  | ".offset"              integer        Required  Kernel argument offset in | 
|  | bytes. The offset must be a | 
|  | multiple of the alignment | 
|  | required by the argument. | 
|  | ".value_kind"          string         Required  Kernel argument kind that | 
|  | specifies how to set up the | 
|  | corresponding argument. | 
|  | Values include: | 
|  |  | 
|  | "by_value" | 
|  | The argument is copied | 
|  | directly into the kernarg. | 
|  |  | 
|  | "global_buffer" | 
|  | A global address space pointer | 
|  | to the buffer data is passed | 
|  | in the kernarg. | 
|  |  | 
|  | "dynamic_shared_pointer" | 
|  | A group address space pointer | 
|  | to dynamically allocated LDS | 
|  | is passed in the kernarg. | 
|  |  | 
|  | "sampler" | 
|  | A global address space | 
|  | pointer to a S# is passed in | 
|  | the kernarg. | 
|  |  | 
|  | "image" | 
|  | A global address space | 
|  | pointer to a T# is passed in | 
|  | the kernarg. | 
|  |  | 
|  | "pipe" | 
|  | A global address space pointer | 
|  | to an OpenCL pipe is passed in | 
|  | the kernarg. | 
|  |  | 
|  | "queue" | 
|  | A global address space pointer | 
|  | to an OpenCL device enqueue | 
|  | queue is passed in the | 
|  | kernarg. | 
|  |  | 
|  | "hidden_global_offset_x" | 
|  | The OpenCL grid dispatch | 
|  | global offset for the X | 
|  | dimension is passed in the | 
|  | kernarg. | 
|  |  | 
|  | "hidden_global_offset_y" | 
|  | The OpenCL grid dispatch | 
|  | global offset for the Y | 
|  | dimension is passed in the | 
|  | kernarg. | 
|  |  | 
|  | "hidden_global_offset_z" | 
|  | The OpenCL grid dispatch | 
|  | global offset for the Z | 
|  | dimension is passed in the | 
|  | kernarg. | 
|  |  | 
|  | "hidden_none" | 
|  | An argument that is not used | 
|  | by the kernel. Space needs to | 
|  | be left for it, but it does | 
|  | not need to be set up. | 
|  |  | 
|  | "hidden_printf_buffer" | 
|  | A global address space pointer | 
|  | to the runtime printf buffer | 
|  | is passed in kernarg. | 
|  |  | 
|  | "hidden_default_queue" | 
|  | A global address space pointer | 
|  | to the OpenCL device enqueue | 
|  | queue that should be used by | 
|  | the kernel by default is | 
|  | passed in the kernarg. | 
|  |  | 
|  | "hidden_completion_action" | 
|  | A global address space pointer | 
|  | to help link enqueued kernels into | 
|  | the ancestor tree for determining | 
|  | when the parent kernel has finished. | 
|  |  | 
|  | ".value_type"          string         Required  Kernel argument value type. Only | 
|  | present if ".value_kind" is | 
|  | "by_value". For vector data | 
|  | types, the value is for the | 
|  | element type. Values include: | 
|  |  | 
|  | - "struct" | 
|  | - "i8" | 
|  | - "u8" | 
|  | - "i16" | 
|  | - "u16" | 
|  | - "f16" | 
|  | - "i32" | 
|  | - "u32" | 
|  | - "f32" | 
|  | - "i64" | 
|  | - "u64" | 
|  | - "f64" | 
|  |  | 
|  | .. TODO | 
|  | How can it be determined if a | 
|  | vector type, and what size | 
|  | vector? | 
|  | ".pointee_align"       integer                  Alignment in bytes of pointee | 
|  | type for pointer type kernel | 
|  | argument. Must be a power | 
|  | of 2. Only present if | 
|  | ".value_kind" is | 
|  | "dynamic_shared_pointer". | 
|  | ".address_space"       string                   Kernel argument address space | 
|  | qualifier. Only present if | 
|  | ".value_kind" is "global_buffer" or | 
|  | "dynamic_shared_pointer". Values | 
|  | are: | 
|  |  | 
|  | - "private" | 
|  | - "global" | 
|  | - "constant" | 
|  | - "local" | 
|  | - "generic" | 
|  | - "region" | 
|  |  | 
|  | .. TODO | 
|  | Is "global_buffer" only "global" | 
|  | or "constant"? Is | 
|  | "dynamic_shared_pointer" always | 
|  | "local"? Can HCC allow "generic"? | 
|  | How can "private" or "region" | 
|  | ever happen? | 
|  | ".access"              string                   Kernel argument access | 
|  | qualifier. Only present if | 
|  | ".value_kind" is "image" or | 
|  | "pipe". Values | 
|  | are: | 
|  |  | 
|  | - "read_only" | 
|  | - "write_only" | 
|  | - "read_write" | 
|  |  | 
|  | .. TODO | 
|  | Does this apply to | 
|  | "global_buffer"? | 
|  | ".actual_access"       string                   The actual memory accesses | 
|  | performed by the kernel on the | 
|  | kernel argument. Only present if | 
|  | ".value_kind" is "global_buffer", | 
|  | "image", or "pipe". This may be | 
|  | more restrictive than indicated | 
|  | by ".access" to reflect what the | 
|  | kernel actual does. If not | 
|  | present then the runtime must | 
|  | assume what is implied by | 
|  | ".access" and ".is_const"      . Values | 
|  | are: | 
|  |  | 
|  | - "read_only" | 
|  | - "write_only" | 
|  | - "read_write" | 
|  |  | 
|  | ".is_const"            boolean                  Indicates if the kernel argument | 
|  | is const qualified. Only present | 
|  | if ".value_kind" is | 
|  | "global_buffer". | 
|  |  | 
|  | ".is_restrict"         boolean                  Indicates if the kernel argument | 
|  | is restrict qualified. Only | 
|  | present if ".value_kind" is | 
|  | "global_buffer". | 
|  |  | 
|  | ".is_volatile"         boolean                  Indicates if the kernel argument | 
|  | is volatile qualified. Only | 
|  | present if ".value_kind" is | 
|  | "global_buffer". | 
|  |  | 
|  | ".is_pipe"             boolean                  Indicates if the kernel argument | 
|  | is pipe qualified. Only present | 
|  | if ".value_kind" is "pipe". | 
|  |  | 
|  | .. TODO | 
|  | Can "global_buffer" be pipe | 
|  | qualified? | 
|  | ====================== ============== ========= ================================ | 
|  |  | 
|  | .. | 
|  |  | 
|  | Kernel Dispatch | 
|  | ~~~~~~~~~~~~~~~ | 
|  |  | 
|  | The HSA architected queuing language (AQL) defines a user space memory interface | 
|  | that can be used to control the dispatch of kernels, in an agent independent | 
|  | way. An agent can have zero or more AQL queues created for it using the ROCm | 
|  | runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the | 
|  | *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue | 
|  | mechanics and packet layouts. | 
|  |  | 
|  | The packet processor of a kernel agent is responsible for detecting and | 
|  | dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the | 
|  | packet processor is implemented by the hardware command processor (CP), | 
|  | asynchronous dispatch controller (ADC) and shader processor input controller | 
|  | (SPI). | 
|  |  | 
|  | The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel | 
|  | mode driver to initialize and register the AQL queue with CP. | 
|  |  | 
|  | To dispatch a kernel the following actions are performed. This can occur in the | 
|  | CPU host program, or from an HSA kernel executing on a GPU. | 
|  |  | 
|  | 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be | 
|  | executed is obtained. | 
|  | 2. A pointer to the kernel descriptor (see | 
|  | :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is | 
|  | obtained. It must be for a kernel that is contained in a code object that that | 
|  | was loaded by the ROCm runtime on the kernel agent with which the AQL queue is | 
|  | associated. | 
|  | 3. Space is allocated for the kernel arguments using the ROCm runtime allocator | 
|  | for a memory region with the kernarg property for the kernel agent that will | 
|  | execute the kernel. It must be at least 16 byte aligned. | 
|  | 4. Kernel argument values are assigned to the kernel argument memory | 
|  | allocation. The layout is defined in the *HSA Programmer's Language Reference* | 
|  | [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument | 
|  | memory in the same way constant memory is accessed. (Note that the HSA | 
|  | specification allows an implementation to copy the kernel argument contents to | 
|  | another location that is accessed by the kernel.) | 
|  | 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime | 
|  | api uses 64 bit atomic operations to reserve space in the AQL queue for the | 
|  | packet. The packet must be set up, and the final write must use an atomic | 
|  | store release to set the packet kind to ensure the packet contents are | 
|  | visible to the kernel agent. AQL defines a doorbell signal mechanism to | 
|  | notify the kernel agent that the AQL queue has been updated. These rules, and | 
|  | the layout of the AQL queue and kernel dispatch packet is defined in the *HSA | 
|  | System Architecture Specification* [HSA]_. | 
|  | 6. A kernel dispatch packet includes information about the actual dispatch, | 
|  | such as grid and work-group size, together with information from the code | 
|  | object about the kernel, such as segment sizes. The ROCm runtime queries on | 
|  | the kernel symbol can be used to obtain the code object values which are | 
|  | recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`. | 
|  | 7. CP executes micro-code and is responsible for detecting and setting up the | 
|  | GPU to execute the wavefronts of a kernel dispatch. | 
|  | 8. CP ensures that when the a wavefront starts executing the kernel machine | 
|  | code, the scalar general purpose registers (SGPR) and vector general purpose | 
|  | registers (VGPR) are set up as required by the machine code. The required | 
|  | setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial | 
|  | register state is defined in | 
|  | :ref:`amdgpu-amdhsa-initial-kernel-execution-state`. | 
|  | 9. The prolog of the kernel machine code (see | 
|  | :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary | 
|  | before continuing executing the machine code that corresponds to the kernel. | 
|  | 10. When the kernel dispatch has completed execution, CP signals the completion | 
|  | signal specified in the kernel dispatch packet if not 0. | 
|  |  | 
|  | .. _amdgpu-amdhsa-memory-spaces: | 
|  |  | 
|  | Memory Spaces | 
|  | ~~~~~~~~~~~~~ | 
|  |  | 
|  | The memory space properties are: | 
|  |  | 
|  | .. table:: AMDHSA Memory Spaces | 
|  | :name: amdgpu-amdhsa-memory-spaces-table | 
|  |  | 
|  | ================= =========== ======== ======= ================== | 
|  | Memory Space Name HSA Segment Hardware Address NULL Value | 
|  | Name        Name     Size | 
|  | ================= =========== ======== ======= ================== | 
|  | Private           private     scratch  32      0x00000000 | 
|  | Local             group       LDS      32      0xFFFFFFFF | 
|  | Global            global      global   64      0x0000000000000000 | 
|  | Constant          constant    *same as 64      0x0000000000000000 | 
|  | global* | 
|  | Generic           flat        flat     64      0x0000000000000000 | 
|  | Region            N/A         GDS      32      *not implemented | 
|  | for AMDHSA* | 
|  | ================= =========== ======== ======= ================== | 
|  |  | 
|  | The global and constant memory spaces both use global virtual addresses, which | 
|  | are the same virtual address space used by the CPU. However, some virtual | 
|  | addresses may only be accessible to the CPU, some only accessible by the GPU, | 
|  | and some by both. | 
|  |  | 
|  | Using the constant memory space indicates that the data will not change during | 
|  | the execution of the kernel. This allows scalar read instructions to be | 
|  | used. The vector and scalar L1 caches are invalidated of volatile data before | 
|  | each kernel dispatch execution to allow constant memory to change values between | 
|  | kernel dispatches. | 
|  |  | 
|  | The local memory space uses the hardware Local Data Store (LDS) which is | 
|  | automatically allocated when the hardware creates work-groups of wavefronts, and | 
|  | freed when all the wavefronts of a work-group have terminated. The data store | 
|  | (DS) instructions can be used to access it. | 
|  |  | 
|  | The private memory space uses the hardware scratch memory support. If the kernel | 
|  | uses scratch, then the hardware allocates memory that is accessed using | 
|  | wavefront lane dword (4 byte) interleaving. The mapping used from private | 
|  | address to physical address is: | 
|  |  | 
|  | ``wavefront-scratch-base + | 
|  | (private-address * wavefront-size * 4) + | 
|  | (wavefront-lane-id * 4)`` | 
|  |  | 
|  | There are different ways that the wavefront scratch base address is determined | 
|  | by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This | 
|  | memory can be accessed in an interleaved manner using buffer instruction with | 
|  | the scratch buffer descriptor and per wavefront scratch offset, by the scratch | 
|  | instructions, or by flat instructions. If each lane of a wavefront accesses the | 
|  | same private address, the interleaving results in adjacent dwords being accessed | 
|  | and hence requires fewer cache lines to be fetched. Multi-dword access is not | 
|  | supported except by flat and scratch instructions in GFX9. | 
|  |  | 
|  | The generic address space uses the hardware flat address support available in | 
|  | GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and | 
|  | local appertures), that are outside the range of addressible global memory, to | 
|  | map from a flat address to a private or local address. | 
|  |  | 
|  | FLAT instructions can take a flat address and access global, private (scratch) | 
|  | and group (LDS) memory depending in if the address is within one of the | 
|  | apperture ranges. Flat access to scratch requires hardware aperture setup and | 
|  | setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat | 
|  | access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup | 
|  | (see :ref:`amdgpu-amdhsa-m0`). | 
|  |  | 
|  | To convert between a segment address and a flat address the base address of the | 
|  | appertures address can be used. For GFX7-GFX8 these are available in the | 
|  | :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with | 
|  | Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For | 
|  | GFX9 the appature base addresses are directly available as inline constant | 
|  | registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit | 
|  | address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32 | 
|  | which makes it easier to convert from flat to segment or segment to flat. | 
|  |  | 
|  | Image and Samplers | 
|  | ~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | Image and sample handles created by the ROCm runtime are 64 bit addresses of a | 
|  | hardware 32 byte V# and 48 byte S# object respectively. In order to support the | 
|  | HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG | 
|  | enumeration values for the queries that are not trivially deducible from the S# | 
|  | representation. | 
|  |  | 
|  | HSA Signals | 
|  | ~~~~~~~~~~~ | 
|  |  | 
|  | HSA signal handles created by the ROCm runtime are 64 bit addresses of a | 
|  | structure allocated in memory accessible from both the CPU and GPU. The | 
|  | structure is defined by the ROCm runtime and subject to change between releases | 
|  | (see [AMD-ROCm-github]_). | 
|  |  | 
|  | .. _amdgpu-amdhsa-hsa-aql-queue: | 
|  |  | 
|  | HSA AQL Queue | 
|  | ~~~~~~~~~~~~~ | 
|  |  | 
|  | The HSA AQL queue structure is defined by the ROCm runtime and subject to change | 
|  | between releases (see [AMD-ROCm-github]_). For some processors it contains | 
|  | fields needed to implement certain language features such as the flat address | 
|  | aperture bases. It also contains fields used by CP such as managing the | 
|  | allocation of scratch memory. | 
|  |  | 
|  | .. _amdgpu-amdhsa-kernel-descriptor: | 
|  |  | 
|  | Kernel Descriptor | 
|  | ~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | A kernel descriptor consists of the information needed by CP to initiate the | 
|  | execution of a kernel, including the entry point address of the machine code | 
|  | that implements the kernel. | 
|  |  | 
|  | Kernel Descriptor for GFX6-GFX9 | 
|  | +++++++++++++++++++++++++++++++ | 
|  |  | 
|  | CP microcode requires the Kernel descriptor to be allocated on 64 byte | 
|  | alignment. | 
|  |  | 
|  | .. table:: Kernel Descriptor for GFX6-GFX9 | 
|  | :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table | 
|  |  | 
|  | ======= ======= =============================== ============================ | 
|  | Bits    Size    Field Name                      Description | 
|  | ======= ======= =============================== ============================ | 
|  | 31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local | 
|  | address space memory | 
|  | required for a work-group | 
|  | in bytes. This does not | 
|  | include any dynamically | 
|  | allocated local address | 
|  | space memory that may be | 
|  | added when the kernel is | 
|  | dispatched. | 
|  | 63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed | 
|  | private address space | 
|  | memory required for a | 
|  | work-item in bytes. If | 
|  | is_dynamic_callstack is 1 | 
|  | then additional space must | 
|  | be added to this value for | 
|  | the call stack. | 
|  | 127:64  8 bytes                                 Reserved, must be 0. | 
|  | 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly | 
|  | negative) from base | 
|  | address of kernel | 
|  | descriptor to kernel's | 
|  | entry point instruction | 
|  | which must be 256 byte | 
|  | aligned. | 
|  | 383:192 24                                      Reserved, must be 0. | 
|  | bytes | 
|  | 415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS) | 
|  | program settings used by | 
|  | CP to set up | 
|  | ``COMPUTE_PGM_RSRC1`` | 
|  | configuration | 
|  | register. See | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. | 
|  | 447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS) | 
|  | program settings used by | 
|  | CP to set up | 
|  | ``COMPUTE_PGM_RSRC2`` | 
|  | configuration | 
|  | register. See | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. | 
|  | 448     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the | 
|  | _BUFFER                         SGPR user data registers | 
|  | (see | 
|  | :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). | 
|  |  | 
|  | The total number of SGPR | 
|  | user data registers | 
|  | requested must not exceed | 
|  | 16 and match value in | 
|  | ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``. | 
|  | Any requests beyond 16 | 
|  | will be ignored. | 
|  | 449     1 bit   ENABLE_SGPR_DISPATCH_PTR        *see above* | 
|  | 450     1 bit   ENABLE_SGPR_QUEUE_PTR           *see above* | 
|  | 451     1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above* | 
|  | 452     1 bit   ENABLE_SGPR_DISPATCH_ID         *see above* | 
|  | 453     1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   *see above* | 
|  | 454     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     *see above* | 
|  | _SIZE | 
|  | 455     1 bit                                   Reserved, must be 0. | 
|  | 511:456 8 bytes                                 Reserved, must be 0. | 
|  | 512     **Total size 64 bytes.** | 
|  | ======= ==================================================================== | 
|  |  | 
|  | .. | 
|  |  | 
|  | .. table:: compute_pgm_rsrc1 for GFX6-GFX9 | 
|  | :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table | 
|  |  | 
|  | ======= ======= =============================== =========================================================================== | 
|  | Bits    Size    Field Name                      Description | 
|  | ======= ======= =============================== =========================================================================== | 
|  | 5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register | 
|  | blocks used by each work-item; | 
|  | granularity is device | 
|  | specific: | 
|  |  | 
|  | GFX6-GFX9 | 
|  | - vgprs_used 0..256 | 
|  | - max(0, ceil(vgprs_used / 4) - 1) | 
|  |  | 
|  | Where vgprs_used is defined | 
|  | as the highest VGPR number | 
|  | explicitly referenced plus | 
|  | one. | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC1.VGPRS``. | 
|  |  | 
|  | The | 
|  | :ref:`amdgpu-assembler` | 
|  | calculates this | 
|  | automatically for the | 
|  | selected processor from | 
|  | values provided to the | 
|  | `.amdhsa_kernel` directive | 
|  | by the | 
|  | `.amdhsa_next_free_vgpr` | 
|  | nested directive (see | 
|  | :ref:`amdhsa-kernel-directives-table`). | 
|  | 9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register | 
|  | blocks used by a wavefront; | 
|  | granularity is device | 
|  | specific: | 
|  |  | 
|  | GFX6-GFX8 | 
|  | - sgprs_used 0..112 | 
|  | - max(0, ceil(sgprs_used / 8) - 1) | 
|  | GFX9 | 
|  | - sgprs_used 0..112 | 
|  | - 2 * max(0, ceil(sgprs_used / 16) - 1) | 
|  |  | 
|  | Where sgprs_used is | 
|  | defined as the highest | 
|  | SGPR number explicitly | 
|  | referenced plus one, plus | 
|  | a target-specific number | 
|  | of additional special | 
|  | SGPRs for VCC, | 
|  | FLAT_SCRATCH (GFX7+) and | 
|  | XNACK_MASK (GFX8+), and | 
|  | any additional | 
|  | target-specific | 
|  | limitations. It does not | 
|  | include the 16 SGPRs added | 
|  | if a trap handler is | 
|  | enabled. | 
|  |  | 
|  | The target-specific | 
|  | limitations and special | 
|  | SGPR layout are defined in | 
|  | the hardware | 
|  | documentation, which can | 
|  | be found in the | 
|  | :ref:`amdgpu-processors` | 
|  | table. | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC1.SGPRS``. | 
|  |  | 
|  | The | 
|  | :ref:`amdgpu-assembler` | 
|  | calculates this | 
|  | automatically for the | 
|  | selected processor from | 
|  | values provided to the | 
|  | `.amdhsa_kernel` directive | 
|  | by the | 
|  | `.amdhsa_next_free_sgpr` | 
|  | and `.amdhsa_reserve_*` | 
|  | nested directives (see | 
|  | :ref:`amdhsa-kernel-directives-table`). | 
|  | 11:10   2 bits  PRIORITY                        Must be 0. | 
|  |  | 
|  | Start executing wavefront | 
|  | at the specified priority. | 
|  |  | 
|  | CP is responsible for | 
|  | filling in | 
|  | ``COMPUTE_PGM_RSRC1.PRIORITY``. | 
|  | 13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution | 
|  | with specified rounding | 
|  | mode for single (32 | 
|  | bit) floating point | 
|  | precision floating point | 
|  | operations. | 
|  |  | 
|  | Floating point rounding | 
|  | mode values are defined in | 
|  | :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. | 
|  | 15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution | 
|  | with specified rounding | 
|  | denorm mode for half/double (16 | 
|  | and 64 bit) floating point | 
|  | precision floating point | 
|  | operations. | 
|  |  | 
|  | Floating point rounding | 
|  | mode values are defined in | 
|  | :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. | 
|  | 17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution | 
|  | with specified denorm mode | 
|  | for single (32 | 
|  | bit)  floating point | 
|  | precision floating point | 
|  | operations. | 
|  |  | 
|  | Floating point denorm mode | 
|  | values are defined in | 
|  | :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. | 
|  | 19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution | 
|  | with specified denorm mode | 
|  | for half/double (16 | 
|  | and 64 bit) floating point | 
|  | precision floating point | 
|  | operations. | 
|  |  | 
|  | Floating point denorm mode | 
|  | values are defined in | 
|  | :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. | 
|  | 20      1 bit   PRIV                            Must be 0. | 
|  |  | 
|  | Start executing wavefront | 
|  | in privilege trap handler | 
|  | mode. | 
|  |  | 
|  | CP is responsible for | 
|  | filling in | 
|  | ``COMPUTE_PGM_RSRC1.PRIV``. | 
|  | 21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution | 
|  | with DX10 clamp mode | 
|  | enabled. Used by the vector | 
|  | ALU to force DX10 style | 
|  | treatment of NaN's (when | 
|  | set, clamp NaN to zero, | 
|  | otherwise pass NaN | 
|  | through). | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC1.DX10_CLAMP``. | 
|  | 22      1 bit   DEBUG_MODE                      Must be 0. | 
|  |  | 
|  | Start executing wavefront | 
|  | in single step mode. | 
|  |  | 
|  | CP is responsible for | 
|  | filling in | 
|  | ``COMPUTE_PGM_RSRC1.DEBUG_MODE``. | 
|  | 23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution | 
|  | with IEEE mode | 
|  | enabled. Floating point | 
|  | opcodes that support | 
|  | exception flag gathering | 
|  | will quiet and propagate | 
|  | signaling-NaN inputs per | 
|  | IEEE 754-2008. Min_dx10 and | 
|  | max_dx10 become IEEE | 
|  | 754-2008 compliant due to | 
|  | signaling-NaN propagation | 
|  | and quieting. | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC1.IEEE_MODE``. | 
|  | 24      1 bit   BULKY                           Must be 0. | 
|  |  | 
|  | Only one work-group allowed | 
|  | to execute on a compute | 
|  | unit. | 
|  |  | 
|  | CP is responsible for | 
|  | filling in | 
|  | ``COMPUTE_PGM_RSRC1.BULKY``. | 
|  | 25      1 bit   CDBG_USER                       Must be 0. | 
|  |  | 
|  | Flag that can be used to | 
|  | control debugging code. | 
|  |  | 
|  | CP is responsible for | 
|  | filling in | 
|  | ``COMPUTE_PGM_RSRC1.CDBG_USER``. | 
|  | 26      1 bit   FP16_OVFL                       GFX6-GFX8 | 
|  | Reserved, must be 0. | 
|  | GFX9 | 
|  | Wavefront starts execution | 
|  | with specified fp16 overflow | 
|  | mode. | 
|  |  | 
|  | - If 0, fp16 overflow generates | 
|  | +/-INF values. | 
|  | - If 1, fp16 overflow that is the | 
|  | result of an +/-INF input value | 
|  | or divide by 0 produces a +/-INF, | 
|  | otherwise clamps computed | 
|  | overflow to +/-MAX_FP16 as | 
|  | appropriate. | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC1.FP16_OVFL``. | 
|  | 31:27   5 bits                                  Reserved, must be 0. | 
|  | 32      **Total size 4 bytes** | 
|  | ======= =================================================================================================================== | 
|  |  | 
|  | .. | 
|  |  | 
|  | .. table:: compute_pgm_rsrc2 for GFX6-GFX9 | 
|  | :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table | 
|  |  | 
|  | ======= ======= =============================== =========================================================================== | 
|  | Bits    Size    Field Name                      Description | 
|  | ======= ======= =============================== =========================================================================== | 
|  | 0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the | 
|  | _WAVEFRONT_OFFSET               SGPR wavefront scratch offset | 
|  | system register (see | 
|  | :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC2.SCRATCH_EN``. | 
|  | 5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR | 
|  | user data registers | 
|  | requested. This number must | 
|  | match the number of user | 
|  | data registers enabled. | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC2.USER_SGPR``. | 
|  | 6       1 bit   ENABLE_TRAP_HANDLER             Must be 0. | 
|  |  | 
|  | This bit represents | 
|  | ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``, | 
|  | which is set by the CP if | 
|  | the runtime has installed a | 
|  | trap handler. | 
|  | 7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the | 
|  | system SGPR register for | 
|  | the work-group id in the X | 
|  | dimension (see | 
|  | :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC2.TGID_X_EN``. | 
|  | 8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the | 
|  | system SGPR register for | 
|  | the work-group id in the Y | 
|  | dimension (see | 
|  | :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC2.TGID_Y_EN``. | 
|  | 9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the | 
|  | system SGPR register for | 
|  | the work-group id in the Z | 
|  | dimension (see | 
|  | :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC2.TGID_Z_EN``. | 
|  | 10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the | 
|  | system SGPR register for | 
|  | work-group information (see | 
|  | :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``. | 
|  | 12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the | 
|  | VGPR system registers used | 
|  | for the work-item ID. | 
|  | :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table` | 
|  | defines the values. | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``. | 
|  | 13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0. | 
|  |  | 
|  | Wavefront starts execution | 
|  | with address watch | 
|  | exceptions enabled which | 
|  | are generated when L1 has | 
|  | witnessed a thread access | 
|  | an *address of | 
|  | interest*. | 
|  |  | 
|  | CP is responsible for | 
|  | filling in the address | 
|  | watch bit in | 
|  | ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` | 
|  | according to what the | 
|  | runtime requests. | 
|  | 14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0. | 
|  |  | 
|  | Wavefront starts execution | 
|  | with memory violation | 
|  | exceptions exceptions | 
|  | enabled which are generated | 
|  | when a memory violation has | 
|  | occurred for this wavefront from | 
|  | L1 or LDS | 
|  | (write-to-read-only-memory, | 
|  | mis-aligned atomic, LDS | 
|  | address out of range, | 
|  | illegal address, etc.). | 
|  |  | 
|  | CP sets the memory | 
|  | violation bit in | 
|  | ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` | 
|  | according to what the | 
|  | runtime requests. | 
|  | 23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0. | 
|  |  | 
|  | CP uses the rounded value | 
|  | from the dispatch packet, | 
|  | not this value, as the | 
|  | dispatch may contain | 
|  | dynamically allocated group | 
|  | segment memory. CP writes | 
|  | directly to | 
|  | ``COMPUTE_PGM_RSRC2.LDS_SIZE``. | 
|  |  | 
|  | Amount of group segment | 
|  | (LDS) to allocate for each | 
|  | work-group. Granularity is | 
|  | device specific: | 
|  |  | 
|  | GFX6: | 
|  | roundup(lds-size / (64 * 4)) | 
|  | GFX7-GFX9: | 
|  | roundup(lds-size / (128 * 4)) | 
|  |  | 
|  | 24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution | 
|  | _INVALID_OPERATION              with specified exceptions | 
|  | enabled. | 
|  |  | 
|  | Used by CP to set up | 
|  | ``COMPUTE_PGM_RSRC2.EXCP_EN`` | 
|  | (set from bits 0..6). | 
|  |  | 
|  | IEEE 754 FP Invalid | 
|  | Operation | 
|  | 25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal one or more | 
|  | _SOURCE                         input operands is a | 
|  | denormal number | 
|  | 26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by | 
|  | _DIVISION_BY_ZERO               Zero | 
|  | 27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP FP Overflow | 
|  | _OVERFLOW | 
|  | 28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow | 
|  | _UNDERFLOW | 
|  | 29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact | 
|  | _INEXACT | 
|  | 30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero | 
|  | _ZERO                           (rcp_iflag_f32 instruction | 
|  | only) | 
|  | 31      1 bit                                   Reserved, must be 0. | 
|  | 32      **Total size 4 bytes.** | 
|  | ======= =================================================================================================================== | 
|  |  | 
|  | .. | 
|  |  | 
|  | .. table:: Floating Point Rounding Mode Enumeration Values | 
|  | :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table | 
|  |  | 
|  | ====================================== ===== ============================== | 
|  | Enumeration Name                       Value Description | 
|  | ====================================== ===== ============================== | 
|  | FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even | 
|  | FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity | 
|  | FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity | 
|  | FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0 | 
|  | ====================================== ===== ============================== | 
|  |  | 
|  | .. | 
|  |  | 
|  | .. table:: Floating Point Denorm Mode Enumeration Values | 
|  | :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table | 
|  |  | 
|  | ====================================== ===== ============================== | 
|  | Enumeration Name                       Value Description | 
|  | ====================================== ===== ============================== | 
|  | FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination | 
|  | Denorms | 
|  | FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms | 
|  | FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms | 
|  | FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush | 
|  | ====================================== ===== ============================== | 
|  |  | 
|  | .. | 
|  |  | 
|  | .. table:: System VGPR Work-Item ID Enumeration Values | 
|  | :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table | 
|  |  | 
|  | ======================================== ===== ============================ | 
|  | Enumeration Name                         Value Description | 
|  | ======================================== ===== ============================ | 
|  | SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension | 
|  | ID. | 
|  | SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y | 
|  | dimensions ID. | 
|  | SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z | 
|  | dimensions ID. | 
|  | SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined. | 
|  | ======================================== ===== ============================ | 
|  |  | 
|  | .. _amdgpu-amdhsa-initial-kernel-execution-state: | 
|  |  | 
|  | Initial Kernel Execution State | 
|  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | This section defines the register state that will be set up by the packet | 
|  | processor prior to the start of execution of every wavefront. This is limited by | 
|  | the constraints of the hardware controllers of CP/ADC/SPI. | 
|  |  | 
|  | The order of the SGPR registers is defined, but the compiler can specify which | 
|  | ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit | 
|  | fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used | 
|  | for enabled registers are dense starting at SGPR0: the first enabled register is | 
|  | SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have | 
|  | an SGPR number. | 
|  |  | 
|  | The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to | 
|  | all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using | 
|  | the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually | 
|  | initialized. These are then immediately followed by the System SGPRs that are | 
|  | set up by ADC/SPI and can have different values for each wavefront of the grid | 
|  | dispatch. | 
|  |  | 
|  | SGPR register initial state is defined in | 
|  | :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. | 
|  |  | 
|  | .. table:: SGPR Register Set Up Order | 
|  | :name: amdgpu-amdhsa-sgpr-register-set-up-order-table | 
|  |  | 
|  | ========== ========================== ====== ============================== | 
|  | SGPR Order Name                       Number Description | 
|  | (kernel descriptor enable  of | 
|  | field)                     SGPRs | 
|  | ========== ========================== ====== ============================== | 
|  | First      Private Segment Buffer     4      V# that can be used, together | 
|  | (enable_sgpr_private              with Scratch Wavefront Offset | 
|  | _segment_buffer)                  as an offset, to access the | 
|  | private memory space using a | 
|  | segment address. | 
|  |  | 
|  | CP uses the value provided by | 
|  | the runtime. | 
|  | then       Dispatch Ptr               2      64 bit address of AQL dispatch | 
|  | (enable_sgpr_dispatch_ptr)        packet for kernel dispatch | 
|  | actually executing. | 
|  | then       Queue Ptr                  2      64 bit address of amd_queue_t | 
|  | (enable_sgpr_queue_ptr)           object for AQL queue on which | 
|  | the dispatch packet was | 
|  | queued. | 
|  | then       Kernarg Segment Ptr        2      64 bit address of Kernarg | 
|  | (enable_sgpr_kernarg              segment. This is directly | 
|  | _segment_ptr)                     copied from the | 
|  | kernarg_address in the kernel | 
|  | dispatch packet. | 
|  |  | 
|  | Having CP load it once avoids | 
|  | loading it at the beginning of | 
|  | every wavefront. | 
|  | then       Dispatch Id                2      64 bit Dispatch ID of the | 
|  | (enable_sgpr_dispatch_id)         dispatch packet being | 
|  | executed. | 
|  | then       Flat Scratch Init          2      This is 2 SGPRs: | 
|  | (enable_sgpr_flat_scratch | 
|  | _init)                            GFX6 | 
|  | Not supported. | 
|  | GFX7-GFX8 | 
|  | The first SGPR is a 32 bit | 
|  | byte offset from | 
|  | ``SH_HIDDEN_PRIVATE_BASE_VIMID`` | 
|  | to per SPI base of memory | 
|  | for scratch for the queue | 
|  | executing the kernel | 
|  | dispatch. CP obtains this | 
|  | from the runtime. (The | 
|  | Scratch Segment Buffer base | 
|  | address is | 
|  | ``SH_HIDDEN_PRIVATE_BASE_VIMID`` | 
|  | plus this offset.) The value | 
|  | of Scratch Wavefront Offset must | 
|  | be added to this offset by | 
|  | the kernel machine code, | 
|  | right shifted by 8, and | 
|  | moved to the FLAT_SCRATCH_HI | 
|  | SGPR register. | 
|  | FLAT_SCRATCH_HI corresponds | 
|  | to SGPRn-4 on GFX7, and | 
|  | SGPRn-6 on GFX8 (where SGPRn | 
|  | is the highest numbered SGPR | 
|  | allocated to the wavefront). | 
|  | FLAT_SCRATCH_HI is | 
|  | multiplied by 256 (as it is | 
|  | in units of 256 bytes) and | 
|  | added to | 
|  | ``SH_HIDDEN_PRIVATE_BASE_VIMID`` | 
|  | to calculate the per wavefront | 
|  | FLAT SCRATCH BASE in flat | 
|  | memory instructions that | 
|  | access the scratch | 
|  | apperture. | 
|  |  | 
|  | The second SGPR is 32 bit | 
|  | byte size of a single | 
|  | work-item's scratch memory | 
|  | usage. CP obtains this from | 
|  | the runtime, and it is | 
|  | always a multiple of DWORD. | 
|  | CP checks that the value in | 
|  | the kernel dispatch packet | 
|  | Private Segment Byte Size is | 
|  | not larger, and requests the | 
|  | runtime to increase the | 
|  | queue's scratch size if | 
|  | necessary. The kernel code | 
|  | must move it to | 
|  | FLAT_SCRATCH_LO which is | 
|  | SGPRn-3 on GFX7 and SGPRn-5 | 
|  | on GFX8. FLAT_SCRATCH_LO is | 
|  | used as the FLAT SCRATCH | 
|  | SIZE in flat memory | 
|  | instructions. Having CP load | 
|  | it once avoids loading it at | 
|  | the beginning of every | 
|  | wavefront. | 
|  | GFX9 | 
|  | This is the | 
|  | 64 bit base address of the | 
|  | per SPI scratch backing | 
|  | memory managed by SPI for | 
|  | the queue executing the | 
|  | kernel dispatch. CP obtains | 
|  | this from the runtime (and | 
|  | divides it if there are | 
|  | multiple Shader Arrays each | 
|  | with its own SPI). The value | 
|  | of Scratch Wavefront Offset must | 
|  | be added by the kernel | 
|  | machine code and the result | 
|  | moved to the FLAT_SCRATCH | 
|  | SGPR which is SGPRn-6 and | 
|  | SGPRn-5. It is used as the | 
|  | FLAT SCRATCH BASE in flat | 
|  | memory instructions. | 
|  | then       Private Segment Size       1      The 32 bit byte size of a | 
|  | (enable_sgpr_private single | 
|  | work-item's | 
|  | scratch_segment_size) memory | 
|  | allocation. This is the | 
|  | value from the kernel | 
|  | dispatch packet Private | 
|  | Segment Byte Size rounded up | 
|  | by CP to a multiple of | 
|  | DWORD. | 
|  |  | 
|  | Having CP load it once avoids | 
|  | loading it at the beginning of | 
|  | every wavefront. | 
|  |  | 
|  | This is not used for | 
|  | GFX7-GFX8 since it is the same | 
|  | value as the second SGPR of | 
|  | Flat Scratch Init. However, it | 
|  | may be needed for GFX9 which | 
|  | changes the meaning of the | 
|  | Flat Scratch Init value. | 
|  | then       Grid Work-Group Count X    1      32 bit count of the number of | 
|  | (enable_sgpr_grid                 work-groups in the X dimension | 
|  | _workgroup_count_X)               for the grid being | 
|  | executed. Computed from the | 
|  | fields in the kernel dispatch | 
|  | packet as ((grid_size.x + | 
|  | workgroup_size.x - 1) / | 
|  | workgroup_size.x). | 
|  | then       Grid Work-Group Count Y    1      32 bit count of the number of | 
|  | (enable_sgpr_grid                 work-groups in the Y dimension | 
|  | _workgroup_count_Y &&             for the grid being | 
|  | less than 16 previous             executed. Computed from the | 
|  | SGPRs)                            fields in the kernel dispatch | 
|  | packet as ((grid_size.y + | 
|  | workgroup_size.y - 1) / | 
|  | workgroupSize.y). | 
|  |  | 
|  | Only initialized if <16 | 
|  | previous SGPRs initialized. | 
|  | then       Grid Work-Group Count Z    1      32 bit count of the number of | 
|  | (enable_sgpr_grid                 work-groups in the Z dimension | 
|  | _workgroup_count_Z &&             for the grid being | 
|  | less than 16 previous             executed. Computed from the | 
|  | SGPRs)                            fields in the kernel dispatch | 
|  | packet as ((grid_size.z + | 
|  | workgroup_size.z - 1) / | 
|  | workgroupSize.z). | 
|  |  | 
|  | Only initialized if <16 | 
|  | previous SGPRs initialized. | 
|  | then       Work-Group Id X            1      32 bit work-group id in X | 
|  | (enable_sgpr_workgroup_id         dimension of grid for | 
|  | _X)                               wavefront. | 
|  | then       Work-Group Id Y            1      32 bit work-group id in Y | 
|  | (enable_sgpr_workgroup_id         dimension of grid for | 
|  | _Y)                               wavefront. | 
|  | then       Work-Group Id Z            1      32 bit work-group id in Z | 
|  | (enable_sgpr_workgroup_id         dimension of grid for | 
|  | _Z)                               wavefront. | 
|  | then       Work-Group Info            1      {first_wavefront, 14'b0000, | 
|  | (enable_sgpr_workgroup            ordered_append_term[10:0], | 
|  | _info)                            threadgroup_size_in_wavefronts[5:0]} | 
|  | then       Scratch Wavefront Offset   1      32 bit byte offset from base | 
|  | (enable_sgpr_private              of scratch base of queue | 
|  | _segment_wavefront_offset)        executing the kernel | 
|  | dispatch. Must be used as an | 
|  | offset with Private | 
|  | segment address when using | 
|  | Scratch Segment Buffer. It | 
|  | must be used to set up FLAT | 
|  | SCRATCH for flat addressing | 
|  | (see | 
|  | :ref:`amdgpu-amdhsa-flat-scratch`). | 
|  | ========== ========================== ====== ============================== | 
|  |  | 
|  | The order of the VGPR registers is defined, but the compiler can specify which | 
|  | ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit | 
|  | fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used | 
|  | for enabled registers are dense starting at VGPR0: the first enabled register is | 
|  | VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a | 
|  | VGPR number. | 
|  |  | 
|  | VGPR register initial state is defined in | 
|  | :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`. | 
|  |  | 
|  | .. table:: VGPR Register Set Up Order | 
|  | :name: amdgpu-amdhsa-vgpr-register-set-up-order-table | 
|  |  | 
|  | ========== ========================== ====== ============================== | 
|  | VGPR Order Name                       Number Description | 
|  | (kernel descriptor enable  of | 
|  | field)                     VGPRs | 
|  | ========== ========================== ====== ============================== | 
|  | First      Work-Item Id X             1      32 bit work item id in X | 
|  | (Always initialized)              dimension of work-group for | 
|  | wavefront lane. | 
|  | then       Work-Item Id Y             1      32 bit work item id in Y | 
|  | (enable_vgpr_workitem_id          dimension of work-group for | 
|  | > 0)                              wavefront lane. | 
|  | then       Work-Item Id Z             1      32 bit work item id in Z | 
|  | (enable_vgpr_workitem_id          dimension of work-group for | 
|  | > 1)                              wavefront lane. | 
|  | ========== ========================== ====== ============================== | 
|  |  | 
|  | The setting of registers is done by GPU CP/ADC/SPI hardware as follows: | 
|  |  | 
|  | 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data | 
|  | registers. | 
|  | 2. Work-group Id registers X, Y, Z are set by ADC which supports any | 
|  | combination including none. | 
|  | 3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why | 
|  | its value cannot included with the flat scratch init value which is per queue. | 
|  | 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y) | 
|  | or (X, Y, Z). | 
|  |  | 
|  | Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit | 
|  | value to the hardware required SGPRn-3 and SGPRn-4 respectively. | 
|  |  | 
|  | The global segment can be accessed either using buffer instructions (GFX6 which | 
|  | has V# 64 bit address support), flat instructions (GFX7-GFX9), or global | 
|  | instructions (GFX9). | 
|  |  | 
|  | If buffer operations are used then the compiler can generate a V# with the | 
|  | following properties: | 
|  |  | 
|  | * base address of 0 | 
|  | * no swizzle | 
|  | * ATC: 1 if IOMMU present (such as APU) | 
|  | * ptr64: 1 | 
|  | * MTYPE set to support memory coherence that matches the runtime (such as CC for | 
|  | APU and NC for dGPU). | 
|  |  | 
|  | .. _amdgpu-amdhsa-kernel-prolog: | 
|  |  | 
|  | Kernel Prolog | 
|  | ~~~~~~~~~~~~~ | 
|  |  | 
|  | .. _amdgpu-amdhsa-m0: | 
|  |  | 
|  | M0 | 
|  | ++ | 
|  |  | 
|  | GFX6-GFX8 | 
|  | The M0 register must be initialized with a value at least the total LDS size | 
|  | if the kernel may access LDS via DS or flat operations. Total LDS size is | 
|  | available in dispatch packet. For M0, it is also possible to use maximum | 
|  | possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for | 
|  | GFX7-GFX8). | 
|  | GFX9 | 
|  | The M0 register is not used for range checking LDS accesses and so does not | 
|  | need to be initialized in the prolog. | 
|  |  | 
|  | .. _amdgpu-amdhsa-flat-scratch: | 
|  |  | 
|  | Flat Scratch | 
|  | ++++++++++++ | 
|  |  | 
|  | If the kernel may use flat operations to access scratch memory, the prolog code | 
|  | must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which | 
|  | are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront | 
|  | Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`): | 
|  |  | 
|  | GFX6 | 
|  | Flat scratch is not supported. | 
|  |  | 
|  | GFX7-GFX8 | 
|  | 1. The low word of Flat Scratch Init is 32 bit byte offset from | 
|  | ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory | 
|  | being managed by SPI for the queue executing the kernel dispatch. This is | 
|  | the same value used in the Scratch Segment Buffer V# base address. The | 
|  | prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte | 
|  | scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since | 
|  | FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted | 
|  | by 8 before moving into FLAT_SCRATCH_LO. | 
|  | 2. The second word of Flat Scratch Init is 32 bit byte size of a single | 
|  | work-items scratch memory usage. This is directly loaded from the kernel | 
|  | dispatch packet Private Segment Byte Size and rounded up to a multiple of | 
|  | DWORD. Having CP load it once avoids loading it at the beginning of every | 
|  | wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH | 
|  | SIZE. | 
|  |  | 
|  | GFX9 | 
|  | The Flat Scratch Init is the 64 bit address of the base of scratch backing | 
|  | memory being managed by SPI for the queue executing the kernel dispatch. The | 
|  | prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH | 
|  | pair for use as the flat scratch base in flat memory instructions. | 
|  |  | 
|  | .. _amdgpu-amdhsa-memory-model: | 
|  |  | 
|  | Memory Model | 
|  | ~~~~~~~~~~~~ | 
|  |  | 
|  | This section describes the mapping of LLVM memory model onto AMDGPU machine code | 
|  | (see :ref:`memmodel`). *The implementation is WIP.* | 
|  |  | 
|  | .. TODO | 
|  | Update when implementation complete. | 
|  |  | 
|  | The AMDGPU backend supports the memory synchronization scopes specified in | 
|  | :ref:`amdgpu-memory-scopes`. | 
|  |  | 
|  | The code sequences used to implement the memory model are defined in table | 
|  | :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`. | 
|  |  | 
|  | The sequences specify the order of instructions that a single thread must | 
|  | execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect | 
|  | to other memory instructions executed by the same thread. This allows them to be | 
|  | moved earlier or later which can allow them to be combined with other instances | 
|  | of the same instruction, or hoisted/sunk out of loops to improve | 
|  | performance. Only the instructions related to the memory model are given; | 
|  | additional ``s_waitcnt`` instructions are required to ensure registers are | 
|  | defined before being used. These may be able to be combined with the memory | 
|  | model ``s_waitcnt`` instructions as described above. | 
|  |  | 
|  | The AMDGPU backend supports the following memory models: | 
|  |  | 
|  | HSA Memory Model [HSA]_ | 
|  | The HSA memory model uses a single happens-before relation for all address | 
|  | spaces (see :ref:`amdgpu-address-spaces`). | 
|  | OpenCL Memory Model [OpenCL]_ | 
|  | The OpenCL memory model which has separate happens-before relations for the | 
|  | global and local address spaces. Only a fence specifying both global and | 
|  | local address space, and seq_cst instructions join the relationships. Since | 
|  | the LLVM ``memfence`` instruction does not allow an address space to be | 
|  | specified the OpenCL fence has to convervatively assume both local and | 
|  | global address space was specified. However, optimizations can often be | 
|  | done to eliminate the additional ``s_waitcnt`` instructions when there are | 
|  | no intervening memory instructions which access the corresponding address | 
|  | space. The code sequences in the table indicate what can be omitted for the | 
|  | OpenCL memory. The target triple environment is used to determine if the | 
|  | source language is OpenCL (see :ref:`amdgpu-opencl`). | 
|  |  | 
|  | ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS | 
|  | operations. | 
|  |  | 
|  | ``buffer/global/flat_load/store/atomic`` instructions to global memory are | 
|  | termed vector memory operations. | 
|  |  | 
|  | For GFX6-GFX9: | 
|  |  | 
|  | * Each agent has multiple compute units (CU). | 
|  | * Each CU has multiple SIMDs that execute wavefronts. | 
|  | * The wavefronts for a single work-group are executed in the same CU but may be | 
|  | executed by different SIMDs. | 
|  | * Each CU has a single LDS memory shared by the wavefronts of the work-groups | 
|  | executing on it. | 
|  | * All LDS operations of a CU are performed as wavefront wide operations in a | 
|  | global order and involve no caching. Completion is reported to a wavefront in | 
|  | execution order. | 
|  | * The LDS memory has multiple request queues shared by the SIMDs of a | 
|  | CU. Therefore, the LDS operations performed by different wavefronts of a work-group | 
|  | can be reordered relative to each other, which can result in reordering the | 
|  | visibility of vector memory operations with respect to LDS operations of other | 
|  | wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to | 
|  | ensure synchronization between LDS operations and vector memory operations | 
|  | between wavefronts of a work-group, but not between operations performed by the | 
|  | same wavefront. | 
|  | * The vector memory operations are performed as wavefront wide operations and | 
|  | completion is reported to a wavefront in execution order. The exception is | 
|  | that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of | 
|  | vector memory order if they access LDS memory, and out of LDS operation order | 
|  | if they access global memory. | 
|  | * The vector memory operations access a single vector L1 cache shared by all | 
|  | SIMDs a CU. Therefore, no special action is required for coherence between the | 
|  | lanes of a single wavefront, or for coherence between wavefronts in the same | 
|  | work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts | 
|  | executing in different work-groups as they may be executing on different CUs. | 
|  | * The scalar memory operations access a scalar L1 cache shared by all wavefronts | 
|  | on a group of CUs. The scalar and vector L1 caches are not coherent. However, | 
|  | scalar operations are used in a restricted way so do not impact the memory | 
|  | model. See :ref:`amdgpu-amdhsa-memory-spaces`. | 
|  | * The vector and scalar memory operations use an L2 cache shared by all CUs on | 
|  | the same agent. | 
|  | * The L2 cache has independent channels to service disjoint ranges of virtual | 
|  | addresses. | 
|  | * Each CU has a separate request queue per channel. Therefore, the vector and | 
|  | scalar memory operations performed by wavefronts executing in different work-groups | 
|  | (which may be executing on different CUs) of an agent can be reordered | 
|  | relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure | 
|  | synchronization between vector memory operations of different CUs. It ensures a | 
|  | previous vector memory operation has completed before executing a subsequent | 
|  | vector memory or LDS operation and so can be used to meet the requirements of | 
|  | acquire and release. | 
|  | * The L2 cache can be kept coherent with other agents on some targets, or ranges | 
|  | of virtual addresses can be set up to bypass it to ensure system coherence. | 
|  |  | 
|  | Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8), | 
|  | or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the | 
|  | memory, atomic memory orderings are not meaningful and all accesses are treated | 
|  | as non-atomic. | 
|  |  | 
|  | Constant address space uses ``buffer/global_load`` instructions (or equivalent | 
|  | scalar memory instructions). Since the constant address space contents do not | 
|  | change during the execution of a kernel dispatch it is not legal to perform | 
|  | stores, and atomic memory orderings are not meaningful and all access are | 
|  | treated as non-atomic. | 
|  |  | 
|  | A memory synchronization scope wider than work-group is not meaningful for the | 
|  | group (LDS) address space and is treated as work-group. | 
|  |  | 
|  | The memory model does not support the region address space which is treated as | 
|  | non-atomic. | 
|  |  | 
|  | Acquire memory ordering is not meaningful on store atomic instructions and is | 
|  | treated as non-atomic. | 
|  |  | 
|  | Release memory ordering is not meaningful on load atomic instructions and is | 
|  | treated a non-atomic. | 
|  |  | 
|  | Acquire-release memory ordering is not meaningful on load or store atomic | 
|  | instructions and is treated as acquire and release respectively. | 
|  |  | 
|  | AMDGPU backend only uses scalar memory operations to access memory that is | 
|  | proven to not change during the execution of the kernel dispatch. This includes | 
|  | constant address space and global address space for program scope const | 
|  | variables. Therefore the kernel machine code does not have to maintain the | 
|  | scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar | 
|  | and vector L1 caches are invalidated between kernel dispatches by CP since | 
|  | constant address space data may change between kernel dispatch executions. See | 
|  | :ref:`amdgpu-amdhsa-memory-spaces`. | 
|  |  | 
|  | The one execption is if scalar writes are used to spill SGPR registers. In this | 
|  | case the AMDGPU backend ensures the memory location used to spill is never | 
|  | accessed by vector memory operations at the same time. If scalar writes are used | 
|  | then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function | 
|  | return since the locations may be used for vector memory instructions by a | 
|  | future wavefront that uses the same scratch area, or a function call that creates a | 
|  | frame at the same address, respectively. There is no need for a ``s_dcache_inv`` | 
|  | as all scalar writes are write-before-read in the same thread. | 
|  |  | 
|  | Scratch backing memory (which is used for the private address space) | 
|  | is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private | 
|  | address space is only accessed by a single thread, and is always | 
|  | write-before-read, there is never a need to invalidate these entries from the L1 | 
|  | cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the | 
|  | volatile cache lines. | 
|  |  | 
|  | On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing | 
|  | to invalidate the L2 cache. This also causes it to be treated as | 
|  | non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC | 
|  | (cache coherent) and so the L2 cache will coherent with the CPU and other | 
|  | agents. | 
|  |  | 
|  | .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9 | 
|  | :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table | 
|  |  | 
|  | ============ ============ ============== ========== =============================== | 
|  | LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code | 
|  | Ordering     Sync Scope     Address | 
|  | Space | 
|  | ============ ============ ============== ========== =============================== | 
|  | **Non-Atomic** | 
|  | ----------------------------------------------------------------------------------- | 
|  | load         *none*       *none*         - global   - !volatile & !nontemporal | 
|  | - generic | 
|  | - private    1. buffer/global/flat_load | 
|  | - constant | 
|  | - volatile & !nontemporal | 
|  |  | 
|  | 1. buffer/global/flat_load | 
|  | glc=1 | 
|  |  | 
|  | - nontemporal | 
|  |  | 
|  | 1. buffer/global/flat_load | 
|  | glc=1 slc=1 | 
|  |  | 
|  | load         *none*       *none*         - local    1. ds_load | 
|  | store        *none*       *none*         - global   - !nontemporal | 
|  | - generic | 
|  | - private    1. buffer/global/flat_store | 
|  | - constant | 
|  | - nontemporal | 
|  |  | 
|  | 1. buffer/global/flat_stote | 
|  | glc=1 slc=1 | 
|  |  | 
|  | store        *none*       *none*         - local    1. ds_store | 
|  | **Unordered Atomic** | 
|  | ----------------------------------------------------------------------------------- | 
|  | load atomic  unordered    *any*          *any*      *Same as non-atomic*. | 
|  | store atomic unordered    *any*          *any*      *Same as non-atomic*. | 
|  | atomicrmw    unordered    *any*          *any*      *Same as monotonic | 
|  | atomic*. | 
|  | **Monotonic Atomic** | 
|  | ----------------------------------------------------------------------------------- | 
|  | load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load | 
|  | - wavefront    - generic | 
|  | - workgroup | 
|  | load atomic  monotonic    - singlethread - local    1. ds_load | 
|  | - wavefront | 
|  | - workgroup | 
|  | load atomic  monotonic    - agent        - global   1. buffer/global/flat_load | 
|  | - system       - generic     glc=1 | 
|  | store atomic monotonic    - singlethread - global   1. buffer/global/flat_store | 
|  | - wavefront    - generic | 
|  | - workgroup | 
|  | - agent | 
|  | - system | 
|  | store atomic monotonic    - singlethread - local    1. ds_store | 
|  | - wavefront | 
|  | - workgroup | 
|  | atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic | 
|  | - wavefront    - generic | 
|  | - workgroup | 
|  | - agent | 
|  | - system | 
|  | atomicrmw    monotonic    - singlethread - local    1. ds_atomic | 
|  | - wavefront | 
|  | - workgroup | 
|  | **Acquire Atomic** | 
|  | ----------------------------------------------------------------------------------- | 
|  | load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load | 
|  | - wavefront    - local | 
|  | - generic | 
|  | load atomic  acquire      - workgroup    - global   1. buffer/global/flat_load | 
|  | load atomic  acquire      - workgroup    - local    1. ds_load | 
|  | 2. s_waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit. | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/store/store | 
|  | atomic/atomicrmw. | 
|  | - Ensures any | 
|  | following global | 
|  | data read is no | 
|  | older than the load | 
|  | atomic value being | 
|  | acquired. | 
|  | load atomic  acquire      - workgroup    - generic  1. flat_load | 
|  | 2. s_waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit. | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/store/store | 
|  | atomic/atomicrmw. | 
|  | - Ensures any | 
|  | following global | 
|  | data read is no | 
|  | older than the load | 
|  | atomic value being | 
|  | acquired. | 
|  | load atomic  acquire      - agent        - global   1. buffer/global/flat_load | 
|  | - system                     glc=1 | 
|  | 2. s_waitcnt vmcnt(0) | 
|  |  | 
|  | - Must happen before | 
|  | following | 
|  | buffer_wbinvl1_vol. | 
|  | - Ensures the load | 
|  | has completed | 
|  | before invalidating | 
|  | the cache. | 
|  |  | 
|  | 3. buffer_wbinvl1_vol | 
|  |  | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/atomicrmw. | 
|  | - Ensures that | 
|  | following | 
|  | loads will not see | 
|  | stale global data. | 
|  |  | 
|  | load atomic  acquire      - agent        - generic  1. flat_load glc=1 | 
|  | - system                  2. s_waitcnt vmcnt(0) & | 
|  | lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL omit | 
|  | lgkmcnt(0). | 
|  | - Must happen before | 
|  | following | 
|  | buffer_wbinvl1_vol. | 
|  | - Ensures the flat_load | 
|  | has completed | 
|  | before invalidating | 
|  | the cache. | 
|  |  | 
|  | 3. buffer_wbinvl1_vol | 
|  |  | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/atomicrmw. | 
|  | - Ensures that | 
|  | following loads | 
|  | will not see stale | 
|  | global data. | 
|  |  | 
|  | atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic | 
|  | - wavefront    - local | 
|  | - generic | 
|  | atomicrmw    acquire      - workgroup    - global   1. buffer/global/flat_atomic | 
|  | atomicrmw    acquire      - workgroup    - local    1. ds_atomic | 
|  | 2. waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit. | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/store/store | 
|  | atomic/atomicrmw. | 
|  | - Ensures any | 
|  | following global | 
|  | data read is no | 
|  | older than the | 
|  | atomicrmw value | 
|  | being acquired. | 
|  |  | 
|  | atomicrmw    acquire      - workgroup    - generic  1. flat_atomic | 
|  | 2. waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit. | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/store/store | 
|  | atomic/atomicrmw. | 
|  | - Ensures any | 
|  | following global | 
|  | data read is no | 
|  | older than the | 
|  | atomicrmw value | 
|  | being acquired. | 
|  |  | 
|  | atomicrmw    acquire      - agent        - global   1. buffer/global/flat_atomic | 
|  | - system                  2. s_waitcnt vmcnt(0) | 
|  |  | 
|  | - Must happen before | 
|  | following | 
|  | buffer_wbinvl1_vol. | 
|  | - Ensures the | 
|  | atomicrmw has | 
|  | completed before | 
|  | invalidating the | 
|  | cache. | 
|  |  | 
|  | 3. buffer_wbinvl1_vol | 
|  |  | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/atomicrmw. | 
|  | - Ensures that | 
|  | following loads | 
|  | will not see stale | 
|  | global data. | 
|  |  | 
|  | atomicrmw    acquire      - agent        - generic  1. flat_atomic | 
|  | - system                  2. s_waitcnt vmcnt(0) & | 
|  | lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit | 
|  | lgkmcnt(0). | 
|  | - Must happen before | 
|  | following | 
|  | buffer_wbinvl1_vol. | 
|  | - Ensures the | 
|  | atomicrmw has | 
|  | completed before | 
|  | invalidating the | 
|  | cache. | 
|  |  | 
|  | 3. buffer_wbinvl1_vol | 
|  |  | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/atomicrmw. | 
|  | - Ensures that | 
|  | following loads | 
|  | will not see stale | 
|  | global data. | 
|  |  | 
|  | fence        acquire      - singlethread *none*     *none* | 
|  | - wavefront | 
|  | fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL and | 
|  | address space is | 
|  | not generic, omit. | 
|  | - However, since LLVM | 
|  | currently has no | 
|  | address space on | 
|  | the fence need to | 
|  | conservatively | 
|  | always generate. If | 
|  | fence had an | 
|  | address space then | 
|  | set to address | 
|  | space of OpenCL | 
|  | fence flag, or to | 
|  | generic if both | 
|  | local and global | 
|  | flags are | 
|  | specified. | 
|  | - Must happen after | 
|  | any preceding | 
|  | local/generic load | 
|  | atomic/atomicrmw | 
|  | with an equal or | 
|  | wider sync scope | 
|  | and memory ordering | 
|  | stronger than | 
|  | unordered (this is | 
|  | termed the | 
|  | fence-paired-atomic). | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/store/store | 
|  | atomic/atomicrmw. | 
|  | - Ensures any | 
|  | following global | 
|  | data read is no | 
|  | older than the | 
|  | value read by the | 
|  | fence-paired-atomic. | 
|  |  | 
|  | fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) & | 
|  | - system                     vmcnt(0) | 
|  |  | 
|  | - If OpenCL and | 
|  | address space is | 
|  | not generic, omit | 
|  | lgkmcnt(0). | 
|  | - However, since LLVM | 
|  | currently has no | 
|  | address space on | 
|  | the fence need to | 
|  | conservatively | 
|  | always generate | 
|  | (see comment for | 
|  | previous fence). | 
|  | - Could be split into | 
|  | separate s_waitcnt | 
|  | vmcnt(0) and | 
|  | s_waitcnt | 
|  | lgkmcnt(0) to allow | 
|  | them to be | 
|  | independently moved | 
|  | according to the | 
|  | following rules. | 
|  | - s_waitcnt vmcnt(0) | 
|  | must happen after | 
|  | any preceding | 
|  | global/generic load | 
|  | atomic/atomicrmw | 
|  | with an equal or | 
|  | wider sync scope | 
|  | and memory ordering | 
|  | stronger than | 
|  | unordered (this is | 
|  | termed the | 
|  | fence-paired-atomic). | 
|  | - s_waitcnt lgkmcnt(0) | 
|  | must happen after | 
|  | any preceding | 
|  | local/generic load | 
|  | atomic/atomicrmw | 
|  | with an equal or | 
|  | wider sync scope | 
|  | and memory ordering | 
|  | stronger than | 
|  | unordered (this is | 
|  | termed the | 
|  | fence-paired-atomic). | 
|  | - Must happen before | 
|  | the following | 
|  | buffer_wbinvl1_vol. | 
|  | - Ensures that the | 
|  | fence-paired atomic | 
|  | has completed | 
|  | before invalidating | 
|  | the | 
|  | cache. Therefore | 
|  | any following | 
|  | locations read must | 
|  | be no older than | 
|  | the value read by | 
|  | the | 
|  | fence-paired-atomic. | 
|  |  | 
|  | 2. buffer_wbinvl1_vol | 
|  |  | 
|  | - Must happen before any | 
|  | following global/generic | 
|  | load/load | 
|  | atomic/store/store | 
|  | atomic/atomicrmw. | 
|  | - Ensures that | 
|  | following loads | 
|  | will not see stale | 
|  | global data. | 
|  |  | 
|  | **Release Atomic** | 
|  | ----------------------------------------------------------------------------------- | 
|  | store atomic release      - singlethread - global   1. buffer/global/ds/flat_store | 
|  | - wavefront    - local | 
|  | - generic | 
|  | store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit. | 
|  | - Must happen after | 
|  | any preceding | 
|  | local/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - Must happen before | 
|  | the following | 
|  | store. | 
|  | - Ensures that all | 
|  | memory operations | 
|  | to local have | 
|  | completed before | 
|  | performing the | 
|  | store that is being | 
|  | released. | 
|  |  | 
|  | 2. buffer/global/flat_store | 
|  | store atomic release      - workgroup    - local    1. ds_store | 
|  | store atomic release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit. | 
|  | - Must happen after | 
|  | any preceding | 
|  | local/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - Must happen before | 
|  | the following | 
|  | store. | 
|  | - Ensures that all | 
|  | memory operations | 
|  | to local have | 
|  | completed before | 
|  | performing the | 
|  | store that is being | 
|  | released. | 
|  |  | 
|  | 2. flat_store | 
|  | store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) & | 
|  | - system       - generic     vmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit | 
|  | lgkmcnt(0). | 
|  | - Could be split into | 
|  | separate s_waitcnt | 
|  | vmcnt(0) and | 
|  | s_waitcnt | 
|  | lgkmcnt(0) to allow | 
|  | them to be | 
|  | independently moved | 
|  | according to the | 
|  | following rules. | 
|  | - s_waitcnt vmcnt(0) | 
|  | must happen after | 
|  | any preceding | 
|  | global/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - s_waitcnt lgkmcnt(0) | 
|  | must happen after | 
|  | any preceding | 
|  | local/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - Must happen before | 
|  | the following | 
|  | store. | 
|  | - Ensures that all | 
|  | memory operations | 
|  | to memory have | 
|  | completed before | 
|  | performing the | 
|  | store that is being | 
|  | released. | 
|  |  | 
|  | 2. buffer/global/ds/flat_store | 
|  | atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic | 
|  | - wavefront    - local | 
|  | - generic | 
|  | atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit. | 
|  | - Must happen after | 
|  | any preceding | 
|  | local/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - Must happen before | 
|  | the following | 
|  | atomicrmw. | 
|  | - Ensures that all | 
|  | memory operations | 
|  | to local have | 
|  | completed before | 
|  | performing the | 
|  | atomicrmw that is | 
|  | being released. | 
|  |  | 
|  | 2. buffer/global/flat_atomic | 
|  | atomicrmw    release      - workgroup    - local    1. ds_atomic | 
|  | atomicrmw    release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit. | 
|  | - Must happen after | 
|  | any preceding | 
|  | local/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - Must happen before | 
|  | the following | 
|  | atomicrmw. | 
|  | - Ensures that all | 
|  | memory operations | 
|  | to local have | 
|  | completed before | 
|  | performing the | 
|  | atomicrmw that is | 
|  | being released. | 
|  |  | 
|  | 2. flat_atomic | 
|  | atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) & | 
|  | - system       - generic     vmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit | 
|  | lgkmcnt(0). | 
|  | - Could be split into | 
|  | separate s_waitcnt | 
|  | vmcnt(0) and | 
|  | s_waitcnt | 
|  | lgkmcnt(0) to allow | 
|  | them to be | 
|  | independently moved | 
|  | according to the | 
|  | following rules. | 
|  | - s_waitcnt vmcnt(0) | 
|  | must happen after | 
|  | any preceding | 
|  | global/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - s_waitcnt lgkmcnt(0) | 
|  | must happen after | 
|  | any preceding | 
|  | local/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - Must happen before | 
|  | the following | 
|  | atomicrmw. | 
|  | - Ensures that all | 
|  | memory operations | 
|  | to global and local | 
|  | have completed | 
|  | before performing | 
|  | the atomicrmw that | 
|  | is being released. | 
|  |  | 
|  | 2. buffer/global/ds/flat_atomic | 
|  | fence        release      - singlethread *none*     *none* | 
|  | - wavefront | 
|  | fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL and | 
|  | address space is | 
|  | not generic, omit. | 
|  | - However, since LLVM | 
|  | currently has no | 
|  | address space on | 
|  | the fence need to | 
|  | conservatively | 
|  | always generate. If | 
|  | fence had an | 
|  | address space then | 
|  | set to address | 
|  | space of OpenCL | 
|  | fence flag, or to | 
|  | generic if both | 
|  | local and global | 
|  | flags are | 
|  | specified. | 
|  | - Must happen after | 
|  | any preceding | 
|  | local/generic | 
|  | load/load | 
|  | atomic/store/store | 
|  | atomic/atomicrmw. | 
|  | - Must happen before | 
|  | any following store | 
|  | atomic/atomicrmw | 
|  | with an equal or | 
|  | wider sync scope | 
|  | and memory ordering | 
|  | stronger than | 
|  | unordered (this is | 
|  | termed the | 
|  | fence-paired-atomic). | 
|  | - Ensures that all | 
|  | memory operations | 
|  | to local have | 
|  | completed before | 
|  | performing the | 
|  | following | 
|  | fence-paired-atomic. | 
|  |  | 
|  | fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) & | 
|  | - system                     vmcnt(0) | 
|  |  | 
|  | - If OpenCL and | 
|  | address space is | 
|  | not generic, omit | 
|  | lgkmcnt(0). | 
|  | - If OpenCL and | 
|  | address space is | 
|  | local, omit | 
|  | vmcnt(0). | 
|  | - However, since LLVM | 
|  | currently has no | 
|  | address space on | 
|  | the fence need to | 
|  | conservatively | 
|  | always generate. If | 
|  | fence had an | 
|  | address space then | 
|  | set to address | 
|  | space of OpenCL | 
|  | fence flag, or to | 
|  | generic if both | 
|  | local and global | 
|  | flags are | 
|  | specified. | 
|  | - Could be split into | 
|  | separate s_waitcnt | 
|  | vmcnt(0) and | 
|  | s_waitcnt | 
|  | lgkmcnt(0) to allow | 
|  | them to be | 
|  | independently moved | 
|  | according to the | 
|  | following rules. | 
|  | - s_waitcnt vmcnt(0) | 
|  | must happen after | 
|  | any preceding | 
|  | global/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - s_waitcnt lgkmcnt(0) | 
|  | must happen after | 
|  | any preceding | 
|  | local/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - Must happen before | 
|  | any following store | 
|  | atomic/atomicrmw | 
|  | with an equal or | 
|  | wider sync scope | 
|  | and memory ordering | 
|  | stronger than | 
|  | unordered (this is | 
|  | termed the | 
|  | fence-paired-atomic). | 
|  | - Ensures that all | 
|  | memory operations | 
|  | have | 
|  | completed before | 
|  | performing the | 
|  | following | 
|  | fence-paired-atomic. | 
|  |  | 
|  | **Acquire-Release Atomic** | 
|  | ----------------------------------------------------------------------------------- | 
|  | atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic | 
|  | - wavefront    - local | 
|  | - generic | 
|  | atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit. | 
|  | - Must happen after | 
|  | any preceding | 
|  | local/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - Must happen before | 
|  | the following | 
|  | atomicrmw. | 
|  | - Ensures that all | 
|  | memory operations | 
|  | to local have | 
|  | completed before | 
|  | performing the | 
|  | atomicrmw that is | 
|  | being released. | 
|  |  | 
|  | 2. buffer/global/flat_atomic | 
|  | atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic | 
|  | 2. s_waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit. | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/store/store | 
|  | atomic/atomicrmw. | 
|  | - Ensures any | 
|  | following global | 
|  | data read is no | 
|  | older than the load | 
|  | atomic value being | 
|  | acquired. | 
|  |  | 
|  | atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit. | 
|  | - Must happen after | 
|  | any preceding | 
|  | local/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - Must happen before | 
|  | the following | 
|  | atomicrmw. | 
|  | - Ensures that all | 
|  | memory operations | 
|  | to local have | 
|  | completed before | 
|  | performing the | 
|  | atomicrmw that is | 
|  | being released. | 
|  |  | 
|  | 2. flat_atomic | 
|  | 3. s_waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit. | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/store/store | 
|  | atomic/atomicrmw. | 
|  | - Ensures any | 
|  | following global | 
|  | data read is no | 
|  | older than the load | 
|  | atomic value being | 
|  | acquired. | 
|  |  | 
|  | atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) & | 
|  | - system                     vmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit | 
|  | lgkmcnt(0). | 
|  | - Could be split into | 
|  | separate s_waitcnt | 
|  | vmcnt(0) and | 
|  | s_waitcnt | 
|  | lgkmcnt(0) to allow | 
|  | them to be | 
|  | independently moved | 
|  | according to the | 
|  | following rules. | 
|  | - s_waitcnt vmcnt(0) | 
|  | must happen after | 
|  | any preceding | 
|  | global/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - s_waitcnt lgkmcnt(0) | 
|  | must happen after | 
|  | any preceding | 
|  | local/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - Must happen before | 
|  | the following | 
|  | atomicrmw. | 
|  | - Ensures that all | 
|  | memory operations | 
|  | to global have | 
|  | completed before | 
|  | performing the | 
|  | atomicrmw that is | 
|  | being released. | 
|  |  | 
|  | 2. buffer/global/flat_atomic | 
|  | 3. s_waitcnt vmcnt(0) | 
|  |  | 
|  | - Must happen before | 
|  | following | 
|  | buffer_wbinvl1_vol. | 
|  | - Ensures the | 
|  | atomicrmw has | 
|  | completed before | 
|  | invalidating the | 
|  | cache. | 
|  |  | 
|  | 4. buffer_wbinvl1_vol | 
|  |  | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/atomicrmw. | 
|  | - Ensures that | 
|  | following loads | 
|  | will not see stale | 
|  | global data. | 
|  |  | 
|  | atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) & | 
|  | - system                     vmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit | 
|  | lgkmcnt(0). | 
|  | - Could be split into | 
|  | separate s_waitcnt | 
|  | vmcnt(0) and | 
|  | s_waitcnt | 
|  | lgkmcnt(0) to allow | 
|  | them to be | 
|  | independently moved | 
|  | according to the | 
|  | following rules. | 
|  | - s_waitcnt vmcnt(0) | 
|  | must happen after | 
|  | any preceding | 
|  | global/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - s_waitcnt lgkmcnt(0) | 
|  | must happen after | 
|  | any preceding | 
|  | local/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - Must happen before | 
|  | the following | 
|  | atomicrmw. | 
|  | - Ensures that all | 
|  | memory operations | 
|  | to global have | 
|  | completed before | 
|  | performing the | 
|  | atomicrmw that is | 
|  | being released. | 
|  |  | 
|  | 2. flat_atomic | 
|  | 3. s_waitcnt vmcnt(0) & | 
|  | lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL, omit | 
|  | lgkmcnt(0). | 
|  | - Must happen before | 
|  | following | 
|  | buffer_wbinvl1_vol. | 
|  | - Ensures the | 
|  | atomicrmw has | 
|  | completed before | 
|  | invalidating the | 
|  | cache. | 
|  |  | 
|  | 4. buffer_wbinvl1_vol | 
|  |  | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/atomicrmw. | 
|  | - Ensures that | 
|  | following loads | 
|  | will not see stale | 
|  | global data. | 
|  |  | 
|  | fence        acq_rel      - singlethread *none*     *none* | 
|  | - wavefront | 
|  | fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) | 
|  |  | 
|  | - If OpenCL and | 
|  | address space is | 
|  | not generic, omit. | 
|  | - However, | 
|  | since LLVM | 
|  | currently has no | 
|  | address space on | 
|  | the fence need to | 
|  | conservatively | 
|  | always generate | 
|  | (see comment for | 
|  | previous fence). | 
|  | - Must happen after | 
|  | any preceding | 
|  | local/generic | 
|  | load/load | 
|  | atomic/store/store | 
|  | atomic/atomicrmw. | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/store/store | 
|  | atomic/atomicrmw. | 
|  | - Ensures that all | 
|  | memory operations | 
|  | to local have | 
|  | completed before | 
|  | performing any | 
|  | following global | 
|  | memory operations. | 
|  | - Ensures that the | 
|  | preceding | 
|  | local/generic load | 
|  | atomic/atomicrmw | 
|  | with an equal or | 
|  | wider sync scope | 
|  | and memory ordering | 
|  | stronger than | 
|  | unordered (this is | 
|  | termed the | 
|  | acquire-fence-paired-atomic | 
|  | ) has completed | 
|  | before following | 
|  | global memory | 
|  | operations. This | 
|  | satisfies the | 
|  | requirements of | 
|  | acquire. | 
|  | - Ensures that all | 
|  | previous memory | 
|  | operations have | 
|  | completed before a | 
|  | following | 
|  | local/generic store | 
|  | atomic/atomicrmw | 
|  | with an equal or | 
|  | wider sync scope | 
|  | and memory ordering | 
|  | stronger than | 
|  | unordered (this is | 
|  | termed the | 
|  | release-fence-paired-atomic | 
|  | ). This satisfies the | 
|  | requirements of | 
|  | release. | 
|  |  | 
|  | fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) & | 
|  | - system                     vmcnt(0) | 
|  |  | 
|  | - If OpenCL and | 
|  | address space is | 
|  | not generic, omit | 
|  | lgkmcnt(0). | 
|  | - However, since LLVM | 
|  | currently has no | 
|  | address space on | 
|  | the fence need to | 
|  | conservatively | 
|  | always generate | 
|  | (see comment for | 
|  | previous fence). | 
|  | - Could be split into | 
|  | separate s_waitcnt | 
|  | vmcnt(0) and | 
|  | s_waitcnt | 
|  | lgkmcnt(0) to allow | 
|  | them to be | 
|  | independently moved | 
|  | according to the | 
|  | following rules. | 
|  | - s_waitcnt vmcnt(0) | 
|  | must happen after | 
|  | any preceding | 
|  | global/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - s_waitcnt lgkmcnt(0) | 
|  | must happen after | 
|  | any preceding | 
|  | local/generic | 
|  | load/store/load | 
|  | atomic/store | 
|  | atomic/atomicrmw. | 
|  | - Must happen before | 
|  | the following | 
|  | buffer_wbinvl1_vol. | 
|  | - Ensures that the | 
|  | preceding | 
|  | global/local/generic | 
|  | load | 
|  | atomic/atomicrmw | 
|  | with an equal or | 
|  | wider sync scope | 
|  | and memory ordering | 
|  | stronger than | 
|  | unordered (this is | 
|  | termed the | 
|  | acquire-fence-paired-atomic | 
|  | ) has completed | 
|  | before invalidating | 
|  | the cache. This | 
|  | satisfies the | 
|  | requirements of | 
|  | acquire. | 
|  | - Ensures that all | 
|  | previous memory | 
|  | operations have | 
|  | completed before a | 
|  | following | 
|  | global/local/generic | 
|  | store | 
|  | atomic/atomicrmw | 
|  | with an equal or | 
|  | wider sync scope | 
|  | and memory ordering | 
|  | stronger than | 
|  | unordered (this is | 
|  | termed the | 
|  | release-fence-paired-atomic | 
|  | ). This satisfies the | 
|  | requirements of | 
|  | release. | 
|  |  | 
|  | 2. buffer_wbinvl1_vol | 
|  |  | 
|  | - Must happen before | 
|  | any following | 
|  | global/generic | 
|  | load/load | 
|  | atomic/store/store | 
|  | atomic/atomicrmw. | 
|  | - Ensures that | 
|  | following loads | 
|  | will not see stale | 
|  | global data. This | 
|  | satisfies the | 
|  | requirements of | 
|  | acquire. | 
|  |  | 
|  | **Sequential Consistent Atomic** | 
|  | ----------------------------------------------------------------------------------- | 
|  | load atomic  seq_cst      - singlethread - global   *Same as corresponding | 
|  | - wavefront    - local    load atomic acquire, | 
|  | - generic  except must generated | 
|  | all instructions even | 
|  | for OpenCL.* | 
|  | load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0) | 
|  | - generic | 
|  | - Must | 
|  | happen after | 
|  | preceding | 
|  | global/generic load | 
|  | atomic/store | 
|  | atomic/atomicrmw | 
|  | with memory | 
|  | ordering of seq_cst | 
|  | and with equal or | 
|  | wider sync scope. | 
|  | (Note that seq_cst | 
|  | fences have their | 
|  | own s_waitcnt | 
|  | lgkmcnt(0) and so do | 
|  | not need to be | 
|  | considered.) | 
|  | - Ensures any | 
|  | preceding | 
|  | sequential | 
|  | consistent local | 
|  | memory instructions | 
|  | have completed | 
|  | before executing | 
|  | this sequentially | 
|  | consistent | 
|  | instruction. This | 
|  | prevents reordering | 
|  | a seq_cst store | 
|  | followed by a | 
|  | seq_cst load. (Note | 
|  | that seq_cst is | 
|  | stronger than | 
|  | acquire/release as | 
|  | the reordering of | 
|  | load acquire | 
|  | followed by a store | 
|  | release is | 
|  | prevented by the | 
|  | waitcnt of | 
|  | the release, but | 
|  | there is nothing | 
|  | preventing a store | 
|  | release followed by | 
|  | load acquire from | 
|  | competing out of | 
|  | order.) | 
|  |  | 
|  | 2. *Following | 
|  | instructions same as | 
|  | corresponding load | 
|  | atomic acquire, | 
|  | except must generated | 
|  | all instructions even | 
|  | for OpenCL.* | 
|  | load atomic  seq_cst      - workgroup    - local    *Same as corresponding | 
|  | load atomic acquire, | 
|  | except must generated | 
|  | all instructions even | 
|  | for OpenCL.* | 
|  | load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) & | 
|  | - system       - generic     vmcnt(0) | 
|  |  | 
|  | - Could be split into | 
|  | separate s_waitcnt | 
|  | vmcnt(0) | 
|  | and s_waitcnt | 
|  | lgkmcnt(0) to allow | 
|  | them to be | 
|  | independently moved | 
|  | according to the | 
|  | following rules. | 
|  | - waitcnt lgkmcnt(0) | 
|  | must happen after | 
|  | preceding | 
|  | global/generic load | 
|  | atomic/store | 
|  | atomic/atomicrmw | 
|  | with memory | 
|  | ordering of seq_cst | 
|  | and with equal or | 
|  | wider sync scope. | 
|  | (Note that seq_cst | 
|  | fences have their | 
|  | own s_waitcnt | 
|  | lgkmcnt(0) and so do | 
|  | not need to be | 
|  | considered.) | 
|  | - waitcnt vmcnt(0) | 
|  | must happen after | 
|  | preceding | 
|  | global/generic load | 
|  | atomic/store | 
|  | atomic/atomicrmw | 
|  | with memory | 
|  | ordering of seq_cst | 
|  | and with equal or | 
|  | wider sync scope. | 
|  | (Note that seq_cst | 
|  | fences have their | 
|  | own s_waitcnt | 
|  | vmcnt(0) and so do | 
|  | not need to be | 
|  | considered.) | 
|  | - Ensures any | 
|  | preceding | 
|  | sequential | 
|  | consistent global | 
|  | memory instructions | 
|  | have completed | 
|  | before executing | 
|  | this sequentially | 
|  | consistent | 
|  | instruction. This | 
|  | prevents reordering | 
|  | a seq_cst store | 
|  | followed by a | 
|  | seq_cst load. (Note | 
|  | that seq_cst is | 
|  | stronger than | 
|  | acquire/release as | 
|  | the reordering of | 
|  | load acquire | 
|  | followed by a store | 
|  | release is | 
|  | prevented by the | 
|  | waitcnt of | 
|  | the release, but | 
|  | there is nothing | 
|  | preventing a store | 
|  | release followed by | 
|  | load acquire from | 
|  | competing out of | 
|  | order.) | 
|  |  | 
|  | 2. *Following | 
|  | instructions same as | 
|  | corresponding load | 
|  | atomic acquire, | 
|  | except must generated | 
|  | all instructions even | 
|  | for OpenCL.* | 
|  | store atomic seq_cst      - singlethread - global   *Same as corresponding | 
|  | - wavefront    - local    store atomic release, | 
|  | - workgroup    - generic  except must generated | 
|  | all instructions even | 
|  | for OpenCL.* | 
|  | store atomic seq_cst      - agent        - global   *Same as corresponding | 
|  | - system       - generic  store atomic release, | 
|  | except must generated | 
|  | all instructions even | 
|  | for OpenCL.* | 
|  | atomicrmw    seq_cst      - singlethread - global   *Same as corresponding | 
|  | - wavefront    - local    atomicrmw acq_rel, | 
|  | - workgroup    - generic  except must generated | 
|  | all instructions even | 
|  | for OpenCL.* | 
|  | atomicrmw    seq_cst      - agent        - global   *Same as corresponding | 
|  | - system       - generic  atomicrmw acq_rel, | 
|  | except must generated | 
|  | all instructions even | 
|  | for OpenCL.* | 
|  | fence        seq_cst      - singlethread *none*     *Same as corresponding | 
|  | - wavefront               fence acq_rel, | 
|  | - workgroup               except must generated | 
|  | - agent                   all instructions even | 
|  | - system                  for OpenCL.* | 
|  | ============ ============ ============== ========== =============================== | 
|  |  | 
|  | The memory order also adds the single thread optimization constrains defined in | 
|  | table | 
|  | :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`. | 
|  |  | 
|  | .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9 | 
|  | :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table | 
|  |  | 
|  | ============ ============================================================== | 
|  | LLVM Memory  Optimization Constraints | 
|  | Ordering | 
|  | ============ ============================================================== | 
|  | unordered    *none* | 
|  | monotonic    *none* | 
|  | acquire      - If a load atomic/atomicrmw then no following load/load | 
|  | atomic/store/ store atomic/atomicrmw/fence instruction can | 
|  | be moved before the acquire. | 
|  | - If a fence then same as load atomic, plus no preceding | 
|  | associated fence-paired-atomic can be moved after the fence. | 
|  | release      - If a store atomic/atomicrmw then no preceding load/load | 
|  | atomic/store/ store atomic/atomicrmw/fence instruction can | 
|  | be moved after the release. | 
|  | - If a fence then same as store atomic, plus no following | 
|  | associated fence-paired-atomic can be moved before the | 
|  | fence. | 
|  | acq_rel      Same constraints as both acquire and release. | 
|  | seq_cst      - If a load atomic then same constraints as acquire, plus no | 
|  | preceding sequentially consistent load atomic/store | 
|  | atomic/atomicrmw/fence instruction can be moved after the | 
|  | seq_cst. | 
|  | - If a store atomic then the same constraints as release, plus | 
|  | no following sequentially consistent load atomic/store | 
|  | atomic/atomicrmw/fence instruction can be moved before the | 
|  | seq_cst. | 
|  | - If an atomicrmw/fence then same constraints as acq_rel. | 
|  | ============ ============================================================== | 
|  |  | 
|  | Trap Handler ABI | 
|  | ~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes | 
|  | (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports | 
|  | the ``s_trap`` instruction with the following usage: | 
|  |  | 
|  | .. table:: AMDGPU Trap Handler for AMDHSA OS | 
|  | :name: amdgpu-trap-handler-for-amdhsa-os-table | 
|  |  | 
|  | =================== =============== =============== ======================= | 
|  | Usage               Code Sequence   Trap Handler    Description | 
|  | Inputs | 
|  | =================== =============== =============== ======================= | 
|  | reserved            ``s_trap 0x00``                 Reserved by hardware. | 
|  | ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA | 
|  | ``queue_ptr`` ``debugtrap`` | 
|  | ``VGPR0``:      intrinsic (not | 
|  | ``arg``       implemented). | 
|  | ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be | 
|  | ``queue_ptr`` terminated and its | 
|  | associated queue put | 
|  | into the error state. | 
|  | ``llvm.debugtrap``  ``s_trap 0x03``                 - If debugger not | 
|  | installed then | 
|  | behaves as a | 
|  | no-operation. The | 
|  | trap handler is | 
|  | entered and | 
|  | immediately returns | 
|  | to continue | 
|  | execution of the | 
|  | wavefront. | 
|  | - If the debugger is | 
|  | installed, causes | 
|  | the debug trap to be | 
|  | reported by the | 
|  | debugger and the | 
|  | wavefront is put in | 
|  | the halt state until | 
|  | resumed by the | 
|  | debugger. | 
|  | reserved            ``s_trap 0x04``                 Reserved. | 
|  | reserved            ``s_trap 0x05``                 Reserved. | 
|  | reserved            ``s_trap 0x06``                 Reserved. | 
|  | debugger breakpoint ``s_trap 0x07``                 Reserved for debugger | 
|  | breakpoints. | 
|  | reserved            ``s_trap 0x08``                 Reserved. | 
|  | reserved            ``s_trap 0xfe``                 Reserved. | 
|  | reserved            ``s_trap 0xff``                 Reserved. | 
|  | =================== =============== =============== ======================= | 
|  |  | 
|  | AMDPAL | 
|  | ------ | 
|  |  | 
|  | This section provides code conventions used when the target triple OS is | 
|  | ``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters | 
|  | from the application/runtime to each invocation of a hardware shader. These | 
|  | parameters include both generic, application-controlled parameters called | 
|  | *user data* as well as system-generated parameters that are a product of the | 
|  | draw or dispatch execution. | 
|  |  | 
|  | User Data | 
|  | ~~~~~~~~~ | 
|  |  | 
|  | Each hardware stage has a set of 32-bit *user data registers* which can be | 
|  | written from a command buffer and then loaded into SGPRs when waves are launched | 
|  | via a subsequent dispatch or draw operation. This is the way most arguments are | 
|  | passed from the application/runtime to a hardware shader. | 
|  |  | 
|  | Compute User Data | 
|  | ~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | Compute shader user data mappings are simpler than graphics shaders, and have a | 
|  | fixed mapping. | 
|  |  | 
|  | Note that there are always 10 available *user data entries* in registers - | 
|  | entries beyond that limit must be fetched from memory (via the spill table | 
|  | pointer) by the shader. | 
|  |  | 
|  | .. table:: PAL Compute Shader User Data Registers | 
|  | :name: pal-compute-user-data-registers | 
|  |  | 
|  | ============= ================================ | 
|  | User Register Description | 
|  | ============= ================================ | 
|  | 0             Global Internal Table (32-bit pointer) | 
|  | 1             Per-Shader Internal Table (32-bit pointer) | 
|  | 2 - 11        Application-Controlled User Data (10 32-bit values) | 
|  | 12            Spill Table (32-bit pointer) | 
|  | 13 - 14       Thread Group Count (64-bit pointer) | 
|  | 15            GDS Range | 
|  | ============= ================================ | 
|  |  | 
|  | Graphics User Data | 
|  | ~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | Graphics pipelines support a much more flexible user data mapping: | 
|  |  | 
|  | .. table:: PAL Graphics Shader User Data Registers | 
|  | :name: pal-graphics-user-data-registers | 
|  |  | 
|  | ============= ================================ | 
|  | User Register Description | 
|  | ============= ================================ | 
|  | 0             Global Internal Table (32-bit pointer) | 
|  | +             Per-Shader Internal Table (32-bit pointer) | 
|  | + 1-15        Application Controlled User Data | 
|  | (1-15 Contiguous 32-bit Values in Registers) | 
|  | +             Spill Table (32-bit pointer) | 
|  | +             Draw Index (First Stage Only) | 
|  | +             Vertex Offset (First Stage Only) | 
|  | +             Instance Offset (First Stage Only) | 
|  | ============= ================================ | 
|  |  | 
|  | The placement of the global internal table remains fixed in the first *user | 
|  | data SGPR register*. Otherwise all parameters are optional, and can be mapped | 
|  | to any desired *user data SGPR register*, with the following regstrictions: | 
|  |  | 
|  | * Draw Index, Vertex Offset, and Instance Offset can only be used by the first | 
|  | activehardware stage in a graphics pipeline (i.e. where the API vertex | 
|  | shader runs). | 
|  |  | 
|  | * Application-controlled user data must be mapped into a contiguous range of | 
|  | user data registers. | 
|  |  | 
|  | * The application-controlled user data range supports compaction remapping, so | 
|  | only *entries* that are actually consumed by the shader must be assigned to | 
|  | corresponding *registers*. Note that in order to support an efficient runtime | 
|  | implementation, the remapping must pack *registers* in the same order as | 
|  | *entries*, with unused *entries* removed. | 
|  |  | 
|  | .. _pal_global_internal_table: | 
|  |  | 
|  | Global Internal Table | 
|  | ~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | The global internal table is a table of *shader resource descriptors* (SRDs) that | 
|  | define how certain engine-wide, runtime-managed resources should be accessed | 
|  | from a shader. The majority of these resources have HW-defined formats, and it | 
|  | is up to the compiler to write/read data as required by the target hardware. | 
|  |  | 
|  | The following table illustrates the required format: | 
|  |  | 
|  | .. table:: PAL Global Internal Table | 
|  | :name: pal-git-table | 
|  |  | 
|  | ============= ================================ | 
|  | Offset        Description | 
|  | ============= ================================ | 
|  | 0-3           Graphics Scratch SRD | 
|  | 4-7           Compute Scratch SRD | 
|  | 8-11          ES/GS Ring Output SRD | 
|  | 12-15         ES/GS Ring Input SRD | 
|  | 16-19         GS/VS Ring Output #0 | 
|  | 20-23         GS/VS Ring Output #1 | 
|  | 24-27         GS/VS Ring Output #2 | 
|  | 28-31         GS/VS Ring Output #3 | 
|  | 32-35         GS/VS Ring Input SRD | 
|  | 36-39         Tessellation Factor Buffer SRD | 
|  | 40-43         Off-Chip LDS Buffer SRD | 
|  | 44-47         Off-Chip Param Cache Buffer SRD | 
|  | 48-51         Sample Position Buffer SRD | 
|  | 52            vaRange::ShadowDescriptorTable High Bits | 
|  | ============= ================================ | 
|  |  | 
|  | The pointer to the global internal table passed to the shader as user data | 
|  | is a 32-bit pointer. The top 32 bits should be assumed to be the same as | 
|  | the top 32 bits of the pipeline, so the shader may use the program | 
|  | counter's top 32 bits. | 
|  |  | 
|  | Unspecified OS | 
|  | -------------- | 
|  |  | 
|  | This section provides code conventions used when the target triple OS is | 
|  | empty (see :ref:`amdgpu-target-triples`). | 
|  |  | 
|  | Trap Handler ABI | 
|  | ~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does | 
|  | not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap`` | 
|  | instructions are handled as follows: | 
|  |  | 
|  | .. table:: AMDGPU Trap Handler for Non-AMDHSA OS | 
|  | :name: amdgpu-trap-handler-for-non-amdhsa-os-table | 
|  |  | 
|  | =============== =============== =========================================== | 
|  | Usage           Code Sequence   Description | 
|  | =============== =============== =========================================== | 
|  | llvm.trap       s_endpgm        Causes wavefront to be terminated. | 
|  | llvm.debugtrap  *none*          Compiler warning given that there is no | 
|  | trap handler installed. | 
|  | =============== =============== =========================================== | 
|  |  | 
|  | Source Languages | 
|  | ================ | 
|  |  | 
|  | .. _amdgpu-opencl: | 
|  |  | 
|  | OpenCL | 
|  | ------ | 
|  |  | 
|  | When the language is OpenCL the following differences occur: | 
|  |  | 
|  | 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`). | 
|  | 2. The AMDGPU backend appends additional arguments to the kernel's explicit | 
|  | arguments for the AMDHSA OS (see | 
|  | :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`). | 
|  | 3. Additional metadata is generated | 
|  | (see :ref:`amdgpu-amdhsa-code-object-metadata`). | 
|  |  | 
|  | .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS | 
|  | :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table | 
|  |  | 
|  | ======== ==== ========= =========================================== | 
|  | Position Byte Byte      Description | 
|  | Size Alignment | 
|  | ======== ==== ========= =========================================== | 
|  | 1        8    8         OpenCL Global Offset X | 
|  | 2        8    8         OpenCL Global Offset Y | 
|  | 3        8    8         OpenCL Global Offset Z | 
|  | 4        8    8         OpenCL address of printf buffer | 
|  | 5        8    8         OpenCL address of virtual queue used by | 
|  | enqueue_kernel. | 
|  | 6        8    8         OpenCL address of AqlWrap struct used by | 
|  | enqueue_kernel. | 
|  | ======== ==== ========= =========================================== | 
|  |  | 
|  | .. _amdgpu-hcc: | 
|  |  | 
|  | HCC | 
|  | --- | 
|  |  | 
|  | When the language is HCC the following differences occur: | 
|  |  | 
|  | 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`). | 
|  |  | 
|  | .. _amdgpu-assembler: | 
|  |  | 
|  | Assembler | 
|  | --------- | 
|  |  | 
|  | AMDGPU backend has LLVM-MC based assembler which is currently in development. | 
|  | It supports AMDGCN GFX6-GFX9. | 
|  |  | 
|  | This section describes general syntax for instructions and operands. | 
|  |  | 
|  | Instructions | 
|  | ~~~~~~~~~~~~ | 
|  |  | 
|  | .. toctree:: | 
|  | :hidden: | 
|  |  | 
|  | AMDGPU/AMDGPUAsmGFX7 | 
|  | AMDGPU/AMDGPUAsmGFX8 | 
|  | AMDGPU/AMDGPUAsmGFX9 | 
|  | AMDGPUModifierSyntax | 
|  | AMDGPUOperandSyntax | 
|  | AMDGPUInstructionSyntax | 
|  | AMDGPUInstructionNotation | 
|  |  | 
|  | An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`: | 
|  |  | 
|  | ``<``\ *opcode*\ ``>    <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...`` | 
|  |  | 
|  | :doc:`Operands<AMDGPUOperandSyntax>` are normally comma-separated while | 
|  | :doc:`modifiers<AMDGPUModifierSyntax>` are space-separated. | 
|  |  | 
|  | The order of *operands* and *modifiers* is fixed. | 
|  | Most *modifiers* are optional and may be omitted. | 
|  |  | 
|  | See detailed instruction syntax description for :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`, | 
|  | :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`. | 
|  |  | 
|  | Note that features under development are not included in this description. | 
|  |  | 
|  | For more information about instructions, their semantics and supported combinations of | 
|  | operands, refer to one of instruction set architecture manuals | 
|  | [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_. | 
|  |  | 
|  | Operands | 
|  | ~~~~~~~~ | 
|  |  | 
|  | Detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`. | 
|  |  | 
|  | Modifiers | 
|  | ~~~~~~~~~ | 
|  |  | 
|  | Detailed description of modifiers may be found :doc:`here<AMDGPUModifierSyntax>`. | 
|  |  | 
|  | Instruction Examples | 
|  | ~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | DS | 
|  | ++ | 
|  |  | 
|  | .. code-block:: nasm | 
|  |  | 
|  | ds_add_u32 v2, v4 offset:16 | 
|  | ds_write_src2_b64 v2 offset0:4 offset1:8 | 
|  | ds_cmpst_f32 v2, v4, v6 | 
|  | ds_min_rtn_f64 v[8:9], v2, v[4:5] | 
|  |  | 
|  |  | 
|  | For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual. | 
|  |  | 
|  | FLAT | 
|  | ++++ | 
|  |  | 
|  | .. code-block:: nasm | 
|  |  | 
|  | flat_load_dword v1, v[3:4] | 
|  | flat_store_dwordx3 v[3:4], v[5:7] | 
|  | flat_atomic_swap v1, v[3:4], v5 glc | 
|  | flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc | 
|  | flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc | 
|  |  | 
|  | For full list of supported instructions, refer to "FLAT instructions" in ISA Manual. | 
|  |  | 
|  | MUBUF | 
|  | +++++ | 
|  |  | 
|  | .. code-block:: nasm | 
|  |  | 
|  | buffer_load_dword v1, off, s[4:7], s1 | 
|  | buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe | 
|  | buffer_store_format_xy v[1:2], off, s[4:7], s1 | 
|  | buffer_wbinvl1 | 
|  | buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc | 
|  |  | 
|  | For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual. | 
|  |  | 
|  | SMRD/SMEM | 
|  | +++++++++ | 
|  |  | 
|  | .. code-block:: nasm | 
|  |  | 
|  | s_load_dword s1, s[2:3], 0xfc | 
|  | s_load_dwordx8 s[8:15], s[2:3], s4 | 
|  | s_load_dwordx16 s[88:103], s[2:3], s4 | 
|  | s_dcache_inv_vol | 
|  | s_memtime s[4:5] | 
|  |  | 
|  | For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual. | 
|  |  | 
|  | SOP1 | 
|  | ++++ | 
|  |  | 
|  | .. code-block:: nasm | 
|  |  | 
|  | s_mov_b32 s1, s2 | 
|  | s_mov_b64 s[0:1], 0x80000000 | 
|  | s_cmov_b32 s1, 200 | 
|  | s_wqm_b64 s[2:3], s[4:5] | 
|  | s_bcnt0_i32_b64 s1, s[2:3] | 
|  | s_swappc_b64 s[2:3], s[4:5] | 
|  | s_cbranch_join s[4:5] | 
|  |  | 
|  | For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual. | 
|  |  | 
|  | SOP2 | 
|  | ++++ | 
|  |  | 
|  | .. code-block:: nasm | 
|  |  | 
|  | s_add_u32 s1, s2, s3 | 
|  | s_and_b64 s[2:3], s[4:5], s[6:7] | 
|  | s_cselect_b32 s1, s2, s3 | 
|  | s_andn2_b32 s2, s4, s6 | 
|  | s_lshr_b64 s[2:3], s[4:5], s6 | 
|  | s_ashr_i32 s2, s4, s6 | 
|  | s_bfm_b64 s[2:3], s4, s6 | 
|  | s_bfe_i64 s[2:3], s[4:5], s6 | 
|  | s_cbranch_g_fork s[4:5], s[6:7] | 
|  |  | 
|  | For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual. | 
|  |  | 
|  | SOPC | 
|  | ++++ | 
|  |  | 
|  | .. code-block:: nasm | 
|  |  | 
|  | s_cmp_eq_i32 s1, s2 | 
|  | s_bitcmp1_b32 s1, s2 | 
|  | s_bitcmp0_b64 s[2:3], s4 | 
|  | s_setvskip s3, s5 | 
|  |  | 
|  | For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual. | 
|  |  | 
|  | SOPP | 
|  | ++++ | 
|  |  | 
|  | .. code-block:: nasm | 
|  |  | 
|  | s_barrier | 
|  | s_nop 2 | 
|  | s_endpgm | 
|  | s_waitcnt 0 ; Wait for all counters to be 0 | 
|  | s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above | 
|  | s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1. | 
|  | s_sethalt 9 | 
|  | s_sleep 10 | 
|  | s_sendmsg 0x1 | 
|  | s_sendmsg sendmsg(MSG_INTERRUPT) | 
|  | s_trap 1 | 
|  |  | 
|  | For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual. | 
|  |  | 
|  | Unless otherwise mentioned, little verification is performed on the operands | 
|  | of SOPP Instructions, so it is up to the programmer to be familiar with the | 
|  | range or acceptable values. | 
|  |  | 
|  | VALU | 
|  | ++++ | 
|  |  | 
|  | For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA), | 
|  | the assembler will automatically use optimal encoding based on its operands. | 
|  | To force specific encoding, one can add a suffix to the opcode of the instruction: | 
|  |  | 
|  | * _e32 for 32-bit VOP1/VOP2/VOPC | 
|  | * _e64 for 64-bit VOP3 | 
|  | * _dpp for VOP_DPP | 
|  | * _sdwa for VOP_SDWA | 
|  |  | 
|  | VOP1/VOP2/VOP3/VOPC examples: | 
|  |  | 
|  | .. code-block:: nasm | 
|  |  | 
|  | v_mov_b32 v1, v2 | 
|  | v_mov_b32_e32 v1, v2 | 
|  | v_nop | 
|  | v_cvt_f64_i32_e32 v[1:2], v2 | 
|  | v_floor_f32_e32 v1, v2 | 
|  | v_bfrev_b32_e32 v1, v2 | 
|  | v_add_f32_e32 v1, v2, v3 | 
|  | v_mul_i32_i24_e64 v1, v2, 3 | 
|  | v_mul_i32_i24_e32 v1, -3, v3 | 
|  | v_mul_i32_i24_e32 v1, -100, v3 | 
|  | v_addc_u32 v1, s[0:1], v2, v3, s[2:3] | 
|  | v_max_f16_e32 v1, v2, v3 | 
|  |  | 
|  | VOP_DPP examples: | 
|  |  | 
|  | .. code-block:: nasm | 
|  |  | 
|  | v_mov_b32 v0, v0 quad_perm:[0,2,1,1] | 
|  | v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 | 
|  | v_mov_b32 v0, v0 wave_shl:1 | 
|  | v_mov_b32 v0, v0 row_mirror | 
|  | v_mov_b32 v0, v0 row_bcast:31 | 
|  | v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0 | 
|  | v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 | 
|  | v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 | 
|  |  | 
|  | VOP_SDWA examples: | 
|  |  | 
|  | .. code-block:: nasm | 
|  |  | 
|  | v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD | 
|  | v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD | 
|  | v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1 | 
|  | v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 | 
|  | v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0 | 
|  |  | 
|  | For full list of supported instructions, refer to "Vector ALU instructions". | 
|  |  | 
|  | .. TODO | 
|  | Remove once we switch to code object v3 by default. | 
|  |  | 
|  | HSA Code Object Directives | 
|  | ~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | AMDGPU ABI defines auxiliary data in output code object. In assembly source, | 
|  | one can specify them with assembler directives. | 
|  |  | 
|  | .hsa_code_object_version major, minor | 
|  | +++++++++++++++++++++++++++++++++++++ | 
|  |  | 
|  | *major* and *minor* are integers that specify the version of the HSA code | 
|  | object that will be generated by the assembler. | 
|  |  | 
|  | .hsa_code_object_isa [major, minor, stepping, vendor, arch] | 
|  | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | 
|  |  | 
|  |  | 
|  | *major*, *minor*, and *stepping* are all integers that describe the instruction | 
|  | set architecture (ISA) version of the assembly program. | 
|  |  | 
|  | *vendor* and *arch* are quoted strings.  *vendor* should always be equal to | 
|  | "AMD" and *arch* should always be equal to "AMDGPU". | 
|  |  | 
|  | By default, the assembler will derive the ISA version, *vendor*, and *arch* | 
|  | from the value of the -mcpu option that is passed to the assembler. | 
|  |  | 
|  | .amdgpu_hsa_kernel (name) | 
|  | +++++++++++++++++++++++++ | 
|  |  | 
|  | This directives specifies that the symbol with given name is a kernel entry point | 
|  | (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL. | 
|  |  | 
|  | .amd_kernel_code_t | 
|  | ++++++++++++++++++ | 
|  |  | 
|  | This directive marks the beginning of a list of key / value pairs that are used | 
|  | to specify the amd_kernel_code_t object that will be emitted by the assembler. | 
|  | The list must be terminated by the *.end_amd_kernel_code_t* directive.  For | 
|  | any amd_kernel_code_t values that are unspecified a default value will be | 
|  | used.  The default value for all keys is 0, with the following exceptions: | 
|  |  | 
|  | - *kernel_code_version_major* defaults to 1. | 
|  | - *machine_kind* defaults to 1. | 
|  | - *machine_version_major*, *machine_version_minor*, and | 
|  | *machine_version_stepping* are derived from the value of the -mcpu option | 
|  | that is passed to the assembler. | 
|  | - *kernel_code_entry_byte_offset* defaults to 256. | 
|  | - *wavefront_size* defaults to 6. | 
|  | - *kernarg_segment_alignment*, *group_segment_alignment*, and | 
|  | *private_segment_alignment* default to 4. Note that alignments are specified | 
|  | as a power of 2, so a value of **n** means an alignment of 2^ **n**. | 
|  |  | 
|  | The *.amd_kernel_code_t* directive must be placed immediately after the | 
|  | function label and before any instructions. | 
|  |  | 
|  | For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document, | 
|  | comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s. | 
|  |  | 
|  | Here is an example of a minimal amd_kernel_code_t specification: | 
|  |  | 
|  | .. code-block:: none | 
|  |  | 
|  | .hsa_code_object_version 1,0 | 
|  | .hsa_code_object_isa | 
|  |  | 
|  | .hsatext | 
|  | .globl  hello_world | 
|  | .p2align 8 | 
|  | .amdgpu_hsa_kernel hello_world | 
|  |  | 
|  | hello_world: | 
|  |  | 
|  | .amd_kernel_code_t | 
|  | enable_sgpr_kernarg_segment_ptr = 1 | 
|  | is_ptr64 = 1 | 
|  | compute_pgm_rsrc1_vgprs = 0 | 
|  | compute_pgm_rsrc1_sgprs = 0 | 
|  | compute_pgm_rsrc2_user_sgpr = 2 | 
|  | kernarg_segment_byte_size = 8 | 
|  | wavefront_sgpr_count = 2 | 
|  | workitem_vgpr_count = 3 | 
|  | .end_amd_kernel_code_t | 
|  |  | 
|  | s_load_dwordx2 s[0:1], s[0:1] 0x0 | 
|  | v_mov_b32 v0, 3.14159 | 
|  | s_waitcnt lgkmcnt(0) | 
|  | v_mov_b32 v1, s0 | 
|  | v_mov_b32 v2, s1 | 
|  | flat_store_dword v[1:2], v0 | 
|  | s_endpgm | 
|  | .Lfunc_end0: | 
|  | .size   hello_world, .Lfunc_end0-hello_world | 
|  |  | 
|  | Predefined Symbols (-mattr=+code-object-v3) | 
|  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | The AMDGPU assembler defines and updates some symbols automatically. These | 
|  | symbols do not affect code generation. | 
|  |  | 
|  | .amdgcn.gfx_generation_number | 
|  | +++++++++++++++++++++++++++++ | 
|  |  | 
|  | Set to the GFX major generation number of the target being assembled for. For | 
|  | example, when assembling for a "GFX9" target this will be set to the integer | 
|  | value "9". The possible GFX major generation numbers are presented in | 
|  | :ref:`amdgpu-processors`. | 
|  |  | 
|  | .amdgcn.gfx_generation_minor | 
|  | ++++++++++++++++++++++++++++ | 
|  |  | 
|  | Set to the GFX minor generation number of the target being assembled for. For | 
|  | example, when assembling for a "GFX810" target this will be set to the integer | 
|  | value "1". The possible GFX minor generation numbers are presented in | 
|  | :ref:`amdgpu-processors`. | 
|  |  | 
|  | .amdgcn.gfx_generation_stepping | 
|  | +++++++++++++++++++++++++++++++ | 
|  |  | 
|  | Set to the GFX stepping generation number of the target being assembled for. | 
|  | For example, when assembling for a "GFX704" target this will be set to the | 
|  | integer value "4". The possible GFX stepping generation numbers are presented | 
|  | in :ref:`amdgpu-processors`. | 
|  |  | 
|  | .amdgcn.next_free_vgpr | 
|  | ++++++++++++++++++++++ | 
|  |  | 
|  | Set to zero before assembly begins. At each instruction, if the current value | 
|  | of this symbol is less than or equal to the maximum VGPR number explicitly | 
|  | referenced within that instruction then the symbol value is updated to equal | 
|  | that VGPR number plus one. | 
|  |  | 
|  | May be used to set the `.amdhsa_next_free_vpgr` directive in | 
|  | :ref:`amdhsa-kernel-directives-table`. | 
|  |  | 
|  | May be set at any time, e.g. manually set to zero at the start of each kernel. | 
|  |  | 
|  | .amdgcn.next_free_sgpr | 
|  | ++++++++++++++++++++++ | 
|  |  | 
|  | Set to zero before assembly begins. At each instruction, if the current value | 
|  | of this symbol is less than or equal the maximum SGPR number explicitly | 
|  | referenced within that instruction then the symbol value is updated to equal | 
|  | that SGPR number plus one. | 
|  |  | 
|  | May be used to set the `.amdhsa_next_free_spgr` directive in | 
|  | :ref:`amdhsa-kernel-directives-table`. | 
|  |  | 
|  | May be set at any time, e.g. manually set to zero at the start of each kernel. | 
|  |  | 
|  | Code Object Directives (-mattr=+code-object-v3) | 
|  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | Directives which begin with ``.amdgcn`` are valid for all ``amdgcn`` | 
|  | architecture processors, and are not OS-specific. Directives which begin with | 
|  | ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the | 
|  | ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and | 
|  | :ref:`amdgpu-processors`. | 
|  |  | 
|  | .amdgcn_target <target> | 
|  | +++++++++++++++++++++++ | 
|  |  | 
|  | Optional directive which declares the target supported by the containing | 
|  | assembler source file. Valid values are described in | 
|  | :ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler | 
|  | to validate command-line options such as ``-triple``, ``-mcpu``, and those | 
|  | which specify target features. | 
|  |  | 
|  | .amdhsa_kernel <name> | 
|  | +++++++++++++++++++++ | 
|  |  | 
|  | Creates a correctly aligned AMDHSA kernel descriptor and a symbol, | 
|  | ``<name>.kd``, in the current location of the current section. Only valid when | 
|  | the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first | 
|  | instruction to execute, and does not need to be previously defined. | 
|  |  | 
|  | Marks the beginning of a list of directives used to generate the bytes of a | 
|  | kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`. | 
|  | Directives which may appear in this list are described in | 
|  | :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must | 
|  | be valid for the target being assembled for, and cannot be repeated. Directives | 
|  | support the range of values specified by the field they reference in | 
|  | :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is | 
|  | assumed to have its default value, unless it is marked as "Required", in which | 
|  | case it is an error to omit the directive. This list of directives is | 
|  | terminated by an ``.end_amdhsa_kernel`` directive. | 
|  |  | 
|  | .. table:: AMDHSA Kernel Assembler Directives | 
|  | :name: amdhsa-kernel-directives-table | 
|  |  | 
|  | ======================================================== ================ ============ =================== | 
|  | Directive                                                Default          Supported On Description | 
|  | ======================================================== ================ ============ =================== | 
|  | ``.amdhsa_group_segment_fixed_size``                     0                GFX6-GFX9    Controls GROUP_SEGMENT_FIXED_SIZE in | 
|  | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. | 
|  | ``.amdhsa_private_segment_fixed_size``                   0                GFX6-GFX9    Controls PRIVATE_SEGMENT_FIXED_SIZE in | 
|  | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. | 
|  | ``.amdhsa_user_sgpr_private_segment_buffer``             0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in | 
|  | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. | 
|  | ``.amdhsa_user_sgpr_dispatch_ptr``                       0                GFX6-GFX9    Controls ENABLE_SGPR_DISPATCH_PTR in | 
|  | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. | 
|  | ``.amdhsa_user_sgpr_queue_ptr``                          0                GFX6-GFX9    Controls ENABLE_SGPR_QUEUE_PTR in | 
|  | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. | 
|  | ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                GFX6-GFX9    Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in | 
|  | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. | 
|  | ``.amdhsa_user_sgpr_dispatch_id``                        0                GFX6-GFX9    Controls ENABLE_SGPR_DISPATCH_ID in | 
|  | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. | 
|  | ``.amdhsa_user_sgpr_flat_scratch_init``                  0                GFX6-GFX9    Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in | 
|  | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. | 
|  | ``.amdhsa_user_sgpr_private_segment_size``               0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in | 
|  | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. | 
|  | ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. | 
|  | ``.amdhsa_system_sgpr_workgroup_id_x``                   1                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_X in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. | 
|  | ``.amdhsa_system_sgpr_workgroup_id_y``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_Y in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. | 
|  | ``.amdhsa_system_sgpr_workgroup_id_z``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_Z in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. | 
|  | ``.amdhsa_system_sgpr_workgroup_info``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_INFO in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. | 
|  | ``.amdhsa_system_vgpr_workitem_id``                      0                GFX6-GFX9    Controls ENABLE_VGPR_WORKITEM_ID in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. | 
|  | Possible values are defined in | 
|  | :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`. | 
|  | ``.amdhsa_next_free_vgpr``                               Required         GFX6-GFX9    Maximum VGPR number explicitly referenced, plus one. | 
|  | Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. | 
|  | ``.amdhsa_next_free_sgpr``                               Required         GFX6-GFX9    Maximum SGPR number explicitly referenced, plus one. | 
|  | Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. | 
|  | ``.amdhsa_reserve_vcc``                                  1                GFX6-GFX9    Whether the kernel may use the special VCC SGPR. | 
|  | Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. | 
|  | ``.amdhsa_reserve_flat_scratch``                         1                GFX7-GFX9    Whether the kernel may use flat instructions to access | 
|  | scratch memory. Used to calculate | 
|  | GRANULATED_WAVEFRONT_SGPR_COUNT in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. | 
|  | ``.amdhsa_reserve_xnack_mask``                           Target           GFX8-GFX9    Whether the kernel may trigger XNACK replay. | 
|  | Feature                       Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in | 
|  | Specific                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. | 
|  | (+xnack) | 
|  | ``.amdhsa_float_round_mode_32``                          0                GFX6-GFX9    Controls FLOAT_ROUND_MODE_32 in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. | 
|  | Possible values are defined in | 
|  | :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. | 
|  | ``.amdhsa_float_round_mode_16_64``                       0                GFX6-GFX9    Controls FLOAT_ROUND_MODE_16_64 in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. | 
|  | Possible values are defined in | 
|  | :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. | 
|  | ``.amdhsa_float_denorm_mode_32``                         0                GFX6-GFX9    Controls FLOAT_DENORM_MODE_32 in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. | 
|  | Possible values are defined in | 
|  | :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. | 
|  | ``.amdhsa_float_denorm_mode_16_64``                      3                GFX6-GFX9    Controls FLOAT_DENORM_MODE_16_64 in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. | 
|  | Possible values are defined in | 
|  | :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. | 
|  | ``.amdhsa_dx10_clamp``                                   1                GFX6-GFX9    Controls ENABLE_DX10_CLAMP in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. | 
|  | ``.amdhsa_ieee_mode``                                    1                GFX6-GFX9    Controls ENABLE_IEEE_MODE in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. | 
|  | ``.amdhsa_fp16_overflow``                                0                GFX9         Controls FP16_OVFL in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. | 
|  | ``.amdhsa_exception_fp_ieee_invalid_op``                 0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. | 
|  | ``.amdhsa_exception_fp_denorm_src``                      0                GFX6-GFX9    Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. | 
|  | ``.amdhsa_exception_fp_ieee_div_zero``                   0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. | 
|  | ``.amdhsa_exception_fp_ieee_overflow``                   0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. | 
|  | ``.amdhsa_exception_fp_ieee_underflow``                  0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. | 
|  | ``.amdhsa_exception_fp_ieee_inexact``                    0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. | 
|  | ``.amdhsa_exception_int_div_zero``                       0                GFX6-GFX9    Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in | 
|  | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. | 
|  | ======================================================== ================ ============ =================== | 
|  |  | 
|  | .amdgpu_metadata | 
|  | ++++++++++++++++ | 
|  |  | 
|  | Optional directive which declares the contents of the ``NT_AMDGPU_METADATA`` | 
|  | note record (see :ref:`amdgpu-elf-note-records-table-v3`). | 
|  |  | 
|  | The contents must be in the [YAML]_ markup format, with the same structure and | 
|  | semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`. | 
|  |  | 
|  | This directive is terminated by an ``.end_amdgpu_metadata`` directive. | 
|  |  | 
|  | Example HSA Source Code (-mattr=+code-object-v3) | 
|  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | Here is an example of a minimal assembly source file, defining one HSA kernel: | 
|  |  | 
|  | .. code-block:: none | 
|  |  | 
|  | .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional | 
|  |  | 
|  | .text | 
|  | .globl hello_world | 
|  | .p2align 8 | 
|  | .type hello_world,@function | 
|  | hello_world: | 
|  | s_load_dwordx2 s[0:1], s[0:1] 0x0 | 
|  | v_mov_b32 v0, 3.14159 | 
|  | s_waitcnt lgkmcnt(0) | 
|  | v_mov_b32 v1, s0 | 
|  | v_mov_b32 v2, s1 | 
|  | flat_store_dword v[1:2], v0 | 
|  | s_endpgm | 
|  | .Lfunc_end0: | 
|  | .size   hello_world, .Lfunc_end0-hello_world | 
|  |  | 
|  | .rodata | 
|  | .p2align 6 | 
|  | .amdhsa_kernel hello_world | 
|  | .amdhsa_user_sgpr_kernarg_segment_ptr 1 | 
|  | .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr | 
|  | .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr | 
|  | .end_amdhsa_kernel | 
|  |  | 
|  | .amdgpu_metadata | 
|  | --- | 
|  | amdhsa.version: | 
|  | - 1 | 
|  | - 0 | 
|  | amdhsa.kernels: | 
|  | - .name: hello_world | 
|  | .symbol: hello_world.kd | 
|  | .kernarg_segment_size: 48 | 
|  | .group_segment_fixed_size: 0 | 
|  | .private_segment_fixed_size: 0 | 
|  | .kernarg_segment_align: 4 | 
|  | .wavefront_size: 64 | 
|  | .sgpr_count: 2 | 
|  | .vgpr_count: 3 | 
|  | .max_flat_workgroup_size: 256 | 
|  | ... | 
|  | .end_amdgpu_metadata | 
|  |  | 
|  | Additional Documentation | 
|  | ======================== | 
|  |  | 
|  | .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__ | 
|  | .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__ | 
|  | .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__ | 
|  | .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__ | 
|  | .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__ | 
|  | .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_ | 
|  | .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__ | 
|  | .. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__ | 
|  | .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__ | 
|  | .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__ | 
|  | .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__ | 
|  | .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__ | 
|  | .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__ | 
|  | .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__ | 
|  | .. [MsgPack] `Message Pack <http://www.msgpack.org/>`__ | 
|  | .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__ | 
|  | .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__ | 
|  | .. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__ |