==============================
User Guide for AMDGPU Back-end
==============================

Introduction
============

The AMDGPU back-end provides ISA code generation for AMD GPUs, starting with
the R600 family up until the current Volcanic Islands (GCN Gen 3).

Refer to `AMDGPU section in Architecture & Platform Information for Compiler Writers <CompilerWriterInfo.html#amdgpu>`_
for additional documentation.

Conventions
===========

Address Spaces
--------------

The AMDGPU back-end uses the following address space mapping:

================== =================== ==============
LLVM Address Space DWARF Address Space Memory Space
================== =================== ==============
0                  1                   Private
1                  N/A                 Global
2                  N/A                 Constant
3                  2                   Local
4                  N/A                 Generic (Flat)
5                  N/A                 Region
================== =================== ==============

The terminology in the table, aside from the region memory space, is from the
OpenCL standard.

LLVM Address Space is used throughout LLVM (for example, in LLVM IR). DWARF
Address Space is emitted in DWARF, and is used by tools such as debuggers and
profilers.

Trap Handler ABI
----------------
The OS element of the target triple controls the trap handler behavior.

HSA OS
^^^^^^
For code objects generated by the AMDGPU back-end for the HSA OS, the runtime
installs a trap handler that supports the s_trap instruction with the following
usage:

+--------------+-------------+-------------------+----------------------------+
|Usage         |Code Sequence|Trap Handler Inputs|Description                 |
+==============+=============+===================+============================+
|reserved      |s_trap 0x00  |                   |Reserved by hardware.       |
+--------------+-------------+-------------------+----------------------------+
|HSA debugtrap |s_trap 0x01  |SGPR0-1: queue_ptr |Reserved for HSA debugtrap  |
|(arg)         |             |VGPR0: arg         |intrinsic (not implemented).|
+--------------+-------------+-------------------+----------------------------+
|llvm.trap     |s_trap 0x02  |SGPR0-1: queue_ptr |Causes dispatch to be       |
|              |             |                   |terminated and its          |
|              |             |                   |associated queue put into   |
|              |             |                   |the error state.            |
+--------------+-------------+-------------------+----------------------------+
|llvm.debugtrap|s_trap 0x03  |SGPR0-1: queue_ptr |If no debugger is installed,|
|              |             |                   |handled same as llvm.trap.  |
+--------------+-------------+-------------------+----------------------------+
|debugger      |s_trap 0x07  |                   |Reserved for debugger       |
|breakpoint    |             |                   |breakpoints.                |
+--------------+-------------+-------------------+----------------------------+
|debugger      |s_trap 0x08  |                   |Reserved for debugger.      |
+--------------+-------------+-------------------+----------------------------+
|debugger      |s_trap 0xfe  |                   |Reserved for debugger.      |
+--------------+-------------+-------------------+----------------------------+
|debugger      |s_trap 0xff  |                   |Reserved for debugger.      |
+--------------+-------------+-------------------+----------------------------+

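As a rough sketch of the llvm.trap row above, the queue pointer is placed in
SGPR0-1 before the trap is raised. The source registers s[4:5] are a
hypothetical location for the queue pointer, chosen only for this example:

.. code-block:: nasm

  s_mov_b64 s[0:1], s[4:5] ; SGPR0-1: queue_ptr (s[4:5] is an assumed source)
  s_trap 0x02              ; llvm.trap
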
Non-HSA OS
^^^^^^^^^^
For code objects generated by the AMDGPU back-end for a non-HSA OS, the runtime
does not install a trap handler. The llvm.trap and llvm.debugtrap intrinsics
are handled as follows:

=============== ============= ===============================================
Usage           Code Sequence Description
=============== ============= ===============================================
llvm.trap       s_endpgm      Causes wavefront to be terminated.
llvm.debugtrap  s_nop         No operation. Compiler warning generated that
                              there is no trap handler installed.
=============== ============= ===============================================

Assembler
=========

The AMDGPU back-end has an LLVM-MC based assembler, which is currently in
development. It supports the Southern Islands, Sea Islands, and Volcanic
Islands ISAs.

This document describes the general syntax for instructions and operands. For
more information about instructions, their semantics, and supported
combinations of operands, refer to one of the Instruction Set Architecture
manuals.

An instruction has the following syntax (register operands are
normally comma-separated while extra operands are space-separated):

*<opcode> <register_operand0>, ... <extra_operand0> ...*


Operands
--------

The following syntax for register operands is supported:

* SGPR registers: s0, ... or s[0], ...
* VGPR registers: v0, ... or v[0], ...
* TTMP registers: ttmp0, ... or ttmp[0], ...
* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
* Register index expressions: v[2*2], s[1-1:2-1]
* 'off' indicates that an operand is not enabled

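As an illustration of the register syntax above, the following sketch pairs up
spellings that, under the rules listed, should name the same registers. The
opcodes and register numbers are arbitrary choices made for this example:

.. code-block:: nasm

  s_mov_b32 s0, s1                      ; plain register names
  s_mov_b32 s[0], s[1]                  ; same registers, bracketed form
  v_mov_b32 v[2*2], v0                  ; index expression: v[2*2] is v4
  s_mov_b64 s[1-1:2-1], s[2:3]          ; index expressions: s[1-1:2-1] is s[0:1]
  s_mov_b64 [s0, s1], s[2:3]            ; register list: [s0, s1] is s[0:1]
  buffer_load_dword v1, off, s[4:7], s1 ; 'off': the address operand is not enabled
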
The following extra operands are supported:

* offset, offset0, offset1
* idxen, offen bits
* glc, slc, tfe bits
* waitcnt: integer or combination of counter values
* VOP3 modifiers:

  - abs (\| \|), neg (\-)

* DPP modifiers:

  - row_shl, row_shr, row_ror, row_rol
  - row_mirror, row_half_mirror, row_bcast
  - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
  - row_mask, bank_mask, bound_ctrl

* SDWA modifiers:

  - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
  - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
  - abs, neg, sext

DS Instruction Examples
-----------------------

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]


For a full list of supported instructions, refer to "LDS/GDS instructions" in the ISA Manual.

FLAT Instruction Examples
-------------------------

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in the ISA Manual.

MUBUF Instruction Examples
--------------------------

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in the ISA Manual.

SMRD/SMEM Instruction Examples
------------------------------

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations" in the ISA Manual.

SOP1 Instruction Examples
-------------------------

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in the ISA Manual.

SOP2 Instruction Examples
-------------------------

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in the ISA Manual.

SOPC Instruction Examples
-------------------------

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in the ISA Manual.

SOPP Instruction Examples
-------------------------

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

Vector ALU Instruction Examples
-------------------------------

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically selects the optimal encoding based on the operands.
To force a specific encoding, add a suffix to the opcode of the instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA

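As a minimal sketch of the suffixes listed above, each line below pins one
instruction to a specific encoding. The operands are illustrative only, and
the DPP and SDWA lines assume a target that provides those encodings:

.. code-block:: nasm

  v_add_f32_e32 v1, v2, v3         ; force the 32-bit VOP2 encoding
  v_add_f32_e64 v1, v2, v3         ; force the 64-bit VOP3 encoding
  v_mov_b32_dpp v0, v0 row_mirror  ; force the VOP_DPP encoding
  v_mov_b32_sdwa v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; force VOP_SDWA
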
VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For a full list of supported instructions, refer to "Vector ALU instructions" in the ISA Manual.

HSA Code Object Directives
--------------------------

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.

.hsa_code_object_version major, minor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings.  *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

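For illustration, an explicit form of the directive might look like the
following. The version numbers 7, 0, 0 are placeholder values picked for this
sketch; they are not taken from this document and should be replaced with the
ISA version of the actual target:

.. code-block:: none

   .hsa_code_object_isa 7,0,0,"AMD","AMDGPU"
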
.amdgpu_hsa_kernel (name)
^^^^^^^^^^^^^^^^^^^^^^^^^

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
^^^^^^^^^^^^^^^^^^

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
any amd_kernel_code_t values that are unspecified, a default value will be
used.  The default value for all keys is 0, with the following exceptions:

- *kernel_code_version_major* defaults to 1.
- *machine_kind* defaults to 1.
- *machine_version_major*, *machine_version_minor*, and
  *machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4.  Note that alignments are specified
  as a power of two, so a value of **n** means an alignment of 2^ **n**.

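As a worked example of the power-of-two encoding, the fragment below requests
a 64-byte (2^6) kernarg segment alignment and spells out the default of
16 bytes (2^4) for the other two alignments. The value 6 is an arbitrary choice
for illustration, and the fragment omits the surrounding kernel label:

.. code-block:: none

   .amd_kernel_code_t
      kernarg_segment_alignment = 6
      group_segment_alignment = 4
      private_segment_alignment = 4
   .end_amd_kernel_code_t
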
The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
test/CodeGen/AMDGPU/hsa.s.

Here is an example of a minimal amd_kernel_code_t specification:

.. code-block:: none

   .hsa_code_object_version 1,0
   .hsa_code_object_isa

   .hsatext
   .globl  hello_world
   .p2align 8
   .amdgpu_hsa_kernel hello_world

   hello_world:

      .amd_kernel_code_t
         enable_sgpr_kernarg_segment_ptr = 1
         is_ptr64 = 1
         compute_pgm_rsrc1_vgprs = 0
         compute_pgm_rsrc1_sgprs = 0
         compute_pgm_rsrc2_user_sgpr = 2
         kernarg_segment_byte_size = 8
         wavefront_sgpr_count = 2
         workitem_vgpr_count = 3
      .end_amd_kernel_code_t

      s_load_dwordx2 s[0:1], s[0:1], 0x0
      v_mov_b32 v0, 3.14159
      s_waitcnt lgkmcnt(0)
      v_mov_b32 v1, s0
      v_mov_b32 v2, s1
      flat_store_dword v[1:2], v0
      s_endpgm
   .Lfunc_end0:
      .size   hello_world, .Lfunc_end0-hello_world