| Eugene Zelenko | 3507b04 | 2018-03-21 17:09:35 +0000 | [diff] [blame] | 1 | ============================= | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2 | User Guide for AMDGPU Backend | 
|  | 3 | ============================= | 
|  | 4 |  | 
|  | 5 | .. contents:: | 
|  | 6 | :local: | 
| Tom Stellard | 45bb48e | 2015-06-13 03:28:10 +0000 | [diff] [blame] | 7 |  | 
|  | 8 | Introduction | 
|  | 9 | ============ | 
|  | 10 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 11 | The AMDGPU backend provides ISA code generation for AMD GPUs, from the | 
|  | 12 | R600 family through the current GCN families. It lives in the | 
|  | 13 | ``lib/Target/AMDGPU`` directory. | 
| Tom Stellard | 45bb48e | 2015-06-13 03:28:10 +0000 | [diff] [blame] | 14 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 15 | LLVM | 
|  | 16 | ==== | 
| Tom Stellard | 45bb48e | 2015-06-13 03:28:10 +0000 | [diff] [blame] | 17 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 18 | .. _amdgpu-target-triples: | 
|  | 19 |  | 
|  | 20 | Target Triples | 
|  | 21 | -------------- | 
|  | 22 |  | 
|  | 23 | Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to | 
|  | 24 | specify the target triple: | 
|  | 25 |  | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 26 | .. table:: AMDGPU Architectures | 
|  | 27 | :name: amdgpu-architecture-table | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 28 |  | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 29 | ============ ============================================================== | 
|  | 30 | Architecture Description | 
|  | 31 | ============ ============================================================== | 
|  | 32 | ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders. | 
|  | 33 | ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders. | 
|  | 34 | ============ ============================================================== | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 35 |  | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 36 | .. table:: AMDGPU Vendors | 
|  | 37 | :name: amdgpu-vendor-table | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 38 |  | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 39 | ============ ============================================================== | 
|  | 40 | Vendor       Description | 
|  | 41 | ============ ============================================================== | 
|  | 42 | ``amd``      Can be used for all AMD GPU usage. | 
|  | 43 | ``mesa3d``   Can be used if the OS is ``mesa3d``. | 
|  | 44 | ============ ============================================================== | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 45 |  | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 46 | .. table:: AMDGPU Operating Systems | 
|  | 47 | :name: amdgpu-os-table | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 48 |  | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 49 | ============== ============================================================ | 
|  | 50 | OS             Description | 
|  | 51 | ============== ============================================================ | 
|  | 52 | *<empty>*      Defaults to the *unknown* OS. | 
|  | 53 | ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes | 
|  | 54 |                such as AMD's ROCm [AMD-ROCm]_. | 
|  | 55 | ``amdpal``     Graphics shaders and compute kernels executed on AMD PAL | 
|  | 56 |                runtime. | 
|  | 57 | ``mesa3d``     Graphics shaders and compute kernels executed on Mesa 3D | 
|  | 58 |                runtime. | 
|  | 59 | ============== ============================================================ | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 60 |  | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 61 | .. table:: AMDGPU Environments | 
|  | 62 | :name: amdgpu-environment-table | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 63 |  | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 64 | ============ ============================================================== | 
|  | 65 | Environment  Description | 
|  | 66 | ============ ============================================================== | 
| Tony Tye | 7a893d4 | 2018-03-23 18:45:18 +0000 | [diff] [blame] | 67 | *<empty>*    Default. | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 68 | ============ ============================================================== | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 69 |  | 
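The four tables above can be read as a small validation rule set. The following is a minimal sketch, assuming a hypothetical helper named ``make_triple`` (it is not part of LLVM or clang); it only encodes the values listed in the tables and drops trailing empty components.

```python
# Allowed values copied from the architecture, vendor and OS tables above.
# The environment table lists only the empty default, so any string is passed through.
ARCHITECTURES = {"r600", "amdgcn"}
VENDORS = {"amd", "mesa3d"}
OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}


def make_triple(arch, vendor="amd", os="", environment=""):
    """Build <Architecture>-<Vendor>-<OS>-<Environment> (hypothetical helper)."""
    if arch not in ARCHITECTURES:
        raise ValueError(f"unknown architecture: {arch!r}")
    if vendor not in VENDORS:
        raise ValueError(f"unknown vendor: {vendor!r}")
    if os not in OPERATING_SYSTEMS:
        raise ValueError(f"unknown OS: {os!r}")
    # The vendor table lists mesa3d only for use with the mesa3d OS.
    if vendor == "mesa3d" and os != "mesa3d":
        raise ValueError("vendor 'mesa3d' requires OS 'mesa3d'")
    # Trailing empty components are dropped rather than left as dangling dashes.
    return "-".join([arch, vendor, os, environment]).rstrip("-")


print(make_triple("amdgcn", os="amdhsa"))
```

With these assumptions, the call above prints ``amdgcn-amd-amdhsa``, which can then be passed to ``clang -target``.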
|  | 70 | .. _amdgpu-processors: | 
|  | 71 |  | 
|  | 72 | Processors | 
|  | 73 | ---------- | 
|  | 74 |  | 
|  | 75 | Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The | 
|  | 76 | names from both the *Processor* and *Alternative Processor* columns can be used. | 
|  | 77 |  | 
|  | 78 | .. table:: AMDGPU Processors | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 79 | :name: amdgpu-processor-table | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 80 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 81 | =========== =============== ============ ===== ================= ======= ====================== | 
|  | 82 | Processor   Alternative     Target       dGPU/ Target            ROCm    Example | 
|  | 83 |             Processor       Triple       APU   Features          Support Products | 
| Tony Tye | 31105cc | 2017-12-11 15:35:27 +0000 | [diff] [blame] | 84 |                             Architecture       Supported | 
|  | 85 |                                                [Default] | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 86 | =========== =============== ============ ===== ================= ======= ====================== | 
| Konstantin Zhuravlyov | 265d253 | 2017-10-18 17:59:20 +0000 | [diff] [blame] | 87 | **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_ | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 88 | ----------------------------------------------------------------------------------------------- | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 89 | ``r600``                    ``r600``     dGPU | 
|  | 90 | ``r630``                    ``r600``     dGPU | 
|  | 91 | ``rs880``                   ``r600``     dGPU | 
|  | 92 | ``rv670``                   ``r600``     dGPU | 
| Konstantin Zhuravlyov | 265d253 | 2017-10-18 17:59:20 +0000 | [diff] [blame] | 93 | **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_ | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 94 | ----------------------------------------------------------------------------------------------- | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 95 | ``rv710``                   ``r600``     dGPU | 
|  | 96 | ``rv730``                   ``r600``     dGPU | 
|  | 97 | ``rv770``                   ``r600``     dGPU | 
| Konstantin Zhuravlyov | 265d253 | 2017-10-18 17:59:20 +0000 | [diff] [blame] | 98 | **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_ | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 99 | ----------------------------------------------------------------------------------------------- | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 100 | ``cedar``                   ``r600``     dGPU | 
| Konstantin Zhuravlyov | 9122a63 | 2018-02-16 22:33:59 +0000 | [diff] [blame] | 101 | ``cypress``                 ``r600``     dGPU | 
|  | 102 | ``juniper``                 ``r600``     dGPU | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 103 | ``redwood``                 ``r600``     dGPU | 
|  | 104 | ``sumo``                    ``r600``     dGPU | 
| Konstantin Zhuravlyov | 265d253 | 2017-10-18 17:59:20 +0000 | [diff] [blame] | 105 | **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_ | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 106 | ----------------------------------------------------------------------------------------------- | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 107 | ``barts``                   ``r600``     dGPU | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 108 | ``caicos``                  ``r600``     dGPU | 
|  | 109 | ``cayman``                  ``r600``     dGPU | 
| Konstantin Zhuravlyov | 9122a63 | 2018-02-16 22:33:59 +0000 | [diff] [blame] | 110 | ``turks``                   ``r600``     dGPU | 
| Konstantin Zhuravlyov | 265d253 | 2017-10-18 17:59:20 +0000 | [diff] [blame] | 111 | **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_ | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 112 | ----------------------------------------------------------------------------------------------- | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 113 | ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU | 
| Konstantin Zhuravlyov | 9122a63 | 2018-02-16 22:33:59 +0000 | [diff] [blame] | 114 | ``gfx601``  - ``hainan``    ``amdgcn``   dGPU | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 115 | - ``oland`` | 
| Konstantin Zhuravlyov | 9122a63 | 2018-02-16 22:33:59 +0000 | [diff] [blame] | 116 | - ``pitcairn`` | 
|  | 117 | - ``verde`` | 
| Konstantin Zhuravlyov | 265d253 | 2017-10-18 17:59:20 +0000 | [diff] [blame] | 118 | **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_ | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 119 | ----------------------------------------------------------------------------------------------- | 
|  | 120 | ``gfx700``  - ``kaveri``    ``amdgcn``   APU                             - A6-7000 | 
|  | 121 | - A6 Pro-7050B | 
|  | 122 | - A8-7100 | 
|  | 123 | - A8 Pro-7150B | 
|  | 124 | - A10-7300 | 
|  | 125 | - A10 Pro-7350B | 
|  | 126 | - FX-7500 | 
|  | 127 | - A8-7200P | 
|  | 128 | - A10-7400P | 
|  | 129 | - FX-7600P | 
|  | 130 | ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    ROCm    - FirePro W8100 | 
|  | 131 | - FirePro W9100 | 
|  | 132 | - FirePro S9150 | 
|  | 133 | - FirePro S9170 | 
|  | 134 | ``gfx702``                  ``amdgcn``   dGPU                    ROCm    - Radeon R9 290 | 
|  | 135 | - Radeon R9 290x | 
|  | 136 | - Radeon R390 | 
|  | 137 | - Radeon R390x | 
|  | 138 | ``gfx703``  - ``kabini``    ``amdgcn``   APU                             - E1-2100 | 
|  | 139 | - ``mullins``                                                - E1-2200 | 
|  | 140 | - E1-2500 | 
|  | 141 | - E2-3000 | 
|  | 142 | - E2-3800 | 
|  | 143 | - A4-5000 | 
|  | 144 | - A4-5100 | 
|  | 145 | - A6-5200 | 
|  | 146 | - A4 Pro-3340B | 
|  | 147 | ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                            - Radeon HD 7790 | 
|  | 148 | - Radeon HD 8770 | 
|  | 149 | - R7 260 | 
|  | 150 | - R7 260X | 
| Konstantin Zhuravlyov | 265d253 | 2017-10-18 17:59:20 +0000 | [diff] [blame] | 151 | **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_ | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 152 | ----------------------------------------------------------------------------------------------- | 
|  | 153 | ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack                   - A6-8500P | 
|  | 154 | [on]                    - Pro A6-8500B | 
|  | 155 | - A8-8600P | 
|  | 156 | - Pro A8-8600B | 
|  | 157 | - FX-8800P | 
|  | 158 | - Pro A12-8800B | 
|  | 159 | \                           ``amdgcn``   APU   - xnack           ROCm    - A10-8700P | 
|  | 160 | [on]                    - Pro A10-8700B | 
|  | 161 | - A10-8780P | 
|  | 162 | \                           ``amdgcn``   APU   - xnack                   - A10-9600P | 
|  | 163 | [on]                    - A10-9630P | 
|  | 164 | - A12-9700P | 
|  | 165 | - A12-9730P | 
|  | 166 | - FX-9800P | 
|  | 167 | - FX-9830P | 
|  | 168 | \                           ``amdgcn``   APU   - xnack                   - E2-9010 | 
|  | 169 | [on]                    - A6-9210 | 
|  | 170 | - A9-9410 | 
|  | 171 | ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack           ROCm    - FirePro S7150 | 
|  | 172 | - ``tonga``                          [off]                   - FirePro S7100 | 
|  | 173 | - FirePro W7100 | 
|  | 174 | - Radeon R285 | 
|  | 175 | - Radeon R9 380 | 
|  | 176 | - Radeon R9 385 | 
|  | 177 | - Mobile FirePro | 
|  | 178 | M7170 | 
|  | 179 | ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack           ROCm    - Radeon R9 Nano | 
|  | 180 | [off]                   - Radeon R9 Fury | 
|  | 181 | - Radeon R9 FuryX | 
|  | 182 | - Radeon Pro Duo | 
|  | 183 | - FirePro S9300x2 | 
|  | 184 | - Radeon Instinct MI8 | 
|  | 185 | \           - ``polaris10`` ``amdgcn``   dGPU  - xnack           ROCm    - Radeon RX 470 | 
|  | 186 | [off]                   - Radeon RX 480 | 
|  | 187 | - Radeon Instinct MI6 | 
|  | 188 | \           - ``polaris11`` ``amdgcn``   dGPU  - xnack           ROCm    - Radeon RX 460 | 
| Tony Tye | 31105cc | 2017-12-11 15:35:27 +0000 | [diff] [blame] | 189 | [off] | 
|  | 190 | ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack | 
|  | 191 | [on] | 
| Konstantin Zhuravlyov | 265d253 | 2017-10-18 17:59:20 +0000 | [diff] [blame] | 192 | **GCN GFX9** [AMD-GCN-GFX9]_ | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 193 | ----------------------------------------------------------------------------------------------- | 
|  | 194 | ``gfx900``                  ``amdgcn``   dGPU  - xnack           ROCm    - Radeon Vega | 
|  | 195 | [off]                     Frontier Edition | 
|  | 196 | - Radeon RX Vega 56 | 
|  | 197 | - Radeon RX Vega 64 | 
|  | 198 | - Radeon RX Vega 64 | 
|  | 199 | Liquid | 
|  | 200 | - Radeon Instinct MI25 | 
|  | 201 | ``gfx902``                  ``amdgcn``   APU   - xnack                   - Ryzen 3 2200G | 
|  | 202 | [on]                    - Ryzen 5 2400G | 
|  | 203 | ``gfx904``                  ``amdgcn``   dGPU  - xnack                   *TBA* | 
| Matt Arsenault | 0084adc | 2018-04-30 19:08:16 +0000 | [diff] [blame] | 204 | [off] | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 205 | .. TODO | 
|  | 206 | Add product | 
|  | 207 | names. | 
|  | 208 | ``gfx906``                  ``amdgcn``   dGPU  - xnack                   - Radeon Instinct MI50 | 
|  | 209 | [off]                   - Radeon Instinct MI60 | 
|  | 210 | ``gfx909``                  ``amdgcn``   APU   - xnack                   *TBA* (Raven Ridge 2) | 
| Tim Renouf | 2a1b1d9 | 2018-10-24 08:14:07 +0000 | [diff] [blame] | 211 | [on] | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 212 | .. TODO | 
|  | 213 | Add product | 
|  | 214 | names. | 
|  | 215 | **GCN GFX10** [AMD-GCN-GFX10]_ | 
|  | 216 | ----------------------------------------------------------------------------------------------- | 
|  | 217 | ``gfx1010``                 ``amdgcn``   dGPU  - xnack                   *TBA* | 
|  | 218 | [off] | 
|  | 219 | - wavefrontsize64 | 
|  | 220 | [off] | 
|  | 221 | - cumode | 
|  | 222 | [off] | 
|  | 223 | .. TODO | 
|  | 224 | Add product | 
|  | 225 | names. | 
|  | 226 | ``gfx1011``                 ``amdgcn``   dGPU  - xnack                   *TBA* | 
|  | 227 | [off] | 
|  | 228 | - wavefrontsize64 | 
|  | 229 | [off] | 
|  | 230 | - cumode | 
|  | 231 | [off] | 
|  | 232 | .. TODO | 
|  | 233 | Add product | 
|  | 234 | names. | 
|  | 235 | ``gfx1012``                 ``amdgcn``   dGPU  - xnack                   *TBA* | 
|  | 236 | [off] | 
|  | 237 | - wavefrontsize64 | 
|  | 238 | [off] | 
|  | 239 | - cumode | 
|  | 240 | [off] | 
|  | 241 | .. TODO | 
|  | 242 | Add product | 
|  | 243 | names. | 
|  | 244 | =========== =============== ============ ===== ================= ======= ====================== | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 245 |  | 
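Because either column of the processor table may be given to ``-mcpu``, tooling that compares processors may want to reduce both forms to one name. The sketch below is illustrative only: ``PROCESSOR_ALIASES`` and ``canonical_processor`` are hypothetical names, and the mapping is transcribed from the GFX6 to GFX8 rows of the table above.

```python
# Alternative processor names and the gfx name listed beside them in the
# AMDGPU Processors table. Illustrative data, not an LLVM or clang API.
PROCESSOR_ALIASES = {
    "tahiti": "gfx600",
    "hainan": "gfx601", "oland": "gfx601",
    "pitcairn": "gfx601", "verde": "gfx601",
    "kaveri": "gfx700",
    "hawaii": "gfx701",
    "kabini": "gfx703", "mullins": "gfx703",
    "bonaire": "gfx704",
    "carrizo": "gfx801",
    "iceland": "gfx802", "tonga": "gfx802",
    "fiji": "gfx803", "polaris10": "gfx803", "polaris11": "gfx803",
    "stoney": "gfx810",
}


def canonical_processor(name):
    """Return the gfx name for an alternative name; other names pass through."""
    return PROCESSOR_ALIASES.get(name, name)


print(canonical_processor("tonga"))   # an alternative name
print(canonical_processor("gfx900"))  # already a Processor column name
```

Names that have no alternative form, such as the R600 family processors or ``gfx900``, are returned unchanged.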
|  | 246 | .. _amdgpu-target-features: | 
|  | 247 |  | 
|  | 248 | Target Features | 
|  | 249 | --------------- | 
|  | 250 |  | 
|  | 251 | Target features control how code is generated to support certain | 
| Tony Tye | 31105cc | 2017-12-11 15:35:27 +0000 | [diff] [blame] | 252 | processor-specific features. Not all target features are supported by | 
|  | 253 | all processors. The runtime must ensure that the features supported by | 
|  | 254 | the device used to execute the code match the features enabled when | 
|  | 255 | generating the code. A mismatch of features may result in incorrect | 
|  | 256 | execution, or a reduction in performance. | 
|  | 257 |  | 
|  | 258 | The target features supported by each processor, and the default value | 
|  | 259 | used if not specified explicitly, are listed in | 
|  | 260 | :ref:`amdgpu-processor-table`. | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 261 |  | 
|  | 262 | Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU | 
|  | 263 | target features. | 
|  | 264 |  | 
|  | 265 | For example: | 
|  | 266 |  | 
|  | 267 | ``-mxnack`` | 
| Tony Tye | 31105cc | 2017-12-11 15:35:27 +0000 | [diff] [blame] | 268 | Enable the ``xnack`` feature. | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 269 | ``-mno-xnack`` | 
| Tony Tye | 31105cc | 2017-12-11 15:35:27 +0000 | [diff] [blame] | 270 | Disable the ``xnack`` feature. | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 271 |  | 
|  | 272 | .. table:: AMDGPU Target Features | 
|  | 273 | :name: amdgpu-target-feature-table | 
|  | 274 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 275 | ====================== ================================================== | 
|  | 276 | Target Feature         Description | 
|  | 277 | ====================== ================================================== | 
|  | 278 | -m[no-]xnack           Enable/disable generating code that has | 
|  | 279 | memory clauses that are compatible with | 
|  | 280 | having XNACK replay enabled. | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 281 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 282 | This is used for demand paging and page | 
|  | 283 | migration. If XNACK replay is enabled in | 
|  | 284 | the device, then if a page fault occurs | 
|  | 285 | the code may execute incorrectly if the | 
|  | 286 | ``xnack`` feature is not enabled. Executing | 
|  | 287 | code that has the feature enabled on a | 
|  | 288 | device that does not have XNACK replay | 
|  | 289 | enabled will execute correctly, but may | 
|  | 290 | be less performant than code with the | 
|  | 291 | feature disabled. | 
|  | 292 |  | 
|  | 293 | -m[no-]sram-ecc        Enable/disable generating code that assumes SRAM | 
|  | 294 | ECC is enabled/disabled. | 
|  | 295 |  | 
|  | 296 | -m[no-]wavefrontsize64 Control the default wavefront size used when | 
|  | 297 | generating code for kernels. When disabled | 
|  | 298 | native wavefront size 32 is used, when enabled | 
|  | 299 | wavefront size 64 is used. | 
|  | 300 |  | 
|  | 301 | -m[no-]cumode          Control the default wavefront execution mode used | 
|  | 302 | when generating code for kernels. When disabled | 
|  | 303 | native WGP wavefront execution mode is used, | 
|  | 304 | when enabled CU wavefront execution mode is used | 
|  | 305 | (see :ref:`amdgpu-amdhsa-memory-model`). | 
|  | 306 | ====================== ================================================== | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 307 |  | 
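As an illustration of the ``-m[no-]<TargetFeature>`` spelling, the sketch below assembles a clang argument list from a processor name and a set of feature settings. ``clang_args`` is a hypothetical helper, the joined ``-mcpu=<Processor>`` form is an assumption about how the processor is passed, and the feature names are those in the table above.

```python
# Feature names taken from the AMDGPU Target Features table.
KNOWN_FEATURES = {"xnack", "sram-ecc", "wavefrontsize64", "cumode"}


def clang_args(triple, processor, features=None):
    """Return a clang argument list (hypothetical helper, not a clang API).

    `features` maps a feature name to True (enable) or False (disable).
    Features left out keep the processor's default from the processor table.
    """
    args = ["clang", "-target", triple, f"-mcpu={processor}"]
    for name, enabled in sorted((features or {}).items()):
        if name not in KNOWN_FEATURES:
            raise ValueError(f"unknown target feature: {name!r}")
        args.append(f"-m{name}" if enabled else f"-mno-{name}")
    return args


print(" ".join(clang_args("amdgcn-amd-amdhsa", "gfx900", {"xnack": False})))
```

The sketch does not check whether the chosen processor supports a feature; as noted above, the runtime remains responsible for matching the enabled features to the device.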
|  | 308 | .. _amdgpu-address-spaces: | 
| Tom Stellard | 3ec09e6 | 2016-04-06 01:29:19 +0000 | [diff] [blame] | 309 |  | 
|  | 310 | Address Spaces | 
|  | 311 | -------------- | 
|  | 312 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 313 | The AMDGPU backend uses the following address space mappings. | 
| Tom Stellard | 3ec09e6 | 2016-04-06 01:29:19 +0000 | [diff] [blame] | 314 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 315 | The memory space names used in the table, aside from the region memory space, are | 
|  | 316 | from the OpenCL standard. | 
| Tom Stellard | 3ec09e6 | 2016-04-06 01:29:19 +0000 | [diff] [blame] | 317 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 318 | The LLVM Address Space number is used throughout LLVM (for example, in LLVM IR). | 
| Tom Stellard | 3ec09e6 | 2016-04-06 01:29:19 +0000 | [diff] [blame] | 319 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 320 | .. table:: Address Space Mapping | 
|  | 321 | :name: amdgpu-address-space-mapping-table | 
|  | 322 |  | 
| Neil Henning | 523dab0 | 2019-03-18 14:44:28 +0000 | [diff] [blame] | 323 | ================== ================================= | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 324 | LLVM Address Space Memory Space | 
| Neil Henning | 523dab0 | 2019-03-18 14:44:28 +0000 | [diff] [blame] | 325 | ================== ================================= | 
| Yaxun Liu | 0124b54 | 2018-02-13 18:00:25 +0000 | [diff] [blame] | 326 | 0                  Generic (Flat) | 
|  | 327 | 1                  Global | 
|  | 328 | 2                  Region (GDS) | 
|  | 329 | 3                  Local (group/LDS) | 
|  | 330 | 4                  Constant | 
|  | 331 | 5                  Private (Scratch) | 
|  | 332 | 6                  Constant 32-bit | 
| Neil Henning | 523dab0 | 2019-03-18 14:44:28 +0000 | [diff] [blame] | 333 | 7                  Buffer Fat Pointer (experimental) | 
|  | 334 | ================== ================================= | 
|  | 335 |  | 
|  | 336 | The buffer fat pointer is an experimental address space that is currently | 
|  | 337 | unsupported in the backend. It exposes a non-integral pointer that is intended, | 
|  | 338 | in the future, to model a 128-bit buffer descriptor plus a 32-bit offset into | 
|  | 339 | the buffer descriptor (together encapsulating a 160-bit 'pointer'). This would | 
|  | 340 | allow normal LLVM load/store/atomic operations to model the buffer descriptors | 
|  | 341 | used heavily in graphics workloads targeting the backend. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 342 |  | 
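For tools that emit or inspect LLVM IR, the mapping table can be kept as a simple lookup. The following sketch assumes hypothetical names (``ADDRESS_SPACES``, ``addrspace_qualifier``) and lower-case keys chosen for illustration; the numbers are those in the table above.

```python
# Memory space name -> LLVM address space number, from the mapping table.
# The key spellings are illustrative; only the numbers come from the table.
ADDRESS_SPACES = {
    "generic": 0,             # Flat
    "global": 1,
    "region": 2,              # GDS
    "local": 3,               # group/LDS
    "constant": 4,
    "private": 5,             # Scratch
    "constant_32bit": 6,
    "buffer_fat_pointer": 7,  # experimental, unsupported in the backend
}


def addrspace_qualifier(memory_space):
    """Return the LLVM IR `addrspace(N)` qualifier text for a memory space."""
    return f"addrspace({ADDRESS_SPACES[memory_space]})"


print(addrspace_qualifier("local"))
```

With this table, the call above prints ``addrspace(3)``, the qualifier that marks a pointer to local (LDS) memory in LLVM IR.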
|  | 343 | .. _amdgpu-memory-scopes: | 
|  | 344 |  | 
|  | 345 | Memory Scopes | 
|  | 346 | ------------- | 
|  | 347 |  | 
|  | 348 | This section describes the LLVM memory synchronization scopes supported by the | 
|  | 349 | AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see | 
|  | 350 | :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`). | 
|  | 351 |  | 
|  | 352 | The memory model supported is based on the HSA memory model [HSA]_ which is | 
|  | 353 | based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before | 
|  | 354 | relation is transitive over the synchronizes-with relation independent of scope, | 
|  | 355 | and synchronizes-with allows the memory scope instances to be inclusive (see | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 356 | table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`). | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 357 |  | 
|  | 358 | This differs from the OpenCL [OpenCL]_ memory model, which does not have scope | 
|  | 359 | inclusion and requires the memory scopes to match exactly. However, this | 
|  | 360 | is conservatively correct for OpenCL. | 
|  | 361 |  | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 362 | .. table:: AMDHSA LLVM Sync Scopes | 
|  | 363 | :name: amdgpu-amdhsa-llvm-sync-scopes-table | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 364 |  | 
| Konstantin Zhuravlyov | 51809cb | 2019-03-25 20:50:21 +0000 | [diff] [blame] | 365 | ======================= =================================================== | 
|  | 366 | LLVM Sync Scope         Description | 
|  | 367 | ======================= =================================================== | 
|  | 368 | *none*                  The default: ``system``. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 369 |  | 
| Konstantin Zhuravlyov | 51809cb | 2019-03-25 20:50:21 +0000 | [diff] [blame] | 370 | Synchronizes with, and participates in modification | 
|  | 371 | and seq_cst total orderings with, other operations | 
|  | 372 | (except image operations) for all address spaces | 
|  | 373 | (except private, or generic that accesses private) | 
|  | 374 | provided the other operation's sync scope is: | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 375 |  | 
| Konstantin Zhuravlyov | 51809cb | 2019-03-25 20:50:21 +0000 | [diff] [blame] | 376 | - ``system``. | 
|  | 377 | - ``agent`` and executed by a thread on the same | 
|  | 378 | agent. | 
|  | 379 | - ``workgroup`` and executed by a thread in the | 
|  | 380 | same workgroup. | 
|  | 381 | - ``wavefront`` and executed by a thread in the | 
|  | 382 | same wavefront. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 383 |  | 
| Konstantin Zhuravlyov | 51809cb | 2019-03-25 20:50:21 +0000 | [diff] [blame] | 384 | ``agent``               Synchronizes with, and participates in modification | 
|  | 385 | and seq_cst total orderings with, other operations | 
|  | 386 | (except image operations) for all address spaces | 
|  | 387 | (except private, or generic that accesses private) | 
|  | 388 | provided the other operation's sync scope is: | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 389 |  | 
| Konstantin Zhuravlyov | 51809cb | 2019-03-25 20:50:21 +0000 | [diff] [blame] | 390 | - ``system`` or ``agent`` and executed by a thread | 
|  | 391 | on the same agent. | 
|  | 392 | - ``workgroup`` and executed by a thread in the | 
|  | 393 | same workgroup. | 
|  | 394 | - ``wavefront`` and executed by a thread in the | 
|  | 395 | same wavefront. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 396 |  | 
| Konstantin Zhuravlyov | 51809cb | 2019-03-25 20:50:21 +0000 | [diff] [blame] | 397 | ``workgroup``           Synchronizes with, and participates in modification | 
|  | 398 | and seq_cst total orderings with, other operations | 
|  | 399 | (except image operations) for all address spaces | 
|  | 400 | (except private, or generic that accesses private) | 
|  | 401 | provided the other operation's sync scope is: | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 402 |  | 
| Konstantin Zhuravlyov | 51809cb | 2019-03-25 20:50:21 +0000 | [diff] [blame] | 403 | - ``system``, ``agent`` or ``workgroup`` and | 
|  | 404 | executed by a thread in the same workgroup. | 
|  | 405 | - ``wavefront`` and executed by a thread in the | 
|  | 406 | same wavefront. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 407 |  | 
| Konstantin Zhuravlyov | 51809cb | 2019-03-25 20:50:21 +0000 | [diff] [blame] | 408 | ``wavefront``           Synchronizes with, and participates in modification | 
|  | 409 | and seq_cst total orderings with, other operations | 
|  | 410 | (except image operations) for all address spaces | 
|  | 411 | (except private, or generic that accesses private) | 
|  | 412 | provided the other operation's sync scope is: | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 413 |  | 
| Konstantin Zhuravlyov | 51809cb | 2019-03-25 20:50:21 +0000 | [diff] [blame] | 414 | - ``system``, ``agent``, ``workgroup`` or | 
|  | 415 | ``wavefront`` and executed by a thread in the | 
|  | 416 | same wavefront. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 417 |  | 
| Konstantin Zhuravlyov | 51809cb | 2019-03-25 20:50:21 +0000 | [diff] [blame] | 418 | ``singlethread``        Only synchronizes with, and participates in | 
|  | 419 | modification and seq_cst total orderings with, | 
|  | 420 | other operations (except image operations) running | 
|  | 421 | in the same thread for all address spaces (for | 
|  | 422 | example, in signal handlers). | 
|  | 423 |  | 
|  | 424 | ``one-as``              Same as ``system`` but only synchronizes with other | 
|  | 425 | operations within the same address space. | 
|  | 426 |  | 
|  | 427 | ``agent-one-as``        Same as ``agent`` but only synchronizes with other | 
|  | 428 | operations within the same address space. | 
|  | 429 |  | 
|  | 430 | ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with | 
|  | 431 | other operations within the same address space. | 
|  | 432 |  | 
|  | 433 | ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with | 
|  | 434 | other operations within the same address space. | 
|  | 435 |  | 
|  | 436 | ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with | 
|  | 437 | other operations within the same address space. | 
|  | 438 | ======================= =================================================== | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 439 |  | 
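The scope-inclusion rows for ``system``, ``agent``, ``workgroup`` and ``wavefront`` follow one pattern: two operations can synchronize when both threads belong to the same instance of the narrower of the two scopes. The sketch below models only that pattern. It is a simplification under stated assumptions: ``ThreadLocation`` and both functions are hypothetical, and ``singlethread``, the ``one-as`` variants, and the address space and image operation exclusions are left out.

```python
from typing import NamedTuple

# Larger rank means wider scope. Only the four hierarchical scopes are modelled.
SCOPE_RANK = {"system": 3, "agent": 2, "workgroup": 1, "wavefront": 0}


class ThreadLocation(NamedTuple):
    """Where a thread runs (illustrative identifiers, not a runtime API)."""
    agent: int
    workgroup: int
    wavefront: int


def shares_scope_instance(scope, a, b):
    """True if threads a and b are in the same instance of `scope`."""
    if scope == "system":
        return True
    if scope == "agent":
        return a.agent == b.agent
    if scope == "workgroup":
        return (a.agent, a.workgroup) == (b.agent, b.workgroup)
    if scope == "wavefront":
        return a == b  # same agent, workgroup and wavefront
    raise ValueError(f"unsupported scope: {scope!r}")


def scopes_are_inclusive(scope_a, loc_a, scope_b, loc_b):
    """Apply the table pattern: test the narrower scope's instance."""
    # An absent sync scope defaults to `system`, as in the *none* row.
    scope_a = scope_a or "system"
    scope_b = scope_b or "system"
    narrower = min(scope_a, scope_b, key=SCOPE_RANK.__getitem__)
    return shares_scope_instance(narrower, loc_a, loc_b)


t0 = ThreadLocation(agent=0, workgroup=0, wavefront=0)
t1 = ThreadLocation(agent=0, workgroup=0, wavefront=1)
print(scopes_are_inclusive("agent", t0, "workgroup", t1))      # same workgroup
print(scopes_are_inclusive("workgroup", t0, "wavefront", t1))  # different wavefronts
```

Under these assumptions the first call prints ``True`` and the second prints ``False``, matching the ``agent`` and ``workgroup`` rows of the table.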
|  | 440 | AMDGPU Intrinsics | 
|  | 441 | ----------------- | 
|  | 442 |  | 
| Tony Tye | e2f3e10 | 2018-06-14 16:40:10 +0000 | [diff] [blame] | 443 | The AMDGPU backend implements the following LLVM IR intrinsics. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 444 |  | 
|  | 445 | *This section is WIP.* | 
|  | 446 |  | 
|  | 447 | .. TODO | 
|  | 448 | List AMDGPU intrinsics | 
|  | 449 |  | 
AMDGPU Attributes
-----------------

The AMDGPU backend supports the following LLVM IR attributes.

  .. table:: AMDGPU LLVM IR Attributes
     :name: amdgpu-llvm-ir-attributes-table

     ======================================= ==========================================================
     LLVM Attribute                          Description
     ======================================= ==========================================================
     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                             will be specified when the kernel is dispatched. Generated
                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                             argument block size for the implicit arguments. This
                                             varies by OS and language (for OpenCL see
                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
                                             CLANG attribute [CLANG-ATTR]_.
     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
                                             mode register to be set on entry. Overrides the default for
                                             the calling convention.
     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
                                             the mode register to be set on entry. Overrides the default
                                             for the calling convention.
     ======================================= ==========================================================

Code Object
===========

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

Header
------

The AMDGPU backend uses the following ELF header:

  .. table:: AMDGPU ELF Header
     :name: amdgpu-elf-header-table

     ========================== ===============================
     Field                      Value
     ========================== ===============================
     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                                - ``ELFOSABI_AMDGPU_HSA``
                                - ``ELFOSABI_AMDGPU_PAL``
                                - ``ELFOSABI_AMDGPU_MESA3D``
     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
                                - ``ELFABIVERSION_AMDGPU_PAL``
                                - ``ELFABIVERSION_AMDGPU_MESA3D``
     ``e_type``                 - ``ET_REL``
                                - ``ET_DYN``
     ``e_machine``              ``EM_AMDGPU``
     ``e_entry``                0
     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
     ========================== ===============================

  ..

  .. table:: AMDGPU ELF Header Enumeration Values
     :name: amdgpu-elf-header-enumeration-values-table

     =============================== =====
     Name                            Value
     =============================== =====
     ``EM_AMDGPU``                   224
     ``ELFOSABI_NONE``               0
     ``ELFOSABI_AMDGPU_HSA``         64
     ``ELFOSABI_AMDGPU_PAL``         65
     ``ELFOSABI_AMDGPU_MESA3D``      66
     ``ELFABIVERSION_AMDGPU_HSA``    1
     ``ELFABIVERSION_AMDGPU_PAL``    0
     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
     =============================== =====

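As an illustration of how a tool might use the header fields and enumeration
values listed above, the following minimal Python sketch inspects the first
bytes of an ELF image. The helper name ``describe_amdgpu_elf_header`` and the
synthetic ``sample`` bytes are hypothetical and are not part of LLVM; the
``e_ident`` indices and the ``ET_*`` and ``ELFCLASS*`` values come from the
generic ELF specification.

```python
import struct

# Enumeration values taken from the AMDGPU ELF header tables.
EM_AMDGPU = 224
OSABI_NAMES = {
    0: "ELFOSABI_NONE",
    64: "ELFOSABI_AMDGPU_HSA",
    65: "ELFOSABI_AMDGPU_PAL",
    66: "ELFOSABI_AMDGPU_MESA3D",
}

# Generic ELF constants (from the ELF specification, not AMDGPU specific).
EI_CLASS, EI_DATA, EI_OSABI, EI_ABIVERSION, EI_NIDENT = 4, 5, 7, 8, 16
ELFCLASS_NAMES = {1: "ELFCLASS32", 2: "ELFCLASS64"}
ELFDATA2LSB = 1
ET_NAMES = {1: "ET_REL", 3: "ET_DYN"}


def describe_amdgpu_elf_header(data: bytes) -> dict:
    """Hypothetical helper: summarize the AMDGPU-relevant ELF header fields."""
    if len(data) < EI_NIDENT + 4 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if data[EI_DATA] != ELFDATA2LSB:
        raise ValueError("AMDGPU code objects are little-endian")
    # e_type and e_machine are 16-bit fields that directly follow e_ident
    # in both the 32-bit and the 64-bit ELF header layouts.
    e_type, e_machine = struct.unpack_from("<HH", data, EI_NIDENT)
    if e_machine != EM_AMDGPU:
        raise ValueError(f"e_machine is {e_machine}, expected EM_AMDGPU")
    return {
        "class": ELFCLASS_NAMES.get(data[EI_CLASS], "unknown"),
        "osabi": OSABI_NAMES.get(data[EI_OSABI], "unknown"),
        "abiversion": data[EI_ABIVERSION],
        "type": ET_NAMES.get(e_type, "unknown"),
    }


# Synthetic 20-byte header prefix: 64-bit, little-endian, HSA OS ABI,
# ABI version 1, ET_DYN, EM_AMDGPU.
sample = (b"\x7fELF" + bytes([2, 1, 1, 64, 1]) + bytes(7)
          + struct.pack("<HH", 3, 224))
info = describe_amdgpu_elf_header(sample)
print(info)
```

A real tool would read these bytes from the start of a code object file; the
sketch only looks at the identification bytes, ``e_type`` and ``e_machine``.
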
``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for ``r600`` architecture.

  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
    applications.

``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.

``e_ident[EI_OSABI]``
  One of the following AMD GPU architecture specific OS ABIs
  (see :ref:`amdgpu-os-table`):

  * ``ELFOSABI_NONE`` for *unknown* OS.

  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.

  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.

  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMD GPU architecture specific OS ABI to which the code
  object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
    3D runtime ABI.

``e_type``
  Can be one of the following values:


  ``ET_REL``
    The type produced by the AMDGPU backend compiler as it is a relocatable
    code object.

  ``ET_DYN``
    The type produced by the linker as it is a shared code object.

  The AMD HSA runtime loader requires an ``ET_DYN`` code object.

``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
  :ref:`amdgpu-elf-header-e_flags-table`).

``e_entry``
  The entry point is 0 as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags``
     :name: amdgpu-elf-header-e_flags-table

     ================================= ========== =============================
     Name                              Value      Description
     ================================= ========== =============================
     **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
     -------------------------------------------- -----------------------------
     ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
                                                  mask for
                                                  ``EF_AMDGPU_MACH_xxx`` values
                                                  defined in
                                                  :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
                                                  target feature is
                                                  enabled for all code
                                                  contained in the code object.
                                                  If the processor
                                                  does not support the
                                                  ``xnack`` target
                                                  feature then must
                                                  be 0.
                                                  See
                                                  :ref:`amdgpu-target-features`.
     ``EF_AMDGPU_SRAM_ECC``            0x00000200 Indicates if the ``sram-ecc``
                                                  target feature is
                                                  enabled for all code
                                                  contained in the code object.
                                                  If the processor
                                                  does not support the
                                                  ``sram-ecc`` target
                                                  feature then must
                                                  be 0.
                                                  See
                                                  :ref:`amdgpu-target-features`.
     ================================= ========== =============================

  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
     :name: amdgpu-ef-amdgpu-mach-table

     ================================= ========== =============================
     Name                              Value      Description (see
                                                  :ref:`amdgpu-processor-table`)
     ================================= ========== =============================
     ``EF_AMDGPU_MACH_NONE``           0x000      *not specified*
     ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
     ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
     ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
     ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
     ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
     ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
     ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
     ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
     ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
     ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
     ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
     ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
     ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
     ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
     ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
     ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
     *reserved*                        0x011 -    Reserved for ``r600``
                                       0x01f      architecture processors.
     ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
     ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
     ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
     ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
     ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
     ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
     ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
     *reserved*                        0x027      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
     ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
     ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
     ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
     ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
     ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
     ``EF_AMDGPU_MACH_AMDGCN_GFX904``  0x02e      ``gfx904``
     ``EF_AMDGPU_MACH_AMDGCN_GFX906``  0x02f      ``gfx906``
     *reserved*                        0x030      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX909``  0x031      ``gfx909``
     *reserved*                        0x032      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033      ``gfx1010``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034      ``gfx1011``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035      ``gfx1012``
     ================================= ========== =============================

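The ``e_flags`` layout described above can be decoded with plain bit masks. The
following Python sketch is illustrative only: the function ``decode_e_flags``
is hypothetical, and ``MACH_NAMES`` deliberately holds just a few entries
copied from the ``EF_AMDGPU_MACH`` table rather than the full list.

```python
# Masks from the e_flags table.
EF_AMDGPU_MACH = 0x000000FF
EF_AMDGPU_XNACK = 0x00000100
EF_AMDGPU_SRAM_ECC = 0x00000200

# Small subset of the EF_AMDGPU_MACH values table, for illustration only.
MACH_NAMES = {
    0x000: "not specified",
    0x02C: "gfx900",
    0x02F: "gfx906",
    0x033: "gfx1010",
}


def decode_e_flags(e_flags: int) -> dict:
    """Hypothetical helper: split e_flags into processor and feature bits."""
    mach = e_flags & EF_AMDGPU_MACH
    return {
        "processor": MACH_NAMES.get(mach, f"unknown (0x{mach:03x})"),
        "xnack": bool(e_flags & EF_AMDGPU_XNACK),
        "sram_ecc": bool(e_flags & EF_AMDGPU_SRAM_ECC),
    }


# 0x12f has the XNACK bit set and selects EF_AMDGPU_MACH_AMDGCN_GFX906.
decoded = decode_e_flags(0x12F)
print(decoded)
```
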
Sections
--------

An AMDGPU target ELF code object has the standard ELF sections which include:

  .. table:: AMDGPU ELF Sections
     :name: amdgpu-elf-sections-table

     ================== ================ =================================
     Name               Type             Attributes
     ================== ================ =================================
     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
     ``.note``          ``SHT_NOTE``     *none*
     ``.rela``\ *name*  ``SHT_RELA``     *none*
     ``.rela.dyn``      ``SHT_RELA``     *none*
     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.shstrtab``      ``SHT_STRTAB``   *none*
     ``.strtab``        ``SHT_STRTAB``   *none*
     ``.symtab``        ``SHT_SYMTAB``   *none*
     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
     ================== ================ =================================

These sections have their standard meanings (see [ELF]_) and are only generated
if needed.

``.debug_``\ *\**
  The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
  DWARF produced by the AMDGPU backend.

``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
  The standard sections used by a dynamic loader.

``.note``
  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
  backend.

``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.

  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each of the relocatable code object's ``.rela``\ *name* sections.

  See :ref:`amdgpu-relocation-records` for the relocation records supported by
  the AMDGPU backend.

``.text``
  The executable machine code for the kernels and functions they call. Generated
  as position independent code. See :ref:`amdgpu-code-conventions` for
  information on conventions used in the ISA generation.

.. _amdgpu-note-records:

Note Records
------------

The AMDGPU backend code object contains ELF note records in the ``.note``
section. The set of generated notes and their semantics depend on the code
object version; see :ref:`amdgpu-note-records-v2` and
:ref:`amdgpu-note-records-v3`.

As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding
must be generated after the ``name`` field to ensure the ``desc`` field is 4
byte aligned. In addition, minimal zero byte padding must be generated to
ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 4 byte
alignment.

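The padding rules above can be sketched in a few lines of Python. This is an
illustration under stated assumptions, not the code used by LLVM: the helpers
``pad4`` and ``build_note`` are hypothetical, and the record layout (three
little-endian 32-bit words ``namesz``, ``descsz`` and ``type``, with ``namesz``
counting the terminating null byte) follows the generic ELF note format.

```python
import struct


def pad4(payload: bytes) -> bytes:
    """Append the minimal number of zero bytes to reach a multiple of 4."""
    return payload + b"\0" * (-len(payload) % 4)


def build_note(name: str, note_type: int, desc: bytes) -> bytes:
    """Hypothetical helper: encode one little-endian ELF note record."""
    name_bytes = name.encode("ascii") + b"\0"  # namesz counts the terminator
    # namesz and descsz record the unpadded sizes.
    header = struct.pack("<III", len(name_bytes), len(desc), note_type)
    # Padding after name keeps desc 4-byte aligned; padding after desc
    # keeps the stored desc field a multiple of 4 bytes.
    return header + pad4(name_bytes) + pad4(desc)


# b"\x80" is an empty map in the Message Pack encoding; 32 is the
# NT_AMDGPU_METADATA value used by Code Object V3.
note = build_note("AMDGPU", 32, b"\x80")
namesz, descsz, note_type = struct.unpack_from("<III", note, 0)
desc_offset = 12 + len(pad4(b"AMDGPU\0"))
print(len(note), namesz, descsz, note_type, desc_offset)
```

With the 7-byte name ``"AMDGPU\0"`` one padding byte is added after the name,
and the 1-byte descriptor is followed by three padding bytes.
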
.. _amdgpu-note-records-v2:

Code Object V2 Note Records (-mattr=-code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning:: Code Object V2 is not the default code object version emitted by
  this version of LLVM. For a description of the notes generated with the
  default configuration (Code Object V3) see :ref:`amdgpu-note-records-v3`.

The AMDGPU backend code object uses the following ELF note record in the
``.note`` section when compiling for Code Object V2 (-mattr=-code-object-v3).

Additional note records may be present, but any which are not documented here
are deprecated and should not be used.

  .. table:: AMDGPU Code Object V2 ELF Note Records
     :name: amdgpu-elf-note-records-table-v2

     ===== ============================== ======================================
     Name  Type                           Description
     ===== ============================== ======================================
     "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
     ===== ============================== ======================================

  ..

  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-table-v2

     ============================== =====
     Name                           Value
     ============================== =====
     *reserved*                       0-9
     ``NT_AMD_AMDGPU_HSA_METADATA``    10
     *reserved*                        11
     ============================== =====

``NT_AMD_AMDGPU_HSA_METADATA``
  Specifies extensible metadata associated with the code objects executed on HSA
  [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
  the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code
  object metadata string.

.. _amdgpu-note-records-v3:

Code Object V3 Note Records (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU backend code object uses the following ELF note record in the
``.note`` section when compiling for Code Object V3 (-mattr=+code-object-v3).

Additional note records may be present, but any which are not documented here
are deprecated and should not be used.

  .. table:: AMDGPU Code Object V3 ELF Note Records
     :name: amdgpu-elf-note-records-table-v3

     ======== ============================== ======================================
     Name     Type                           Description
     ======== ============================== ======================================
     "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
                                             binary format.
     ======== ============================== ======================================

  ..

  .. table:: AMDGPU Code Object V3 ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-table-v3

     ============================== =====
     Name                           Value
     ============================== =====
     *reserved*                     0-31
     ``NT_AMDGPU_METADATA``         32
     ============================== =====

``NT_AMDGPU_METADATA``
  Specifies extensible metadata associated with an AMDGPU code
  object. It is encoded as a map in the Message Pack [MsgPack]_ binary
  data format. See :ref:`amdgpu-amdhsa-code-object-metadata-v3` for the
  map keys defined for the ``amdhsa`` OS.

.. _amdgpu-symbols:

Symbols
-------

Symbols include the following:

  .. table:: AMDGPU ELF Symbols
     :name: amdgpu-elf-symbols-table

     ===================== ================== ================ ==================
     Name                  Type               Section          Description
     ===================== ================== ================ ==================
     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
                                              - ``.rodata``
                                              - ``.bss``
     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
     ===================== ================== ================ ==================

Global variable
  Global variables both used and defined by the compilation unit.

  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.

  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
  will resolve relocations using the definition provided by another code object
  or explicitly defined by the runtime.

  If the symbol resides in local/group memory (LDS) then its section is the
  special processor-specific section name ``SHN_AMDGPU_LDS``, and the
  ``st_value`` field describes alignment requirements as it does for common
  symbols.

  .. TODO
     Add description of linked shared object symbols. Seems undefined symbols
     are marked as STT_NOTYPE.

Kernel descriptor
  Every HSA kernel has an associated kernel descriptor. It is the address of the
  kernel descriptor that is used in the AQL dispatch packet used to invoke the
  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.

Kernel entry point
  Every HSA kernel also has a symbol for its machine code entry point.

.. _amdgpu-relocation-records:

Relocation Records
------------------

The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
relocatable fields are:

``word32``
  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMD GPU architecture.

``word64``
  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMD GPU architecture.

The following notations are used for specifying relocation calculations:

**A**
  Represents the addend used to compute the value of the relocatable field.

**G**
  Represents the offset into the global offset table at which the relocation
  entry's symbol will reside during execution.

**GOT**
  Represents the address of the global offset table.

**P**
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
  of the storage unit being relocated (computed using ``r_offset``).

**S**
  Represents the value of the symbol whose index resides in the relocation
  entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.

**B**
  Represents the base address of a loaded executable or shared object which is
  the difference between the ELF address and the actual load address. Relocations
  using this are only valid in executable or shared objects.

|  | 936 | The following relocation types are supported: | 
|  | 937 |  | 
|  | 938 | .. table:: AMDGPU ELF Relocation Records | 
|  | 939 | :name: amdgpu-elf-relocation-records-table | 
|  | 940 |  | 
| Tony Tye | db6c993 | 2018-01-30 23:59:43 +0000 | [diff] [blame] | 941 | ========================== ======= =====  ==========  ============================== | 
|  | 942 | Relocation Type            Kind    Value  Field       Calculation | 
|  | 943 | ========================== ======= =====  ==========  ============================== | 
|  | 944 | ``R_AMDGPU_NONE``                  0      *none*      *none* | 
| Tony Tye | 223f4c7 | 2018-04-13 01:01:27 +0000 | [diff] [blame] | 945 | ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF | 
|  | 946 | Dynamic | 
|  | 947 | ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32 | 
|  | 948 | Dynamic | 
|  | 949 | ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A | 
| Matt Arsenault | 0084adc | 2018-04-30 19:08:16 +0000 | [diff] [blame] | 950 | Dynamic | 
| Tony Tye | db6c993 | 2018-01-30 23:59:43 +0000 | [diff] [blame] | 951 | ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P | 
|  | 952 | ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P | 
| Tony Tye | 223f4c7 | 2018-04-13 01:01:27 +0000 | [diff] [blame] | 953 | ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A | 
|  | 954 | Dynamic | 
| Tony Tye | db6c993 | 2018-01-30 23:59:43 +0000 | [diff] [blame] | 955 | ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P | 
|  | 956 | ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF | 
|  | 957 | ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32 | 
|  | 958 | ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF | 
|  | 959 | ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32 | 
|  | 960 | *reserved*                         12 | 
|  | 961 | ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A | 
|  | 962 | ========================== ======= =====  ==========  ============================== | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 963 |  | 
| Tony Tye | 223f4c7 | 2018-04-13 01:01:27 +0000 | [diff] [blame] | 964 | ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by | 
|  | 965 | the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``. | 
|  | 966 |  | 
|  | 967 | There is no current OS loader support for 32 bit programs and so | 
|  | 968 | ``R_AMDGPU_ABS32`` is not used. | 
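
As a worked illustration of the calculations in the table above, the following
Python sketch evaluates a few of them. It assumes the usual ELF meanings of the
symbols (``S`` symbol value, ``A`` addend, ``P`` address of the place being
relocated, ``B`` load base) and assumes that each result is truncated to the
width of its field. The function names are hypothetical and are not part of
LLVM.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def abs32_lo(s, a):
    """R_AMDGPU_ABS32_LO: (S + A) & 0xFFFFFFFF."""
    return (s + a) & MASK32


def abs32_hi(s, a):
    """R_AMDGPU_ABS32_HI: (S + A) >> 32, truncated to a word32 field (assumption)."""
    return ((s + a) >> 32) & MASK32


def rel32_lo(s, a, p):
    """R_AMDGPU_REL32_LO: (S + A - P) & 0xFFFFFFFF."""
    return (s + a - p) & MASK32


def rel32_hi(s, a, p):
    """R_AMDGPU_REL32_HI: (S + A - P) >> 32, truncated to a word32 field (assumption)."""
    # Python's >> is an arithmetic shift, so negative offsets sign-extend.
    return ((s + a - p) >> 32) & MASK32


def relative64(b, a):
    """R_AMDGPU_RELATIVE64: B + A, truncated to a word64 field (assumption)."""
    return (b + a) & MASK64


# Hypothetical symbol value, addend and place, chosen only for illustration.
symbol, addend, place = 0x1_0000_2000, 8, 0x1000
lo = rel32_lo(symbol, addend, place)
hi = rel32_hi(symbol, addend, place)
```

Under these assumptions the ``_LO`` and ``_HI`` pairs together carry one 64 bit
value: ``(hi << 32) | lo`` reproduces the full result modulo 2^64, including for
negative PC-relative offsets.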
| Matt Arsenault | 0084adc | 2018-04-30 19:08:16 +0000 | [diff] [blame] | 969 |  | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 970 | .. _amdgpu-dwarf: | 
|  | 971 |  | 
|  | 972 | DWARF | 
|  | 973 | ----- | 
|  | 974 |  | 
| Scott Linder | 16c7bda | 2018-02-23 23:01:06 +0000 | [diff] [blame] | 975 | Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 976 | information that maps the code object executable code and data to the source | 
|  | 977 | language constructs. It can be used by tools such as debuggers and profilers. | 
|  | 978 |  | 
|  | 979 | Address Space Mapping | 
|  | 980 | ~~~~~~~~~~~~~~~~~~~~~ | 
|  | 981 |  | 
|  | 982 | The following address space mapping is used: | 
|  | 983 |  | 
|  | 984 | .. table:: AMDGPU DWARF Address Space Mapping | 
|  | 985 | :name: amdgpu-dwarf-address-space-mapping-table | 
|  | 986 |  | 
|  | 987 | =================== ================= | 
|  | 988 | DWARF Address Space Memory Space | 
|  | 989 | =================== ================= | 
|  | 990 | 1                   Private (Scratch) | 
|  | 991 | 2                   Local (group/LDS) | 
|  | 992 | *omitted*           Global | 
|  | 993 | *omitted*           Constant | 
|  | 994 | *omitted*           Generic (Flat) | 
|  | 995 | *not supported*     Region (GDS) | 
|  | 996 | =================== ================= | 
|  | 997 |  | 
|  | 998 | See :ref:`amdgpu-address-spaces` for information on the memory space terminology | 
|  | 999 | used in the table. | 
|  | 1000 |  | 
|  | 1001 | An ``address_class`` attribute is generated on pointer type DIEs to specify the | 
|  | 1002 | DWARF address space of the value of the pointer when it is in the *private* or | 
|  | 1003 | *local* address space. Otherwise the attribute is omitted. | 
|  | 1004 |  | 
|  | 1005 | An ``XDEREF`` operation is generated in location list expressions for variables | 
|  | 1006 | that are allocated in the *private* or *local* address space. Otherwise no | 
|  | 1007 | ``XDEREF`` is generated. | 
|  | 1008 |  | 
|  | 1009 | Register Mapping | 
|  | 1010 | ~~~~~~~~~~~~~~~~ | 
|  | 1011 |  | 
|  | 1012 | *This section is WIP.* | 
|  | 1013 |  | 
|  | 1014 | .. TODO | 
|  | 1015 | Define DWARF register enumeration. | 
|  | 1016 |  | 
|  | 1017 | To present wavefront state, vector registers should be exposed as 64 wide | 
|  | 1018 | (rather than the per work-item view that LLVM uses), either as separate | 
|  | 1019 | registers or as a single 64x4 byte register. In either case use a new LANE op | 
|  | 1020 | (akin to XDEREF) to select the current lane in a location | 
|  | 1021 | expression. This would also allow scalar register spilling to vector register | 
|  | 1022 | lanes to be expressed (currently no debug information is being generated for | 
|  | 1023 | spilling). If the single wide register approach is chosen then use LANE in | 
|  | 1024 | conjunction with the PIECE operation to select the dword part of the register | 
|  | 1025 | for the current lane. If the separate register approach is chosen then use | 
|  | 1026 | LANE to select the register. | 
|  | 1027 |  | 
|  | 1028 | Source Text | 
|  | 1029 | ~~~~~~~~~~~ | 
|  | 1030 |  | 
| Scott Linder | 16c7bda | 2018-02-23 23:01:06 +0000 | [diff] [blame] | 1031 | Source text for online-compiled programs (e.g. those compiled by the OpenCL | 
|  | 1032 | runtime) may be embedded into the DWARF v5 line table using the ``clang | 
|  | 1033 | -gembed-source`` option, described in table :ref:`amdgpu-debug-options`. | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 1034 |  | 
| Scott Linder | 16c7bda | 2018-02-23 23:01:06 +0000 | [diff] [blame] | 1035 | For example: | 
|  | 1036 |  | 
|  | 1037 | ``-gembed-source`` | 
|  | 1038 | Enable the embedded source DWARF v5 extension. | 
|  | 1039 | ``-gno-embed-source`` | 
|  | 1040 | Disable the embedded source DWARF v5 extension. | 
|  | 1041 |  | 
|  | 1042 | .. table:: AMDGPU Debug Options | 
|  | 1043 | :name: amdgpu-debug-options | 
|  | 1044 |  | 
|  | 1045 | ==================== ================================================== | 
|  | 1046 | Debug Flag           Description | 
|  | 1047 | ==================== ================================================== | 
|  | 1048 | -g[no-]embed-source  Enable/disable embedding source text in DWARF | 
|  | 1049 | debug sections. Useful for environments where | 
|  | 1050 | source cannot be written to disk, such as | 
|  | 1051 | when performing online compilation. | 
|  | 1052 | ==================== ================================================== | 
|  | 1053 |  | 
|  | 1054 | This option enables one extended content type in the DWARF v5 Line Number | 
|  | 1055 | Program Header, which is used to encode embedded source. | 
|  | 1056 |  | 
|  | 1057 | .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types | 
|  | 1058 | :name: amdgpu-dwarf-extended-content-types | 
|  | 1059 |  | 
|  | 1060 | ============================  ====================== | 
|  | 1061 | Content Type                  Form | 
|  | 1062 | ============================  ====================== | 
|  | 1063 | ``DW_LNCT_LLVM_source``       ``DW_FORM_line_strp`` | 
|  | 1064 | ============================  ====================== | 
|  | 1065 |  | 
|  | 1066 | The source field will contain the UTF-8 encoded, null-terminated source text | 
|  | 1067 | with ``'\n'`` line endings. When the source field is present, consumers can use | 
|  | 1068 | the embedded source instead of attempting to discover the source on disk. When | 
|  | 1069 | the source field is absent, consumers can access the file to get the source | 
|  | 1070 | text. | 
|  | 1071 |  | 
|  | 1072 | The above content type appears in the ``file_name_entry_format`` field of the | 
|  | 1073 | line table prologue, and its corresponding value appears in the ``file_names`` | 
|  | 1074 | field. The current encoding of the content type is documented in table | 
|  | 1075 | :ref:`amdgpu-dwarf-extended-content-types-encoding`. | 
|  | 1076 |  | 
|  | 1077 | .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding | 
|  | 1078 | :name: amdgpu-dwarf-extended-content-types-encoding | 
|  | 1079 |  | 
|  | 1080 | ============================  ==================== | 
|  | 1081 | Content Type                  Value | 
|  | 1082 | ============================  ==================== | 
|  | 1083 | ``DW_LNCT_LLVM_source``       0x2001 | 
|  | 1084 | ============================  ==================== | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 1085 |  | 
|  | 1086 | .. _amdgpu-code-conventions: | 
|  | 1087 |  | 
|  | 1088 | Code Conventions | 
|  | 1089 | ================ | 
|  | 1090 |  | 
|  | 1091 | This section provides code conventions used for each supported target triple OS | 
|  | 1092 | (see :ref:`amdgpu-target-triples`). | 
|  | 1093 |  | 
|  | 1094 | AMDHSA | 
|  | 1095 | ------ | 
|  | 1096 |  | 
|  | 1097 | This section provides code conventions used when the target triple OS is | 
|  | 1098 | ``amdhsa`` (see :ref:`amdgpu-target-triples`). | 
|  | 1099 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 1100 | .. _amdgpu-amdhsa-code-object-target-identification: | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1101 |  | 
| Tony Tye | 01bfd6c | 2018-03-27 21:20:46 +0000 | [diff] [blame] | 1102 | Code Object Target Identification | 
|  | 1103 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  | 1104 |  | 
|  | 1105 | The AMDHSA OS uses the following syntax to specify the code object | 
|  | 1106 | target as a single string: | 
|  | 1107 |  | 
|  | 1108 | ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>`` | 
|  | 1109 |  | 
|  | 1110 | Where: | 
|  | 1111 |  | 
|  | 1112 | - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>`` | 
|  | 1113 | are the same as the *Target Triple* (see | 
|  | 1114 | :ref:`amdgpu-target-triples`). | 
|  | 1115 |  | 
|  | 1116 | - ``<Processor>`` is the same as the *Processor* (see | 
|  | 1117 | :ref:`amdgpu-processors`). | 
|  | 1118 |  | 
|  | 1119 | - ``<Target Features>`` is a list of the enabled *Target Features* | 
|  | 1120 | (see :ref:`amdgpu-target-features`), each prefixed by a plus, that | 
|  | 1121 | apply to *Processor*. The list must be in the same order as listed | 
|  | 1122 | in the table :ref:`amdgpu-target-feature-table`. Note that *Target | 
|  | 1123 | Features* must be included in the list if they are enabled even if | 
|  | 1124 | that is the default for *Processor*. | 
|  | 1125 |  | 
|  | 1126 | For example: | 
|  | 1127 |  | 
|  | 1128 | ``"amdgcn-amd-amdhsa--gfx902+xnack"`` | 
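
The syntax above can be sketched as a small Python helper. This is only an
illustration: the function name and its parameters are hypothetical, and the
caller is assumed to pass the enabled target features already in the order
required by the target feature table.

```python
def code_object_target_id(architecture, vendor, os, environment,
                          processor, target_features=()):
    """Build <Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>.

    Hypothetical helper, not an LLVM API. An empty environment is kept as an
    empty field, which is why the example string contains "--".
    """
    triple = "-".join([architecture, vendor, os, environment])
    # Each enabled target feature is prefixed by a plus sign.
    features = "".join("+" + feature for feature in target_features)
    return triple + "-" + processor + features


target_id = code_object_target_id("amdgcn", "amd", "amdhsa", "",
                                  "gfx902", ["xnack"])
```

With these inputs the helper reproduces the example string shown above.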
|  | 1129 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 1130 | .. _amdgpu-amdhsa-code-object-metadata: | 
|  | 1131 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1132 | Code Object Metadata | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 1133 | ~~~~~~~~~~~~~~~~~~~~ | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1134 |  | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 1135 | The code object metadata specifies extensible metadata associated with the code | 
|  | 1136 | objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm | 
| Scott Linder | ac20b74 | 2019-03-28 15:08:52 +0000 | [diff] [blame] | 1137 | [AMD-ROCm]_. The encoding and semantics of this metadata depend on the code | 
|  | 1138 | object version; see :ref:`amdgpu-amdhsa-code-object-metadata-v2` and | 
|  | 1139 | :ref:`amdgpu-amdhsa-code-object-metadata-v3`. | 
|  | 1140 |  | 
|  | 1141 | Code object metadata is specified in a note record (see | 
|  | 1142 | :ref:`amdgpu-note-records`) and is required when the target triple OS is | 
|  | 1143 | ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum | 
|  | 1144 | information necessary to support the ROCm kernel queries, for example the | 
|  | 1145 | segment sizes needed in a dispatch packet. In addition, a high level language | 
|  | 1146 | runtime may require other information to be included. For example, the AMD | 
|  | 1147 | OpenCL runtime records kernel argument information. | 
| Scott Linder | 8d5a36a | 2018-11-15 20:46:55 +0000 | [diff] [blame] | 1148 |  | 
|  | 1149 | .. _amdgpu-amdhsa-code-object-metadata-v2: | 
|  | 1150 |  | 
|  | 1151 | Code Object V2 Metadata (-mattr=-code-object-v3) | 
|  | 1152 | ++++++++++++++++++++++++++++++++++++++++++++++++ | 
|  | 1153 |  | 
| Scott Linder | ac20b74 | 2019-03-28 15:08:52 +0000 | [diff] [blame] | 1154 | .. warning:: Code Object V2 is not the default code object version emitted by | 
|  | 1155 | this version of LLVM. For a description of the metadata generated with the | 
|  | 1156 | default configuration (Code Object V3) see | 
|  | 1157 | :ref:`amdgpu-amdhsa-code-object-metadata-v3`. | 
|  | 1158 |  | 
| Scott Linder | 8d5a36a | 2018-11-15 20:46:55 +0000 | [diff] [blame] | 1159 | Code object V2 metadata is specified by the ``NT_AMD_AMDGPU_METADATA`` note | 
|  | 1160 | record (see :ref:`amdgpu-note-records-v2`). | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1161 |  | 
| Sylvestre Ledru | e3fdbae | 2017-06-26 02:45:39 +0000 | [diff] [blame] | 1162 | The metadata is specified as a YAML formatted string (see [YAML]_ and | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1163 | :doc:`YamlIO`). | 
|  | 1164 |  | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 1165 | .. TODO | 
|  | 1166 | Is the string null terminated? It probably should not if YAML allows it to | 
|  | 1167 | contain null characters, otherwise it should be. | 
|  | 1168 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1169 | The metadata is represented as a single YAML document comprised of the mapping | 
| Scott Linder | 8d5a36a | 2018-11-15 20:46:55 +0000 | [diff] [blame] | 1170 | defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v2` and | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1171 | referenced tables. | 
|  | 1172 |  | 
|  | 1173 | For boolean values, the string values of ``false`` and ``true`` are used for | 
|  | 1174 | false and true respectively. | 
|  | 1175 |  | 
|  | 1176 | Additional information can be added to the mappings. To avoid conflicts, any | 
|  | 1177 | non-AMD key names should be prefixed by "*vendor-name*.". | 
|  | 1178 |  | 
| Scott Linder | 8d5a36a | 2018-11-15 20:46:55 +0000 | [diff] [blame] | 1179 | .. table:: AMDHSA Code Object V2 Metadata Map | 
|  | 1180 | :name: amdgpu-amdhsa-code-object-metadata-map-table-v2 | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1181 |  | 
|  | 1182 | ========== ============== ========= ======================================= | 
|  | 1183 | String Key Value Type     Required? Description | 
|  | 1184 | ========== ============== ========= ======================================= | 
|  | 1185 | "Version"  sequence of    Required  - The first integer is the major | 
|  | 1186 | 2 integers                 version. Currently 1. | 
|  | 1187 | - The second integer is the minor | 
|  | 1188 | version. Currently 0. | 
|  | 1189 | "Printf"   sequence of              Each string is encoded information | 
|  | 1190 | strings                  about a printf function call. The | 
|  | 1191 | encoded information is organized as | 
|  | 1192 | fields separated by colon (':'): | 
|  | 1193 |  | 
|  | 1194 | ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` | 
|  | 1195 |  | 
|  | 1196 | where: | 
|  | 1197 |  | 
|  | 1198 | ``ID`` | 
|  | 1199 | A 32 bit integer as a unique id for | 
|  | 1200 | each printf function call | 
|  | 1201 |  | 
|  | 1202 | ``N`` | 
|  | 1203 | A 32 bit integer equal to the number | 
|  | 1204 | of arguments of printf function call | 
|  | 1205 | minus 1 | 
|  | 1206 |  | 
|  | 1207 | ``S[i]`` (where i = 0, 1, ... , N-1) | 
|  | 1208 | 32 bit integers for the size in bytes | 
|  | 1209 | of the i-th FormatString argument of | 
|  | 1210 | the printf function call | 
|  | 1211 |  | 
|  | 1212 | FormatString | 
|  | 1213 | The format string passed to the | 
|  | 1214 | printf function call. | 
|  | 1215 | "Kernels"  sequence of    Required  Sequence of the mappings for each | 
|  | 1216 | mapping                  kernel in the code object. See | 
| Scott Linder | 8d5a36a | 2018-11-15 20:46:55 +0000 | [diff] [blame] | 1217 | :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2` | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1218 | for the definition of the mapping. | 
|  | 1219 | ========== ============== ========= ======================================= | 
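
The "Printf" string layout described in the table above can be decoded along
the following lines. This is a minimal sketch: the function name is
hypothetical, the integer fields are assumed to be written in decimal, and the
format string is assumed to be everything after the last size field, so that
colons inside the format string are preserved.

```python
def parse_printf_entry(entry):
    """Split ``ID:N:S[0]:...:S[N-1]:FormatString`` into its fields.

    Hypothetical helper for illustration; not part of any runtime API.
    Returns (id, sizes, format_string).
    """
    id_field, count_field, rest = entry.split(":", 2)
    count = int(count_field)
    # Split off exactly N size fields; the remainder is the format string,
    # which may itself contain ':' characters.
    parts = rest.split(":", count)
    sizes = [int(size) for size in parts[:count]]
    format_string = parts[count]
    return int(id_field), sizes, format_string


# Hypothetical entry: printf id 1, two arguments of 4 and 8 bytes.
printf_id, arg_sizes, fmt = parse_printf_entry("1:2:4:8:x=%d: y=%ld")
```

Limiting the number of splits, rather than splitting on every colon, is what
keeps a format string such as ``x=%d: y=%ld`` intact.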
|  | 1220 |  | 
|  | 1221 | .. | 
|  | 1222 |  | 
| Scott Linder | 8d5a36a | 2018-11-15 20:46:55 +0000 | [diff] [blame] | 1223 | .. table:: AMDHSA Code Object V2 Kernel Metadata Map | 
|  | 1224 | :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2 | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1225 |  | 
|  | 1226 | ================= ============== ========= ================================ | 
|  | 1227 | String Key        Value Type     Required? Description | 
|  | 1228 | ================= ============== ========= ================================ | 
|  | 1229 | "Name"            string         Required  Source name of the kernel. | 
|  | 1230 | "SymbolName"      string         Required  Name of the kernel | 
|  | 1231 | descriptor ELF symbol. | 
|  | 1232 | "Language"        string                   Source language of the kernel. | 
|  | 1233 | Values include: | 
|  | 1234 |  | 
|  | 1235 | - "OpenCL C" | 
|  | 1236 | - "OpenCL C++" | 
|  | 1237 | - "HCC" | 
|  | 1238 | - "OpenMP" | 
|  | 1239 |  | 
|  | 1240 | "LanguageVersion" sequence of              - The first integer is the major | 
|  | 1241 | 2 integers                 version. | 
|  | 1242 | - The second integer is the | 
|  | 1243 | minor version. | 
|  | 1244 | "Attrs"           mapping                  Mapping of kernel attributes. | 
|  | 1245 | See | 
| Scott Linder | 8d5a36a | 2018-11-15 20:46:55 +0000 | [diff] [blame] | 1246 | :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2` | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1247 | for the mapping definition. | 
| Konstantin Zhuravlyov | a01d8b0 | 2017-10-14 19:03:51 +0000 | [diff] [blame] | 1248 | "Args"            sequence of              Sequence of mappings of the | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1249 | mapping                  kernel arguments. See | 
| Scott Linder | 8d5a36a | 2018-11-15 20:46:55 +0000 | [diff] [blame] | 1250 | :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2` | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1251 | for the definition of the mapping. | 
|  | 1252 | "CodeProps"       mapping                  Mapping of properties related to | 
|  | 1253 | the kernel code. See | 
| Scott Linder | 8d5a36a | 2018-11-15 20:46:55 +0000 | [diff] [blame] | 1254 | :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2` | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1255 | for the mapping definition. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1256 | ================= ============== ========= ================================ | 
|  | 1257 |  | 
|  | 1258 | .. | 
|  | 1259 |  | 
| Scott Linder | 8d5a36a | 2018-11-15 20:46:55 +0000 | [diff] [blame] | 1260 | .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map | 
|  | 1261 | :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2 | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1262 |  | 
|  | 1263 | =================== ============== ========= ============================== | 
|  | 1264 | String Key          Value Type     Required? Description | 
|  | 1265 | =================== ============== ========= ============================== | 
| Tony Tye | e039d0e | 2018-01-30 23:07:10 +0000 | [diff] [blame] | 1266 | "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values | 
|  | 1267 | 3 integers               must be >=1 and the dispatch | 
|  | 1268 | work-group size X, Y, Z must | 
|  | 1269 | correspond to the specified | 
|  | 1270 | values. Defaults to 0, 0, 0. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1271 |  | 
|  | 1272 | Corresponds to the OpenCL | 
|  | 1273 | ``reqd_work_group_size`` | 
|  | 1274 | attribute. | 
|  | 1275 | "WorkGroupSizeHint" sequence of              The dispatch work-group size | 
|  | 1276 | 3 integers               X, Y, Z is likely to be the | 
|  | 1277 | specified values. | 
|  | 1278 |  | 
|  | 1279 | Corresponds to the OpenCL | 
|  | 1280 | ``work_group_size_hint`` | 
|  | 1281 | attribute. | 
|  | 1282 | "VecTypeHint"       string                   The name of a scalar or vector | 
|  | 1283 | type. | 
|  | 1284 |  | 
|  | 1285 | Corresponds to the OpenCL | 
|  | 1286 | ``vec_type_hint`` attribute. | 
| Yaxun Liu | de4b88d | 2017-10-10 19:39:48 +0000 | [diff] [blame] | 1287 |  | 
|  | 1288 | "RuntimeHandle"     string                   The external symbol name | 
|  | 1289 | associated with a kernel. | 
|  | 1290 | OpenCL runtime allocates a | 
|  | 1291 | global buffer for the symbol | 
|  | 1292 | and saves the kernel's address | 
|  | 1293 | to it, which is used for | 
|  | 1294 | device side enqueueing. Only | 
|  | 1295 | available for device side | 
|  | 1296 | enqueued kernels. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1297 | =================== ============== ========= ============================== | 
|  | 1298 |  | 
|  | 1299 | .. | 
|  | 1300 |  | 
| Scott Linder | 8d5a36a | 2018-11-15 20:46:55 +0000 | [diff] [blame] | 1301 | .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map | 
|  | 1302 | :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2 | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1303 |  | 
|  | 1304 | ================= ============== ========= ================================ | 
|  | 1305 | String Key        Value Type     Required? Description | 
|  | 1306 | ================= ============== ========= ================================ | 
|  | 1307 | "Name"            string                   Kernel argument name. | 
|  | 1308 | "TypeName"        string                   Kernel argument type name. | 
|  | 1309 | "Size"            integer        Required  Kernel argument size in bytes. | 
|  | 1310 | "Align"           integer        Required  Kernel argument alignment in | 
|  | 1311 | bytes. Must be a power of two. | 
|  | 1312 | "ValueKind"       string         Required  Kernel argument kind that | 
|  | 1313 | specifies how to set up the | 
|  | 1314 | corresponding argument. | 
|  | 1315 | Values include: | 
|  | 1316 |  | 
|  | 1317 | "ByValue" | 
|  | 1318 | The argument is copied | 
|  | 1319 | directly into the kernarg. | 
|  | 1320 |  | 
|  | 1321 | "GlobalBuffer" | 
|  | 1322 | A global address space pointer | 
|  | 1323 | to the buffer data is passed | 
|  | 1324 | in the kernarg. | 
|  | 1325 |  | 
|  | 1326 | "DynamicSharedPointer" | 
|  | 1327 | A group address space pointer | 
|  | 1328 | to dynamically allocated LDS | 
|  | 1329 | is passed in the kernarg. | 
|  | 1330 |  | 
|  | 1331 | "Sampler" | 
|  | 1332 | A global address space | 
|  | 1333 | pointer to a S# is passed in | 
|  | 1334 | the kernarg. | 
|  | 1335 |  | 
|  | 1336 | "Image" | 
|  | 1337 | A global address space | 
|  | 1338 | pointer to a T# is passed in | 
|  | 1339 | the kernarg. | 
|  | 1340 |  | 
|  | 1341 | "Pipe" | 
|  | 1342 | A global address space pointer | 
|  | 1343 | to an OpenCL pipe is passed in | 
|  | 1344 | the kernarg. | 
|  | 1345 |  | 
|  | 1346 | "Queue" | 
|  | 1347 | A global address space pointer | 
|  | 1348 | to an OpenCL device enqueue | 
|  | 1349 | queue is passed in the | 
|  | 1350 | kernarg. | 
|  | 1351 |  | 
|  | 1352 | "HiddenGlobalOffsetX" | 
|  | 1353 | The OpenCL grid dispatch | 
|  | 1354 | global offset for the X | 
|  | 1355 | dimension is passed in the | 
|  | 1356 | kernarg. | 
|  | 1357 |  | 
|  | 1358 | "HiddenGlobalOffsetY" | 
|  | 1359 | The OpenCL grid dispatch | 
|  | 1360 | global offset for the Y | 
|  | 1361 | dimension is passed in the | 
|  | 1362 | kernarg. | 
|  | 1363 |  | 
|  | 1364 | "HiddenGlobalOffsetZ" | 
|  | 1365 | The OpenCL grid dispatch | 
|  | 1366 | global offset for the Z | 
|  | 1367 | dimension is passed in the | 
|  | 1368 | kernarg. | 
|  | 1369 |  | 
|  | 1370 | "HiddenNone" | 
|  | 1371 | An argument that is not used | 
|  | 1372 | by the kernel. Space needs to | 
|  | 1373 | be left for it, but it does | 
|  | 1374 | not need to be set up. | 
|  | 1375 |  | 
|  | 1376 | "HiddenPrintfBuffer" | 
|  | 1377 | A global address space pointer | 
|  | 1378 | to the runtime printf buffer | 
|  | 1379 | is passed in kernarg. | 
|  | 1380 |  | 
|  | 1381 | "HiddenDefaultQueue" | 
|  | 1382 | A global address space pointer | 
|  | 1383 | to the OpenCL device enqueue | 
|  | 1384 | queue that should be used by | 
|  | 1385 | the kernel by default is | 
|  | 1386 | passed in the kernarg. | 
|  | 1387 |  | 
|  | 1388 | "HiddenCompletionAction" | 
| Yaxun Liu | c928f2a | 2017-10-30 14:30:28 +0000 | [diff] [blame] | 1389 | A global address space pointer | 
|  | 1390 | to help link enqueued kernels into | 
|  | 1391 | the ancestor tree for determining | 
|  | 1392 | when the parent kernel has finished. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1393 |  | 
|  | 1394 | "ValueType"       string         Required  Kernel argument value type. Only | 
|  | 1395 | present if "ValueKind" is | 
|  | 1396 | "ByValue". For vector data | 
|  | 1397 | types, the value is for the | 
|  | 1398 | element type. Values include: | 
|  | 1399 |  | 
|  | 1400 | - "Struct" | 
|  | 1401 | - "I8" | 
|  | 1402 | - "U8" | 
|  | 1403 | - "I16" | 
|  | 1404 | - "U16" | 
|  | 1405 | - "F16" | 
|  | 1406 | - "I32" | 
|  | 1407 | - "U32" | 
|  | 1408 | - "F32" | 
|  | 1409 | - "I64" | 
|  | 1410 | - "U64" | 
|  | 1411 | - "F64" | 
|  | 1412 |  | 
|  | 1413 | .. TODO | 
|  | 1414 | How can it be determined if a | 
|  | 1415 | vector type, and what size | 
|  | 1416 | vector? | 
|  | 1417 | "PointeeAlign"    integer                  Alignment in bytes of pointee | 
|  | 1418 | type for pointer type kernel | 
|  | 1419 | argument. Must be a power | 
|  | 1420 | of 2. Only present if | 
|  | 1421 | "ValueKind" is | 
|  | 1422 | "DynamicSharedPointer". | 
|  | 1423 | "AddrSpaceQual"   string                   Kernel argument address space | 
|  | 1424 | qualifier. Only present if | 
|  | 1425 | "ValueKind" is "GlobalBuffer" or | 
|  | 1426 | "DynamicSharedPointer". Values | 
|  | 1427 | are: | 
|  | 1428 |  | 
|  | 1429 | - "Private" | 
|  | 1430 | - "Global" | 
|  | 1431 | - "Constant" | 
|  | 1432 | - "Local" | 
|  | 1433 | - "Generic" | 
|  | 1434 | - "Region" | 
|  | 1435 |  | 
|  | 1436 | .. TODO | 
|  | 1437 | Is GlobalBuffer only Global | 
|  | 1438 | or Constant? Is | 
|  | 1439 | DynamicSharedPointer always | 
|  | 1440 | Local? Can HCC allow Generic? | 
|  | 1441 | How can Private or Region | 
|  | 1442 | ever happen? | 
|  | 1443 | "AccQual"         string                   Kernel argument access | 
|  | 1444 | qualifier. Only present if | 
|  | 1445 | "ValueKind" is "Image" or | 
|  | 1446 | "Pipe". Values | 
|  | 1447 | are: | 
|  | 1448 |  | 
|  | 1449 | - "ReadOnly" | 
|  | 1450 | - "WriteOnly" | 
|  | 1451 | - "ReadWrite" | 
|  | 1452 |  | 
|  | 1453 | .. TODO | 
|  | 1454 | Does this apply to | 
|  | 1455 | GlobalBuffer? | 
| Konstantin Zhuravlyov | a01d8b0 | 2017-10-14 19:03:51 +0000 | [diff] [blame] | 1456 | "ActualAccQual"   string                   The actual memory accesses | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1457 | performed by the kernel on the | 
|  | 1458 | kernel argument. Only present if | 
|  | 1459 | "ValueKind" is "GlobalBuffer", | 
|  | 1460 | "Image", or "Pipe". This may be | 
|  | 1461 | more restrictive than indicated | 
|  | 1462 | by "AccQual" to reflect what the | 
|  | 1463 | kernel actually does. If not | 
|  | 1464 | present then the runtime must | 
|  | 1465 | assume what is implied by | 
|  | 1466 | "AccQual" and "IsConst". Values | 
|  | 1467 | are: | 
|  | 1468 |  | 
|  | 1469 | - "ReadOnly" | 
|  | 1470 | - "WriteOnly" | 
|  | 1471 | - "ReadWrite" | 
|  | 1472 |  | 
|  | 1473 | "IsConst"         boolean                  Indicates if the kernel argument | 
|  | 1474 | is const qualified. Only present | 
|  | 1475 | if "ValueKind" is | 
|  | 1476 | "GlobalBuffer". | 
|  | 1477 |  | 
|  | 1478 | "IsRestrict"      boolean                  Indicates if the kernel argument | 
|  | 1479 | is restrict qualified. Only | 
|  | 1480 | present if "ValueKind" is | 
|  | 1481 | "GlobalBuffer". | 
|  | 1482 |  | 
|  | 1483 | "IsVolatile"      boolean                  Indicates if the kernel argument | 
|  | 1484 | is volatile qualified. Only | 
|  | 1485 | present if "ValueKind" is | 
|  | 1486 | "GlobalBuffer". | 
|  | 1487 |  | 
|  | 1488 | "IsPipe"          boolean                  Indicates if the kernel argument | 
|  | 1489 | is pipe qualified. Only present | 
|  | 1490 | if "ValueKind" is "Pipe". | 
|  | 1491 |  | 
|  | 1492 | .. TODO | 
|  | 1493 | Can GlobalBuffer be pipe | 
|  | 1494 | qualified? | 
|  | 1495 | ================= ============== ========= ================================ | 
|  | 1496 |  | 
|  | 1497 | .. | 
|  | 1498 |  | 
| Scott Linder | 8d5a36a | 2018-11-15 20:46:55 +0000 | [diff] [blame] | 1499 | .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map | 
|  | 1500 | :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2 | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1501 |  | 
|  | 1502 | ============================ ============== ========= ===================== | 
|  | 1503 | String Key                   Value Type     Required? Description | 
|  | 1504 | ============================ ============== ========= ===================== | 
|  | 1505 | "KernargSegmentSize"         integer        Required  The size in bytes of | 
|  | 1506 | the kernarg segment | 
|  | 1507 | that holds the values | 
|  | 1508 | of the arguments to | 
|  | 1509 | the kernel. | 
|  | 1510 | "GroupSegmentFixedSize"      integer        Required  The amount of group | 
|  | 1511 | segment memory | 
|  | 1512 | required by a | 
|  | 1513 | work-group in | 
|  | 1514 | bytes. This does not | 
|  | 1515 | include any | 
|  | 1516 | dynamically allocated | 
|  | 1517 | group segment memory | 
|  | 1518 | that may be added | 
|  | 1519 | when the kernel is | 
|  | 1520 | dispatched. | 
|  | 1521 | "PrivateSegmentFixedSize"    integer        Required  The amount of fixed | 
|  | 1522 | private address space | 
|  | 1523 | memory required for a | 
|  | 1524 | work-item in | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 1525 | bytes. If the kernel | 
|  | 1526 | uses a dynamic call | 
|  | 1527 | stack then additional | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1528 | space must be added | 
|  | 1529 | to this value for the | 
|  | 1530 | call stack. | 
|  | 1531 | "KernargSegmentAlign"        integer        Required  The maximum byte | 
|  | 1532 | alignment of | 
|  | 1533 | arguments in the | 
|  | 1534 | kernarg segment. Must | 
|  | 1535 | be a power of 2. | 
|  | 1536 | "WavefrontSize"              integer        Required  Wavefront size. Must | 
|  | 1537 | be a power of 2. | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 1538 | "NumSGPRs"                   integer        Required  Number of scalar | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1539 | registers used by a | 
|  | 1540 | wavefront for | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 1541 | GFX6-GFX10. This | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1542 | includes the special | 
|  | 1543 | SGPRs for VCC, Flat | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 1544 | Scratch (GFX7-GFX10) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1545 | and XNACK (for | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 1546 | GFX8-GFX10). It does | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1547 | not include the 16 | 
|  | 1548 | SGPR added if a trap | 
|  | 1549 | handler is | 
|  | 1550 | enabled. It is not | 
|  | 1551 | rounded up to the | 
|  | 1552 | allocation | 
|  | 1553 | granularity. | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 1554 | "NumVGPRs"                   integer        Required  Number of vector | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1555 | registers used by | 
|  | 1556 | each work-item for | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 1557 | GFX6-GFX10. | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 1558 | "MaxFlatWorkGroupSize"       integer        Required  Maximum flat | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1559 | work-group size | 
|  | 1560 | supported by the | 
|  | 1561 | kernel in work-items. | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 1562 | Must be >=1 and | 
| Tony Tye | e039d0e | 2018-01-30 23:07:10 +0000 | [diff] [blame] | 1563 | consistent with | 
|  | 1564 | ReqdWorkGroupSize if | 
|  | 1565 | not 0, 0, 0. | 
| Konstantin Zhuravlyov | 06ae4ec | 2017-11-28 17:51:08 +0000 | [diff] [blame] | 1566 | "NumSpilledSGPRs"            integer                  Number of stores from | 
|  | 1567 | a scalar register to | 
|  | 1568 | a register allocator | 
|  | 1569 | created spill | 
|  | 1570 | location. | 
|  | 1571 | "NumSpilledVGPRs"            integer                  Number of stores from | 
|  | 1572 | a vector register to | 
|  | 1573 | a register allocator | 
|  | 1574 | created spill | 
|  | 1575 | location. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1576 | ============================ ============== ========= ===================== | 
|  | 1577 |  | 
| Scott Linder | 8d5a36a | 2018-11-15 20:46:55 +0000 | [diff] [blame] | 1578 | .. _amdgpu-amdhsa-code-object-metadata-v3: | 
|  | 1579 |  | 
|  | 1580 | Code Object V3 Metadata (-mattr=+code-object-v3) | 
|  | 1581 | ++++++++++++++++++++++++++++++++++++++++++++++++ | 
|  | 1582 |  | 
|  | 1583 | Code object V3 metadata is specified by the ``NT_AMDGPU_METADATA`` note record | 
|  | 1584 | (see :ref:`amdgpu-note-records-v3`). | 
|  | 1585 |  | 
|  | 1586 | The metadata is represented as Message Pack formatted binary data (see | 
|  | 1587 | [MsgPack]_). The top level is a Message Pack map that includes the | 
|  | 1588 | keys defined in table | 
|  | 1589 | :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced | 
|  | 1590 | tables. | 
|  | 1591 |  | 
|  | 1592 | Additional information can be added to the maps. To avoid conflicts, | 
|  | 1593 | any key names should be prefixed by "*vendor-name*." where | 
|  | 1594 | ``vendor-name`` can be the name of the vendor and specific vendor | 
|  | 1595 | tool that generates the information. The prefix is abbreviated to | 
|  | 1596 | simply "." when it appears within a map that has been added by the | 
|  | 1597 | same *vendor-name*. | 
|  | 1598 |  | 
|  | 1599 | .. table:: AMDHSA Code Object V3 Metadata Map | 
|  | 1600 | :name: amdgpu-amdhsa-code-object-metadata-map-table-v3 | 
|  | 1601 |  | 
|  | 1602 | ================= ============== ========= ======================================= | 
|  | 1603 | String Key        Value Type     Required? Description | 
|  | 1604 | ================= ============== ========= ======================================= | 
|  | 1605 | "amdhsa.version"  sequence of    Required  - The first integer is the major | 
|  | 1606 | 2 integers                 version. Currently 1. | 
|  | 1607 | - The second integer is the minor | 
|  | 1608 | version. Currently 0. | 
|  | 1609 | "amdhsa.printf"   sequence of              Each string is encoded information | 
|  | 1610 | strings                  about a printf function call. The | 
|  | 1611 | encoded information is organized as | 
|  | 1612 | fields separated by colon (':'): | 
|  | 1613 |  | 
|  | 1614 | ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` | 
|  | 1615 |  | 
|  | 1616 | where: | 
|  | 1617 |  | 
|  | 1618 | ``ID`` | 
|  | 1619 | A 32 bit integer as a unique id for | 
|  | 1620 | each printf function call | 
|  | 1621 |  | 
|  | 1622 | ``N`` | 
|  | 1623 | A 32 bit integer equal to the number | 
|  | 1624 | of arguments of printf function call | 
|  | 1625 | minus 1 | 
|  | 1626 |  | 
|  | 1627 | ``S[i]`` (where i = 0, 1, ... , N-1) | 
|  | 1628 | 32 bit integers for the size in bytes | 
|  | 1629 | of the i-th FormatString argument of | 
|  | 1630 | the printf function call | 
|  | 1631 |  | 
|  | 1632 | FormatString | 
|  | 1633 | The format string passed to the | 
|  | 1634 | printf function call. | 
|  | 1635 | "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each | 
|  | 1636 | map                      kernel in the code object. See | 
|  | 1637 | :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3` | 
|  | 1638 | for the definition of the keys included | 
|  | 1639 | in that map. | 
|  | 1640 | ================= ============== ========= ======================================= | 
|  | 1641 |  | 
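The ``amdhsa.printf`` string layout described in the table above can be decoded with a few lines of code. The following is a minimal sketch, assuming the string has already been extracted from the decoded Message Pack map; the names ``PrintfInfo`` and ``parse_printf_entry`` are hypothetical and are not part of LLVM or the ROCm runtime. Because the format string may itself contain colons, the sketch limits the number of splits instead of splitting on every colon.

```python
from typing import List, NamedTuple


class PrintfInfo(NamedTuple):
    """Decoded fields of one "amdhsa.printf" entry (hypothetical helper type)."""
    printf_id: int
    arg_sizes: List[int]
    format_string: str


def parse_printf_entry(entry: str) -> PrintfInfo:
    """Parse ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``."""
    # Split off ID and N; everything after them is kept intact for now.
    id_text, n_text, rest = entry.split(":", 2)
    count = int(n_text)
    # The next `count` fields are argument sizes. The remainder is the format
    # string, which may contain ':' itself, so cap the number of splits.
    fields = rest.split(":", count)
    if len(fields) != count + 1:
        raise ValueError("entry has fewer size fields than N announces")
    sizes = [int(field) for field in fields[:count]]
    return PrintfInfo(int(id_text), sizes, fields[count])


if __name__ == "__main__":
    # Hypothetical entry: id 1, two arguments of 4 and 8 bytes.
    print(parse_printf_entry("1:2:4:8:x=%d y=%ld: done\n"))
```
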
|  | 1642 | .. | 
|  | 1643 |  | 
|  | 1644 | .. table:: AMDHSA Code Object V3 Kernel Metadata Map | 
|  | 1645 | :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3 | 
|  | 1646 |  | 
|  | 1647 | =================================== ============== ========= ================================ | 
|  | 1648 | String Key                          Value Type     Required? Description | 
|  | 1649 | =================================== ============== ========= ================================ | 
|  | 1650 | ".name"                             string         Required  Source name of the kernel. | 
|  | 1651 | ".symbol"                           string         Required  Name of the kernel | 
|  | 1652 | descriptor ELF symbol. | 
|  | 1653 | ".language"                         string                   Source language of the kernel. | 
|  | 1654 | Values include: | 
|  | 1655 |  | 
|  | 1656 | - "OpenCL C" | 
|  | 1657 | - "OpenCL C++" | 
|  | 1658 | - "HCC" | 
|  | 1659 | - "HIP" | 
|  | 1660 | - "OpenMP" | 
|  | 1661 | - "Assembler" | 
|  | 1662 |  | 
|  | 1663 | ".language_version"                 sequence of              - The first integer is the major | 
|  | 1664 | 2 integers                 version. | 
|  | 1665 | - The second integer is the | 
|  | 1666 | minor version. | 
|  | 1667 | ".args"                             sequence of              Sequence of maps of the | 
|  | 1668 | map                      kernel arguments. See | 
|  | 1669 | :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3` | 
|  | 1670 | for the definition of the keys | 
|  | 1671 | included in that map. | 
|  | 1672 | ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values | 
|  | 1673 | 3 integers               must be >=1 and the dispatch | 
|  | 1674 | work-group size X, Y, Z must | 
|  | 1675 | correspond to the specified | 
|  | 1676 | values. Defaults to 0, 0, 0. | 
|  | 1677 |  | 
|  | 1678 | Corresponds to the OpenCL | 
|  | 1679 | ``reqd_work_group_size`` | 
|  | 1680 | attribute. | 
|  | 1681 | ".workgroup_size_hint"              sequence of              The dispatch work-group size | 
|  | 1682 | 3 integers               X, Y, Z is likely to be the | 
|  | 1683 | specified values. | 
|  | 1684 |  | 
|  | 1685 | Corresponds to the OpenCL | 
|  | 1686 | ``work_group_size_hint`` | 
|  | 1687 | attribute. | 
|  | 1688 | ".vec_type_hint"                    string                   The name of a scalar or vector | 
|  | 1689 | type. | 
|  | 1690 |  | 
|  | 1691 | Corresponds to the OpenCL | 
|  | 1692 | ``vec_type_hint`` attribute. | 
|  | 1693 |  | 
|  | 1694 | ".device_enqueue_symbol"            string                   The external symbol name | 
|  | 1695 | associated with a kernel. | 
|  | 1696 | The OpenCL runtime allocates a | 
|  | 1697 | global buffer for the symbol | 
|  | 1698 | and saves the kernel's address | 
|  | 1699 | to it, which is used for | 
|  | 1700 | device side enqueueing. Only | 
|  | 1701 | available for device side | 
|  | 1702 | enqueued kernels. | 
|  | 1703 | ".kernarg_segment_size"             integer        Required  The size in bytes of | 
|  | 1704 | the kernarg segment | 
|  | 1705 | that holds the values | 
|  | 1706 | of the arguments to | 
|  | 1707 | the kernel. | 
|  | 1708 | ".group_segment_fixed_size"         integer        Required  The amount of group | 
|  | 1709 | segment memory | 
|  | 1710 | required by a | 
|  | 1711 | work-group in | 
|  | 1712 | bytes. This does not | 
|  | 1713 | include any | 
|  | 1714 | dynamically allocated | 
|  | 1715 | group segment memory | 
|  | 1716 | that may be added | 
|  | 1717 | when the kernel is | 
|  | 1718 | dispatched. | 
|  | 1719 | ".private_segment_fixed_size"       integer        Required  The amount of fixed | 
|  | 1720 | private address space | 
|  | 1721 | memory required for a | 
|  | 1722 | work-item in | 
|  | 1723 | bytes. If the kernel | 
|  | 1724 | uses a dynamic call | 
|  | 1725 | stack then additional | 
|  | 1726 | space must be added | 
|  | 1727 | to this value for the | 
|  | 1728 | call stack. | 
|  | 1729 | ".kernarg_segment_align"            integer        Required  The maximum byte | 
|  | 1730 | alignment of | 
|  | 1731 | arguments in the | 
|  | 1732 | kernarg segment. Must | 
|  | 1733 | be a power of 2. | 
|  | 1734 | ".wavefront_size"                   integer        Required  Wavefront size. Must | 
|  | 1735 | be a power of 2. | 
|  | 1736 | ".sgpr_count"                       integer        Required  Number of scalar | 
|  | 1737 | registers required by a | 
|  | 1738 | wavefront for | 
|  | 1739 | GFX6-GFX9. A register | 
|  | 1740 | is required if it is | 
|  | 1741 | used explicitly, or | 
|  | 1742 | if a higher numbered | 
|  | 1743 | register is used | 
|  | 1744 | explicitly. This | 
|  | 1745 | includes the special | 
|  | 1746 | SGPRs for VCC, Flat | 
|  | 1747 | Scratch (GFX7-GFX9) | 
|  | 1748 | and XNACK (for | 
|  | 1749 | GFX8-GFX9). It does | 
|  | 1750 | not include the 16 | 
|  | 1751 | SGPRs added if a trap | 
|  | 1752 | handler is | 
|  | 1753 | enabled. It is not | 
|  | 1754 | rounded up to the | 
|  | 1755 | allocation | 
|  | 1756 | granularity. | 
|  | 1757 | ".vgpr_count"                       integer        Required  Number of vector | 
|  | 1758 | registers required by | 
|  | 1759 | each work-item for | 
|  | 1760 | GFX6-GFX9. A register | 
|  | 1761 | is required if it is | 
|  | 1762 | used explicitly, or | 
|  | 1763 | if a higher numbered | 
|  | 1764 | register is used | 
|  | 1765 | explicitly. | 
|  | 1766 | ".max_flat_workgroup_size"          integer        Required  Maximum flat | 
|  | 1767 | work-group size | 
|  | 1768 | supported by the | 
|  | 1769 | kernel in work-items. | 
|  | 1770 | Must be >=1 and | 
|  | 1771 | consistent with | 
|  | 1772 | ".reqd_workgroup_size" if | 
|  | 1773 | not 0, 0, 0. | 
|  | 1774 | ".sgpr_spill_count"                 integer                  Number of stores from | 
|  | 1775 | a scalar register to | 
|  | 1776 | a register allocator | 
|  | 1777 | created spill | 
|  | 1778 | location. | 
|  | 1779 | ".vgpr_spill_count"                 integer                  Number of stores from | 
|  | 1780 | a vector register to | 
|  | 1781 | a register allocator | 
|  | 1782 | created spill | 
|  | 1783 | location. | 
|  | 1784 | =================================== ============== ========= ================================ | 
|  | 1785 |  | 
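The constraints listed in the kernel metadata table above can be checked mechanically once the Message Pack data has been decoded. The following is a minimal sketch, assuming each kernel map is available as a plain Python ``dict``; ``check_kernel_metadata`` and the example values are hypothetical. The table only says that the maximum flat work-group size must be "consistent" with the required work-group size; the sketch interprets that as the product of the three required dimensions not exceeding the maximum, which is an assumption.

```python
REQUIRED_KERNEL_KEYS = (
    ".name", ".symbol", ".kernarg_segment_size", ".group_segment_fixed_size",
    ".private_segment_fixed_size", ".kernarg_segment_align", ".wavefront_size",
    ".sgpr_count", ".vgpr_count", ".max_flat_workgroup_size",
)


def is_power_of_two(value: int) -> bool:
    return value > 0 and (value & (value - 1)) == 0


def check_kernel_metadata(kernel: dict) -> list:
    """Return a list of human readable problems; empty if none were found."""
    problems = []
    for key in REQUIRED_KERNEL_KEYS:
        if key not in kernel:
            problems.append(f"missing required key {key}")
    for key in (".kernarg_segment_align", ".wavefront_size"):
        if key in kernel and not is_power_of_two(kernel[key]):
            problems.append(f"{key} must be a power of 2")
    max_flat = kernel.get(".max_flat_workgroup_size")
    if max_flat is not None and max_flat < 1:
        problems.append(".max_flat_workgroup_size must be >= 1")
    reqd = list(kernel.get(".reqd_workgroup_size", [0, 0, 0]))
    if reqd != [0, 0, 0]:
        if any(dim < 1 for dim in reqd):
            problems.append(".reqd_workgroup_size values must all be >= 1")
        elif max_flat is not None and reqd[0] * reqd[1] * reqd[2] > max_flat:
            # Assumed meaning of "consistent"; see the lead-in text.
            problems.append(".reqd_workgroup_size exceeds .max_flat_workgroup_size")
    return problems


# Hypothetical kernel map; the values are for illustration only.
EXAMPLE_KERNEL = {
    ".name": "vector_add", ".symbol": "vector_add.kd",
    ".kernarg_segment_size": 24, ".group_segment_fixed_size": 0,
    ".private_segment_fixed_size": 0, ".kernarg_segment_align": 8,
    ".wavefront_size": 64, ".sgpr_count": 16, ".vgpr_count": 8,
    ".max_flat_workgroup_size": 256,
}

if __name__ == "__main__":
    print(check_kernel_metadata(EXAMPLE_KERNEL))
```
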
|  | 1786 | .. | 
|  | 1787 |  | 
|  | 1788 | .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map | 
|  | 1789 | :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3 | 
|  | 1790 |  | 
|  | 1791 | ====================== ============== ========= ================================ | 
|  | 1792 | String Key             Value Type     Required? Description | 
|  | 1793 | ====================== ============== ========= ================================ | 
|  | 1794 | ".name"                string                   Kernel argument name. | 
|  | 1795 | ".type_name"           string                   Kernel argument type name. | 
|  | 1796 | ".size"                integer        Required  Kernel argument size in bytes. | 
|  | 1797 | ".offset"              integer        Required  Kernel argument offset in | 
|  | 1798 | bytes. The offset must be a | 
|  | 1799 | multiple of the alignment | 
|  | 1800 | required by the argument. | 
|  | 1801 | ".value_kind"          string         Required  Kernel argument kind that | 
|  | 1802 | specifies how to set up the | 
|  | 1803 | corresponding argument. | 
|  | 1804 | Values include: | 
|  | 1805 |  | 
|  | 1806 | "by_value" | 
|  | 1807 | The argument is copied | 
|  | 1808 | directly into the kernarg. | 
|  | 1809 |  | 
|  | 1810 | "global_buffer" | 
|  | 1811 | A global address space pointer | 
|  | 1812 | to the buffer data is passed | 
|  | 1813 | in the kernarg. | 
|  | 1814 |  | 
|  | 1815 | "dynamic_shared_pointer" | 
|  | 1816 | A group address space pointer | 
|  | 1817 | to dynamically allocated LDS | 
|  | 1818 | is passed in the kernarg. | 
|  | 1819 |  | 
|  | 1820 | "sampler" | 
|  | 1821 | A global address space | 
|  | 1822 | pointer to a S# is passed in | 
|  | 1823 | the kernarg. | 
|  | 1824 |  | 
|  | 1825 | "image" | 
|  | 1826 | A global address space | 
|  | 1827 | pointer to a T# is passed in | 
|  | 1828 | the kernarg. | 
|  | 1829 |  | 
|  | 1830 | "pipe" | 
|  | 1831 | A global address space pointer | 
|  | 1832 | to an OpenCL pipe is passed in | 
|  | 1833 | the kernarg. | 
|  | 1834 |  | 
|  | 1835 | "queue" | 
|  | 1836 | A global address space pointer | 
|  | 1837 | to an OpenCL device enqueue | 
|  | 1838 | queue is passed in the | 
|  | 1839 | kernarg. | 
|  | 1840 |  | 
|  | 1841 | "hidden_global_offset_x" | 
|  | 1842 | The OpenCL grid dispatch | 
|  | 1843 | global offset for the X | 
|  | 1844 | dimension is passed in the | 
|  | 1845 | kernarg. | 
|  | 1846 |  | 
|  | 1847 | "hidden_global_offset_y" | 
|  | 1848 | The OpenCL grid dispatch | 
|  | 1849 | global offset for the Y | 
|  | 1850 | dimension is passed in the | 
|  | 1851 | kernarg. | 
|  | 1852 |  | 
|  | 1853 | "hidden_global_offset_z" | 
|  | 1854 | The OpenCL grid dispatch | 
|  | 1855 | global offset for the Z | 
|  | 1856 | dimension is passed in the | 
|  | 1857 | kernarg. | 
|  | 1858 |  | 
|  | 1859 | "hidden_none" | 
|  | 1860 | An argument that is not used | 
|  | 1861 | by the kernel. Space needs to | 
|  | 1862 | be left for it, but it does | 
|  | 1863 | not need to be set up. | 
|  | 1864 |  | 
|  | 1865 | "hidden_printf_buffer" | 
|  | 1866 | A global address space pointer | 
|  | 1867 | to the runtime printf buffer | 
|  | 1868 | is passed in the kernarg. | 
|  | 1869 |  | 
|  | 1870 | "hidden_default_queue" | 
|  | 1871 | A global address space pointer | 
|  | 1872 | to the OpenCL device enqueue | 
|  | 1873 | queue that should be used by | 
|  | 1874 | the kernel by default is | 
|  | 1875 | passed in the kernarg. | 
|  | 1876 |  | 
|  | 1877 | "hidden_completion_action" | 
|  | 1878 | A global address space pointer | 
|  | 1879 | to help link enqueued kernels into | 
|  | 1880 | the ancestor tree for determining | 
|  | 1881 | when the parent kernel has finished. | 
|  | 1882 |  | 
|  | 1883 | ".value_type"          string         Required  Kernel argument value type. Only | 
|  | 1884 | present if ".value_kind" is | 
|  | 1885 | "by_value". For vector data | 
|  | 1886 | types, the value is for the | 
|  | 1887 | element type. Values include: | 
|  | 1888 |  | 
|  | 1889 | - "struct" | 
|  | 1890 | - "i8" | 
|  | 1891 | - "u8" | 
|  | 1892 | - "i16" | 
|  | 1893 | - "u16" | 
|  | 1894 | - "f16" | 
|  | 1895 | - "i32" | 
|  | 1896 | - "u32" | 
|  | 1897 | - "f32" | 
|  | 1898 | - "i64" | 
|  | 1899 | - "u64" | 
|  | 1900 | - "f64" | 
|  | 1901 |  | 
|  | 1902 | .. TODO | 
|  | 1903 | How can it be determined if a | 
|  | 1904 | vector type, and what size | 
|  | 1905 | vector? | 
|  | 1906 | ".pointee_align"       integer                  Alignment in bytes of pointee | 
|  | 1907 | type for pointer type kernel | 
|  | 1908 | argument. Must be a power | 
|  | 1909 | of 2. Only present if | 
|  | 1910 | ".value_kind" is | 
|  | 1911 | "dynamic_shared_pointer". | 
|  | 1912 | ".address_space"       string                   Kernel argument address space | 
|  | 1913 | qualifier. Only present if | 
|  | 1914 | ".value_kind" is "global_buffer" or | 
|  | 1915 | "dynamic_shared_pointer". Values | 
|  | 1916 | are: | 
|  | 1917 |  | 
|  | 1918 | - "private" | 
|  | 1919 | - "global" | 
|  | 1920 | - "constant" | 
|  | 1921 | - "local" | 
|  | 1922 | - "generic" | 
|  | 1923 | - "region" | 
|  | 1924 |  | 
|  | 1925 | .. TODO | 
|  | 1926 | Is "global_buffer" only "global" | 
|  | 1927 | or "constant"? Is | 
|  | 1928 | "dynamic_shared_pointer" always | 
|  | 1929 | "local"? Can HCC allow "generic"? | 
|  | 1930 | How can "private" or "region" | 
|  | 1931 | ever happen? | 
|  | 1932 | ".access"              string                   Kernel argument access | 
|  | 1933 | qualifier. Only present if | 
|  | 1934 | ".value_kind" is "image" or | 
|  | 1935 | "pipe". Values | 
|  | 1936 | are: | 
|  | 1937 |  | 
|  | 1938 | - "read_only" | 
|  | 1939 | - "write_only" | 
|  | 1940 | - "read_write" | 
|  | 1941 |  | 
|  | 1942 | .. TODO | 
|  | 1943 | Does this apply to | 
|  | 1944 | "global_buffer"? | 
|  | 1945 | ".actual_access"       string                   The actual memory accesses | 
|  | 1946 | performed by the kernel on the | 
|  | 1947 | kernel argument. Only present if | 
|  | 1948 | ".value_kind" is "global_buffer", | 
|  | 1949 | "image", or "pipe". This may be | 
|  | 1950 | more restrictive than indicated | 
|  | 1951 | by ".access" to reflect what the | 
|  | 1952 | kernel actually does. If not | 
|  | 1953 | present then the runtime must | 
|  | 1954 | assume what is implied by | 
|  | 1955 | ".access" and ".is_const". Values | 
|  | 1956 | are: | 
|  | 1957 |  | 
|  | 1958 | - "read_only" | 
|  | 1959 | - "write_only" | 
|  | 1960 | - "read_write" | 
|  | 1961 |  | 
|  | 1962 | ".is_const"            boolean                  Indicates if the kernel argument | 
|  | 1963 | is const qualified. Only present | 
|  | 1964 | if ".value_kind" is | 
|  | 1965 | "global_buffer". | 
|  | 1966 |  | 
|  | 1967 | ".is_restrict"         boolean                  Indicates if the kernel argument | 
|  | 1968 | is restrict qualified. Only | 
|  | 1969 | present if ".value_kind" is | 
|  | 1970 | "global_buffer". | 
|  | 1971 |  | 
|  | 1972 | ".is_volatile"         boolean                  Indicates if the kernel argument | 
|  | 1973 | is volatile qualified. Only | 
|  | 1974 | present if ".value_kind" is | 
|  | 1975 | "global_buffer". | 
|  | 1976 |  | 
|  | 1977 | ".is_pipe"             boolean                  Indicates if the kernel argument | 
|  | 1978 | is pipe qualified. Only present | 
|  | 1979 | if ".value_kind" is "pipe". | 
|  | 1980 |  | 
|  | 1981 | .. TODO | 
|  | 1982 | Can "global_buffer" be pipe | 
|  | 1983 | qualified? | 
|  | 1984 | ====================== ============== ========= ================================ | 
|  | 1985 |  | 
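The ".offset" rule in the table above (each offset must be a multiple of the alignment required by the argument) can be illustrated with a small layout routine. This is a minimal sketch, assuming arguments are placed in declaration order at the next suitably aligned offset and that the caller supplies each argument's size and alignment; ``layout_kernargs`` is a hypothetical helper, not the algorithm used by any particular compiler.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def layout_kernargs(args):
    """Assign offsets to kernel arguments.

    args is a list of (value_kind, size_in_bytes, alignment_in_bytes) tuples.
    Returns (argument maps, end offset of the last argument, maximum alignment).
    """
    offset = 0
    max_align = 1
    arg_maps = []
    for value_kind, size, alignment in args:
        # Each offset must be a multiple of the argument's alignment.
        offset = align_up(offset, alignment)
        arg_maps.append({".size": size, ".offset": offset,
                         ".value_kind": value_kind})
        offset += size
        max_align = max(max_align, alignment)
    return arg_maps, offset, max_align


if __name__ == "__main__":
    # Hypothetical kernel taking an int, a global pointer and a short.
    maps, end, align = layout_kernargs([
        ("by_value", 4, 4),
        ("global_buffer", 8, 8),
        ("by_value", 2, 2),
    ])
    for arg in maps:
        print(arg)
    print("end offset:", end, "max alignment:", align)
```

The maximum alignment computed here plays the role described for ".kernarg_segment_align" in the kernel metadata map; whether the kernarg segment size is rounded up beyond the end offset is not stated above, so the sketch returns the raw end offset.
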
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1986 | .. | 
|  | 1987 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 1988 | Kernel Dispatch | 
|  | 1989 | ~~~~~~~~~~~~~~~ | 
|  | 1990 |  | 
|  | 1991 | The HSA architected queuing language (AQL) defines a user space memory interface | 
|  | 1992 | that can be used to control the dispatch of kernels, in an agent independent | 
|  | 1993 | way. An agent can have zero or more AQL queues created for it using the ROCm | 
|  | 1994 | runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the | 
|  | 1995 | *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue | 
|  | 1996 | mechanics and packet layouts. | 
|  | 1997 |  | 
|  | 1998 | The packet processor of a kernel agent is responsible for detecting and | 
|  | 1999 | dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the | 
|  | 2000 | packet processor is implemented by the hardware command processor (CP), | 
|  | 2001 | asynchronous dispatch controller (ADC) and shader processor input controller | 
|  | 2002 | (SPI). | 
|  | 2003 |  | 
|  | 2004 | The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel | 
|  | 2005 | mode driver to initialize and register the AQL queue with CP. | 
|  | 2006 |  | 
|  | 2007 | To dispatch a kernel the following actions are performed. This can occur in the | 
|  | 2008 | CPU host program, or from an HSA kernel executing on a GPU. | 
|  | 2009 |  | 
|  | 2010 | 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be | 
|  | 2011 | executed is obtained. | 
|  | 2012 | 2. A pointer to the kernel descriptor (see | 
|  | 2013 | :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is | 
|  | 2014 | obtained. It must be for a kernel that is contained in a code object that | 
|  | 2015 | was loaded by the ROCm runtime on the kernel agent with which the AQL queue is | 
|  | 2016 | associated. | 
|  | 2017 | 3. Space is allocated for the kernel arguments using the ROCm runtime allocator | 
|  | 2018 | for a memory region with the kernarg property for the kernel agent that will | 
|  | 2019 | execute the kernel. It must be at least 16 byte aligned. | 
|  | 2020 | 4. Kernel argument values are assigned to the kernel argument memory | 
| Konstantin Zhuravlyov | ea35e46 | 2017-10-19 17:12:55 +0000 | [diff] [blame] | 2021 | allocation. The layout is defined in the *HSA Programmer's Language Reference* | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2022 | [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument | 
|  | 2023 | memory in the same way constant memory is accessed. (Note that the HSA | 
|  | 2024 | specification allows an implementation to copy the kernel argument contents to | 
|  | 2025 | another location that is accessed by the kernel.) | 
|  | 2026 | 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime | 
|  | 2027 | API uses 64 bit atomic operations to reserve space in the AQL queue for the | 
|  | 2028 | packet. The packet must be set up, and the final write must use an atomic | 
|  | 2029 | store release to set the packet kind to ensure the packet contents are | 
|  | 2030 | visible to the kernel agent. AQL defines a doorbell signal mechanism to | 
|  | 2031 | notify the kernel agent that the AQL queue has been updated. These rules, and | 
|  | 2032 | the layout of the AQL queue and kernel dispatch packet are defined in the *HSA | 
|  | 2033 | System Architecture Specification* [HSA]_. | 
|  | 2034 | 6. A kernel dispatch packet includes information about the actual dispatch, | 
|  | 2035 | such as grid and work-group size, together with information from the code | 
|  | 2036 | object about the kernel, such as segment sizes. The ROCm runtime queries on | 
|  | 2037 | the kernel symbol can be used to obtain the code object values which are | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 2038 | recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2039 | 7. CP executes micro-code and is responsible for detecting and setting up the | 
|  | 2040 | GPU to execute the wavefronts of a kernel dispatch. | 
|  | 2041 | 8. CP ensures that when a wavefront starts executing the kernel machine | 
|  | 2042 | code, the scalar general purpose registers (SGPR) and vector general purpose | 
|  | 2043 | registers (VGPR) are set up as required by the machine code. The required | 
|  | 2044 | setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial | 
|  | 2045 | register state is defined in | 
|  | 2046 | :ref:`amdgpu-amdhsa-initial-kernel-execution-state`. | 
|  | 2047 | 9. The prolog of the kernel machine code (see | 
|  | 2048 | :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary | 
|  | 2049 | before continuing to execute the machine code that corresponds to the kernel. | 
|  | 2050 | 10. When the kernel dispatch has completed execution, CP signals the completion | 
|  | 2051 | signal specified in the kernel dispatch packet if not 0. | 
|  | 2052 |  | 
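The ordering requirement in step 5 above (reserve a slot, fill in the packet, and only then set the packet kind) can be modelled in a few lines. The following is a toy sketch, not the ROCm runtime API: ``ModelQueue``, its fields, and the string packet kinds are all hypothetical, a lock stands in for the 64 bit atomic reservation, and a plain assignment stands in for the atomic store release. It does not model waiting for a free slot when the queue is full.

```python
import threading


class ModelQueue:
    """Toy stand-in for an AQL queue; not the ROCm runtime data structure."""

    def __init__(self, size: int):
        self.size = size
        # Every slot starts out as an invalid packet that must be ignored.
        self.slots = [{"kind": "invalid"} for _ in range(size)]
        self.write_index = 0
        self.doorbell = None
        self._lock = threading.Lock()  # stands in for a 64 bit atomic add

    def reserve(self) -> int:
        """Reserve the next packet index (step 5, first part)."""
        with self._lock:
            index = self.write_index
            self.write_index += 1
        return index

    def dispatch(self, kernel_descriptor, kernarg_address, grid, workgroup):
        # Step 3 requires the kernarg memory to be at least 16 byte aligned.
        if kernarg_address % 16 != 0:
            raise ValueError("kernarg memory must be at least 16 byte aligned")
        index = self.reserve()
        slot = self.slots[index % self.size]
        # Fill in the packet body while the slot is still marked invalid.
        slot.update(kernel_descriptor=kernel_descriptor,
                    kernarg_address=kernarg_address,
                    grid=grid, workgroup=workgroup)
        # Final write: publish the packet kind. Real code must use an atomic
        # store release here so the body is visible before the kind changes.
        slot["kind"] = "kernel_dispatch"
        # Notify the kernel agent that the queue has been updated.
        self.doorbell = index
        return index


if __name__ == "__main__":
    queue = ModelQueue(size=4)
    # Hypothetical addresses and sizes, for illustration only.
    print(queue.dispatch(0x1000, 0x2000, grid=(1024, 1, 1), workgroup=(64, 1, 1)))
    print(queue.slots[0])
```
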
|  | 2053 | .. _amdgpu-amdhsa-memory-spaces: | 
|  | 2054 |  | 
|  | 2055 | Memory Spaces | 
|  | 2056 | ~~~~~~~~~~~~~ | 
|  | 2057 |  | 
|  | 2058 | The memory space properties are: | 
|  | 2059 |  | 
|  | 2060 | .. table:: AMDHSA Memory Spaces | 
|  | 2061 | :name: amdgpu-amdhsa-memory-spaces-table | 
|  | 2062 |  | 
|  | 2063 | ================= =========== ======== ======= ================== | 
|  | 2064 | Memory Space Name HSA Segment Hardware Address NULL Value | 
|  | 2065 | Name        Name     Size | 
|  | 2066 | ================= =========== ======== ======= ================== | 
|  | 2067 | Private           private     scratch  32      0x00000000 | 
|  | 2068 | Local             group       LDS      32      0xFFFFFFFF | 
|  | 2069 | Global            global      global   64      0x0000000000000000 | 
|  | 2070 | Constant          constant    *same as 64      0x0000000000000000 | 
|  | 2071 | global* | 
|  | 2072 | Generic           flat        flat     64      0x0000000000000000 | 
|  | 2073 | Region            N/A         GDS      32      *not implemented | 
|  | 2074 | for AMDHSA* | 
|  | 2075 | ================= =========== ======== ======= ================== | 
|  | 2076 |  | 
|  | 2077 | The global and constant memory spaces both use global virtual addresses, which | 
|  | 2078 | are the same virtual address space used by the CPU. However, some virtual | 
|  | 2079 | addresses may only be accessible to the CPU, some only accessible by the GPU, | 
|  | 2080 | and some by both. | 
|  | 2081 |  | 
|  | 2082 | Using the constant memory space indicates that the data will not change during | 
|  | 2083 | the execution of the kernel. This allows scalar read instructions to be | 
|  | 2084 | used. The vector and scalar L1 caches are invalidated of volatile data before | 
|  | 2085 | each kernel dispatch execution to allow constant memory to change values between | 
|  | 2086 | kernel dispatches. | 
|  | 2087 |  | 
|  | 2088 | The local memory space uses the hardware Local Data Store (LDS) which is | 
|  | 2089 | automatically allocated when the hardware creates work-groups of wavefronts, and | 
|  | 2090 | freed when all the wavefronts of a work-group have terminated. The data store | 
|  | 2091 | (DS) instructions can be used to access it. | 
|  | 2092 |  | 
|  | 2093 | The private memory space uses the hardware scratch memory support. If the kernel | 
|  | 2094 | uses scratch, then the hardware allocates memory that is accessed using | 
|  | 2095 | wavefront lane dword (4 byte) interleaving. The mapping used from private | 
|  | 2096 | address to physical address is: | 
|  | 2097 |  | 
|  | 2098 | ``wavefront-scratch-base + | 
|  | 2099 | (private-address * wavefront-size * 4) + | 
|  | 2100 | (wavefront-lane-id * 4)`` | 
|  | 2101 |  | 
|  | 2102 | There are different ways that the wavefront scratch base address is determined | 
|  | 2103 | by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This | 
|  | 2104 | memory can be accessed in an interleaved manner using buffer instructions with | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 2105 | the scratch buffer descriptor and per wavefront scratch offset, by the scratch | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2106 | instructions, or by flat instructions. If each lane of a wavefront accesses the | 
|  | 2107 | same private address, the interleaving results in adjacent dwords being accessed | 
|  | 2108 | and hence requires fewer cache lines to be fetched. Multi-dword access is not | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2109 | supported except by flat and scratch instructions in GFX9-GFX10. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2110 |  | 
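The private address mapping given above can be written out directly. The following is a minimal sketch that takes the formula literally, so ``private_address`` is used exactly as it appears there; the function name and the example base address are hypothetical. It shows that lanes accessing the same private address land on adjacent dwords.

```python
def scratch_physical_address(wavefront_scratch_base: int,
                             private_address: int,
                             wavefront_lane_id: int,
                             wavefront_size: int) -> int:
    """Apply the private to physical mapping from the text above."""
    if not 0 <= wavefront_lane_id < wavefront_size:
        raise ValueError("lane id must be within the wavefront")
    return (wavefront_scratch_base
            + (private_address * wavefront_size * 4)
            + (wavefront_lane_id * 4))


if __name__ == "__main__":
    base = 0x1000  # hypothetical wavefront scratch base
    # The first four lanes accessing private address 0 touch adjacent dwords.
    for lane in range(4):
        print(lane, hex(scratch_physical_address(base, 0, lane, 64)))
```
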
|  | 2111 | The generic address space uses the hardware flat address support available in | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2112 | GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2113 | local apertures), that are outside the range of addressable global memory, to | 
|  | 2114 | map from a flat address to a private or local address. | 
|  | 2115 |  | 
|  | 2116 | FLAT instructions can take a flat address and access global, private (scratch) | 
|  | 2117 | and group (LDS) memory depending on whether the address is within one of the | 
|  | 2118 | aperture ranges. Flat access to scratch requires hardware aperture setup and | 
|  | 2119 | setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat | 
|  | 2120 | access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup | 
|  | 2121 | (see :ref:`amdgpu-amdhsa-m0`). | 
|  | 2122 |  | 
|  | 2123 | To convert between a segment address and a flat address the base addresses of the | 
|  | 2124 | apertures can be used. For GFX7-GFX8 these are available in the | 
|  | 2125 | :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with | 
|  | 2126 | Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2127 | GFX9-GFX10 the aperture base addresses are directly available as inline constant | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2128 | registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit | 
|  | 2129 | address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32 | 
|  | 2130 | which makes it easier to convert from flat to segment or segment to flat. | 
|  | 2131 |  | 
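Because the apertures are 2^32 bytes in size and aligned to 2^32 in 64 bit address mode, the conversion described above reduces to combining or masking the upper and lower 32 bits. The following is a minimal sketch under that assumption; the function names and the example aperture base are hypothetical. The NULL handling follows the values in the memory spaces table and is one plausible way to keep NULL pointers NULL across the conversion, not a description of what the compiler emits.

```python
APERTURE_SIZE = 1 << 32   # aperture size in 64 bit address mode
FLAT_NULL = 0x0000000000000000
LOCAL_NULL = 0xFFFFFFFF   # NULL value of the local memory space
PRIVATE_NULL = 0x00000000  # NULL value of the private memory space


def segment_to_flat(segment_address: int, aperture_base: int,
                    segment_null: int) -> int:
    """Convert a 32 bit private or local address to a 64 bit flat address."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if segment_address == segment_null:
        return FLAT_NULL
    # The base has zero low bits, so OR-ing is the same as adding.
    return aperture_base | (segment_address & 0xFFFFFFFF)


def flat_to_segment(flat_address: int, aperture_base: int,
                    segment_null: int) -> int:
    """Convert a flat address inside the given aperture to a segment address."""
    if flat_address == FLAT_NULL:
        return segment_null
    if flat_address >> 32 != aperture_base >> 32:
        raise ValueError("flat address is outside the aperture")
    return flat_address & 0xFFFFFFFF


if __name__ == "__main__":
    shared_base = 0x2000 << 32  # hypothetical local (LDS) aperture base
    flat = segment_to_flat(0x10, shared_base, LOCAL_NULL)
    print(hex(flat), hex(flat_to_segment(flat, shared_base, LOCAL_NULL)))
```
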
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 2132 | Image and Samplers | 
|  | 2133 | ~~~~~~~~~~~~~~~~~~ | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2134 |  | 
|  | 2135 | Image and sampler handles created by the ROCm runtime are 64 bit addresses of a | 
|  | 2136 | hardware 32 byte V# and 48 byte S# object respectively. In order to support the | 
|  | 2137 | HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG | 
|  | 2138 | enumeration values for the queries that are not trivially deducible from the S# | 
|  | 2139 | representation. | 
|  | 2140 |  | 
|  | 2141 | HSA Signals | 
|  | 2142 | ~~~~~~~~~~~ | 
|  | 2143 |  | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 2144 | HSA signal handles created by the ROCm runtime are 64 bit addresses of a | 
|  | 2145 | structure allocated in memory accessible from both the CPU and GPU. The | 
|  | 2146 | structure is defined by the ROCm runtime and subject to change between releases | 
|  | 2147 | (see [AMD-ROCm-github]_). | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2148 |  | 
|  | 2149 | .. _amdgpu-amdhsa-hsa-aql-queue: | 
|  | 2150 |  | 
|  | 2151 | HSA AQL Queue | 
|  | 2152 | ~~~~~~~~~~~~~ | 
|  | 2153 |  | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 2154 | The HSA AQL queue structure is defined by the ROCm runtime and subject to change | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2155 | between releases (see [AMD-ROCm-github]_). For some processors it contains | 
|  | 2156 | fields needed to implement certain language features such as the flat address | 
|  | 2157 | aperture bases. It also contains fields used by CP such as managing the | 
|  | 2158 | allocation of scratch memory. | 
|  | 2159 |  | 
|  | 2160 | .. _amdgpu-amdhsa-kernel-descriptor: | 
|  | 2161 |  | 
|  | 2162 | Kernel Descriptor | 
|  | 2163 | ~~~~~~~~~~~~~~~~~ | 
|  | 2164 |  | 
|  | 2165 | A kernel descriptor consists of the information needed by CP to initiate the | 
|  | 2166 | execution of a kernel, including the entry point address of the machine code | 
|  | 2167 | that implements the kernel. | 
|  | 2168 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2169 | Kernel Descriptor for GFX6-GFX10 | 
|  | 2170 | ++++++++++++++++++++++++++++++++ | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2171 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 2172 | CP microcode requires the kernel descriptor to be allocated with 64 byte |
|  | 2173 | alignment. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2174 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2175 | .. table:: Kernel Descriptor for GFX6-GFX10 | 
|  | 2176 | :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2177 |  | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 2178 | ======= ======= =============================== ============================ | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2179 | Bits    Size    Field Name                      Description | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 2180 | ======= ======= =============================== ============================ | 
| Konstantin Zhuravlyov | 00f2cb1 | 2018-06-12 18:02:46 +0000 | [diff] [blame] | 2181 | 31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2182 | address space memory | 
|  | 2183 | required for a work-group | 
|  | 2184 | in bytes. This does not | 
|  | 2185 | include any dynamically | 
|  | 2186 | allocated local address | 
|  | 2187 | space memory that may be | 
|  | 2188 | added when the kernel is | 
|  | 2189 | dispatched. | 
| Konstantin Zhuravlyov | 00f2cb1 | 2018-06-12 18:02:46 +0000 | [diff] [blame] | 2190 | 63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2191 | private address space | 
|  | 2192 | memory required for a | 
|  | 2193 | work-item in bytes. If | 
|  | 2194 | is_dynamic_callstack is 1 | 
|  | 2195 | then additional space must | 
|  | 2196 | be added to this value for | 
|  | 2197 | the call stack. | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 2198 | 127:64  8 bytes                                 Reserved, must be 0. | 
| Konstantin Zhuravlyov | 00f2cb1 | 2018-06-12 18:02:46 +0000 | [diff] [blame] | 2199 | 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2200 | negative) from base | 
|  | 2201 | address of kernel | 
|  | 2202 | descriptor to kernel's | 
|  | 2203 | entry point instruction | 
|  | 2204 | which must be 256 byte | 
|  | 2205 | aligned. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2206 | 351:192 20                                      Reserved, must be 0. |
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2207 | bytes | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2208 | 383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9 |
|  | 2209 | Reserved, must be 0. | 
|  | 2210 | GFX10 | 
|  | 2211 | Compute Shader (CS) | 
|  | 2212 | program settings used by | 
|  | 2213 | CP to set up | 
|  | 2214 | ``COMPUTE_PGM_RSRC3`` | 
|  | 2215 | configuration | 
|  | 2216 | register. See | 
|  | 2217 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`. | 
| Konstantin Zhuravlyov | 00f2cb1 | 2018-06-12 18:02:46 +0000 | [diff] [blame] | 2218 | 415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2219 | program settings used by | 
|  | 2220 | CP to set up | 
|  | 2221 | ``COMPUTE_PGM_RSRC1`` | 
|  | 2222 | configuration | 
|  | 2223 | register. See | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2224 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. | 
| Konstantin Zhuravlyov | 00f2cb1 | 2018-06-12 18:02:46 +0000 | [diff] [blame] | 2225 | 447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2226 | program settings used by | 
|  | 2227 | CP to set up | 
|  | 2228 | ``COMPUTE_PGM_RSRC2`` | 
|  | 2229 | configuration | 
|  | 2230 | register. See | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2231 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. | 
| Konstantin Zhuravlyov | 00f2cb1 | 2018-06-12 18:02:46 +0000 | [diff] [blame] | 2232 | 448     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the | 
|  | 2233 | _BUFFER                         SGPR user data registers | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2234 | (see | 
|  | 2235 | :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). | 
|  | 2236 |  | 
|  | 2237 | The total number of SGPR | 
|  | 2238 | user data registers | 
|  | 2239 | requested must not exceed | 
|  | 2240 | 16 and must match the value in |
|  | 2241 | ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``. | 
|  | 2242 | Any requests beyond 16 | 
|  | 2243 | will be ignored. | 
| Konstantin Zhuravlyov | 00f2cb1 | 2018-06-12 18:02:46 +0000 | [diff] [blame] | 2244 | 449     1 bit   ENABLE_SGPR_DISPATCH_PTR        *see above* | 
|  | 2245 | 450     1 bit   ENABLE_SGPR_QUEUE_PTR           *see above* | 
|  | 2246 | 451     1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above* | 
|  | 2247 | 452     1 bit   ENABLE_SGPR_DISPATCH_ID         *see above* | 
|  | 2248 | 453     1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   *see above* | 
|  | 2249 | 454     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     *see above* | 
|  | 2250 | _SIZE | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2251 | 457:455 3 bits                                  Reserved, must be 0. | 
|  | 2252 | 458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9 |
|  | 2253 | Reserved, must be 0. | 
|  | 2254 | GFX10 | 
|  | 2255 | - If 0 execute in | 
|  | 2256 | wavefront size 64 mode. | 
|  | 2257 | - If 1 execute in | 
|  | 2258 | native wavefront size | 
|  | 2259 | 32 mode. | 
|  | 2260 | 463:459 5 bits                                  Reserved, must be 0. | 
|  | 2261 | 511:464 6 bytes                                 Reserved, must be 0. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2262 | 512     **Total size 64 bytes.** | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 2263 | ======= ==================================================================== | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2264 |  | 
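The byte layout implied by the kernel descriptor table can be sketched with
Python's ``struct`` module. This is only an illustration under stated
assumptions: little-endian encoding, a 20 byte reserved region starting at byte
offset 24 (as the field sizes imply), and the helper and dictionary names
(``pack_kernel_descriptor``, ``ENABLE_SGPR_FLAGS``) are hypothetical and not
part of LLVM or the ROCm runtime.

```python
import struct

# Little-endian layout derived from the bit ranges and sizes in the table.
# 4 + 4 + 8 + 8 + 20 + 4 + 4 + 4 + 2 + 6 = 64 bytes.
KERNEL_DESCRIPTOR_FORMAT = "<IIQq20sIIIH6s"

# Bit positions inside the 16-bit flags field (descriptor bits 463:448).
# The key names are illustrative only.
ENABLE_SGPR_FLAGS = {
    "private_segment_buffer": 0,   # bit 448
    "dispatch_ptr": 1,             # bit 449
    "queue_ptr": 2,                # bit 450
    "kernarg_segment_ptr": 3,      # bit 451
    "dispatch_id": 4,              # bit 452
    "flat_scratch_init": 5,        # bit 453
    "private_segment_size": 6,     # bit 454
    "wavefront_size32": 10,        # bit 458, GFX10 only
}


def pack_kernel_descriptor(group_size, private_size, entry_offset,
                           rsrc1, rsrc2, rsrc3=0, flags=()):
    """Return the 64 raw bytes of a kernel descriptor (hypothetical helper)."""
    flag_bits = 0
    for name in flags:
        flag_bits |= 1 << ENABLE_SGPR_FLAGS[name]
    return struct.pack(
        KERNEL_DESCRIPTOR_FORMAT,
        group_size,     # bits 31:0, GROUP_SEGMENT_FIXED_SIZE
        private_size,   # bits 63:32, PRIVATE_SEGMENT_FIXED_SIZE
        0,              # bits 127:64, reserved
        entry_offset,   # bits 191:128, signed byte offset to the entry point
        bytes(20),      # 20 reserved bytes
        rsrc3,          # bits 383:352, COMPUTE_PGM_RSRC3
        rsrc1,          # bits 415:384, COMPUTE_PGM_RSRC1
        rsrc2,          # bits 447:416, COMPUTE_PGM_RSRC2
        flag_bits,      # bits 463:448, enable flags
        bytes(6),       # bits 511:464, reserved
    )


descriptor = pack_kernel_descriptor(
    group_size=1024, private_size=16, entry_offset=-256,
    rsrc1=0, rsrc2=0, flags=("dispatch_ptr", "kernarg_segment_ptr"))
print(len(descriptor))
```

Packing with an explicit ``<`` prefix disables native padding, so the size of
the result depends only on the field sizes listed in the table.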
|  | 2265 | .. | 
|  | 2266 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2267 | .. table:: compute_pgm_rsrc1 for GFX6-GFX10 | 
|  | 2268 | :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2269 |  | 
| Tony Tye | 3b34061 | 2017-06-07 00:46:08 +0000 | [diff] [blame] | 2270 | ======= ======= =============================== =========================================================================== | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2271 | Bits    Size    Field Name                      Description | 
| Tony Tye | 3b34061 | 2017-06-07 00:46:08 +0000 | [diff] [blame] | 2272 | ======= ======= =============================== =========================================================================== | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 2273 | 5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register | 
|  | 2274 | blocks used by each work-item; | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2275 | granularity is device | 
|  | 2276 | specific: | 
|  | 2277 |  | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 2278 | GFX6-GFX9 | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 2279 | - vgprs_used 0..256 | 
|  | 2280 | - max(0, ceil(vgprs_used / 4) - 1) | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2281 | GFX10 (wavefront size 64) | 
|  | 2282 | - max_vgpr 1..256 | 
|  | 2283 | - max(0, ceil(vgprs_used / 4) - 1) | 
|  | 2284 | GFX10 (wavefront size 32) | 
|  | 2285 | - max_vgpr 1..256 | 
|  | 2286 | - max(0, ceil(vgprs_used / 8) - 1) | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 2287 |  | 
|  | 2288 | Where vgprs_used is defined | 
|  | 2289 | as the highest VGPR number | 
|  | 2290 | explicitly referenced plus | 
|  | 2291 | one. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2292 |  | 
|  | 2293 | Used by CP to set up | 
|  | 2294 | ``COMPUTE_PGM_RSRC1.VGPRS``. | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 2295 |  | 
|  | 2296 | The | 
|  | 2297 | :ref:`amdgpu-assembler` | 
|  | 2298 | calculates this | 
|  | 2299 | automatically for the | 
|  | 2300 | selected processor from | 
|  | 2301 | values provided to the | 
|  | 2302 | `.amdhsa_kernel` directive | 
|  | 2303 | by the | 
|  | 2304 | `.amdhsa_next_free_vgpr` | 
|  | 2305 | nested directive (see | 
|  | 2306 | :ref:`amdhsa-kernel-directives-table`). | 
|  | 2307 | 9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register | 
|  | 2308 | blocks used by a wavefront; | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2309 | granularity is device | 
|  | 2310 | specific: | 
|  | 2311 |  | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 2312 | GFX6-GFX8 | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 2313 | - sgprs_used 0..112 | 
|  | 2314 | - max(0, ceil(sgprs_used / 8) - 1) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2315 | GFX9 | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 2316 | - sgprs_used 0..112 | 
|  | 2317 | - 2 * max(0, ceil(sgprs_used / 16) - 1) | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2318 | GFX10 | 
|  | 2319 | Reserved, must be 0. | 
|  | 2320 | (128 SGPRs always | 
|  | 2321 | allocated.) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2322 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 2323 | Where sgprs_used is | 
|  | 2324 | defined as the highest | 
|  | 2325 | SGPR number explicitly | 
|  | 2326 | referenced plus one, plus | 
|  | 2327 | a target-specific number | 
|  | 2328 | of additional special | 
|  | 2329 | SGPRs for VCC, | 
|  | 2330 | FLAT_SCRATCH (GFX7+) and | 
|  | 2331 | XNACK_MASK (GFX8+), and | 
|  | 2332 | any additional | 
|  | 2333 | target-specific | 
|  | 2334 | limitations. It does not | 
|  | 2335 | include the 16 SGPRs added | 
|  | 2336 | if a trap handler is | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2337 | enabled. | 
|  | 2338 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 2339 | The target-specific | 
|  | 2340 | limitations and special | 
|  | 2341 | SGPR layout are defined in | 
|  | 2342 | the hardware | 
|  | 2343 | documentation, which can | 
|  | 2344 | be found in the | 
|  | 2345 | :ref:`amdgpu-processors` | 
|  | 2346 | table. | 
|  | 2347 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2348 | Used by CP to set up | 
|  | 2349 | ``COMPUTE_PGM_RSRC1.SGPRS``. | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 2350 |  | 
|  | 2351 | The | 
|  | 2352 | :ref:`amdgpu-assembler` | 
|  | 2353 | calculates this | 
|  | 2354 | automatically for the | 
|  | 2355 | selected processor from | 
|  | 2356 | values provided to the | 
|  | 2357 | `.amdhsa_kernel` directive | 
|  | 2358 | by the | 
|  | 2359 | `.amdhsa_next_free_sgpr` | 
|  | 2360 | and `.amdhsa_reserve_*` | 
|  | 2361 | nested directives (see | 
|  | 2362 | :ref:`amdhsa-kernel-directives-table`). | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2363 | 11:10   2 bits  PRIORITY                        Must be 0. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2364 |  | 
|  | 2365 | Start executing wavefront | 
|  | 2366 | at the specified priority. | 
|  | 2367 |  | 
|  | 2368 | CP is responsible for | 
|  | 2369 | filling in | 
|  | 2370 | ``COMPUTE_PGM_RSRC1.PRIORITY``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2371 | 13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2372 | with specified rounding | 
|  | 2373 | mode for single |
|  | 2374 | precision (32 bit) |
|  | 2375 | floating point |
|  | 2376 | operations. | 
|  | 2377 |  | 
|  | 2378 | Floating point rounding | 
|  | 2379 | mode values are defined in | 
|  | 2380 | :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. | 
|  | 2381 |  | 
|  | 2382 | Used by CP to set up | 
|  | 2383 | ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2384 | 15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2385 | with specified rounding | 
|  | 2386 | mode for half/double |
|  | 2387 | precision (16 and 64 bit) |
|  | 2388 | floating point |
|  | 2389 | operations. | 
|  | 2390 |  | 
|  | 2391 | Floating point rounding | 
|  | 2392 | mode values are defined in | 
|  | 2393 | :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. | 
|  | 2394 |  | 
|  | 2395 | Used by CP to set up | 
|  | 2396 | ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2397 | 17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2398 | with specified denorm mode | 
|  | 2399 | for single |
|  | 2400 | precision (32 bit) |
|  | 2401 | floating point |
|  | 2402 | operations. | 
|  | 2403 |  | 
|  | 2404 | Floating point denorm mode | 
|  | 2405 | values are defined in | 
|  | 2406 | :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. | 
|  | 2407 |  | 
|  | 2408 | Used by CP to set up | 
|  | 2409 | ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2410 | 19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2411 | with specified denorm mode | 
|  | 2412 | for half/double |
|  | 2413 | precision (16 and 64 bit) |
|  | 2414 | floating point |
|  | 2415 | operations. | 
|  | 2416 |  | 
|  | 2417 | Floating point denorm mode | 
|  | 2418 | values are defined in | 
|  | 2419 | :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. | 
|  | 2420 |  | 
|  | 2421 | Used by CP to set up | 
|  | 2422 | ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2423 | 20      1 bit   PRIV                            Must be 0. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2424 |  | 
|  | 2425 | Start executing wavefront | 
|  | 2426 | in privilege trap handler | 
|  | 2427 | mode. | 
|  | 2428 |  | 
|  | 2429 | CP is responsible for | 
|  | 2430 | filling in | 
|  | 2431 | ``COMPUTE_PGM_RSRC1.PRIV``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2432 | 21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2433 | with DX10 clamp mode | 
|  | 2434 | enabled. Used by the vector | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 2435 | ALU to force DX10 style | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2436 | treatment of NaNs (when |
|  | 2437 | set, clamp NaN to zero, | 
|  | 2438 | otherwise pass NaN | 
|  | 2439 | through). | 
|  | 2440 |  | 
|  | 2441 | Used by CP to set up | 
|  | 2442 | ``COMPUTE_PGM_RSRC1.DX10_CLAMP``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2443 | 22      1 bit   DEBUG_MODE                      Must be 0. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2444 |  | 
|  | 2445 | Start executing wavefront | 
|  | 2446 | in single step mode. | 
|  | 2447 |  | 
|  | 2448 | CP is responsible for | 
|  | 2449 | filling in | 
|  | 2450 | ``COMPUTE_PGM_RSRC1.DEBUG_MODE``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2451 | 23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2452 | with IEEE mode | 
|  | 2453 | enabled. Floating point | 
|  | 2454 | opcodes that support | 
|  | 2455 | exception flag gathering | 
|  | 2456 | will quiet and propagate | 
|  | 2457 | signaling-NaN inputs per | 
|  | 2458 | IEEE 754-2008. Min_dx10 and | 
|  | 2459 | max_dx10 become IEEE | 
|  | 2460 | 754-2008 compliant due to | 
|  | 2461 | signaling-NaN propagation | 
|  | 2462 | and quieting. | 
|  | 2463 |  | 
|  | 2464 | Used by CP to set up | 
|  | 2465 | ``COMPUTE_PGM_RSRC1.IEEE_MODE``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2466 | 24      1 bit   BULKY                           Must be 0. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2467 |  | 
|  | 2468 | Only one work-group allowed | 
|  | 2469 | to execute on a compute | 
|  | 2470 | unit. | 
|  | 2471 |  | 
|  | 2472 | CP is responsible for | 
|  | 2473 | filling in | 
|  | 2474 | ``COMPUTE_PGM_RSRC1.BULKY``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2475 | 25      1 bit   CDBG_USER                       Must be 0. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2476 |  | 
|  | 2477 | Flag that can be used to | 
|  | 2478 | control debugging code. | 
|  | 2479 |  | 
|  | 2480 | CP is responsible for | 
|  | 2481 | filling in | 
|  | 2482 | ``COMPUTE_PGM_RSRC1.CDBG_USER``. | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 2483 | 26      1 bit   FP16_OVFL                       GFX6-GFX8 | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 2484 | Reserved, must be 0. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2485 | GFX9-GFX10 | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 2486 | Wavefront starts execution | 
|  | 2487 | with specified fp16 overflow | 
|  | 2488 | mode. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2489 |  | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 2490 | - If 0, fp16 overflow generates | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2491 | +/-INF values. | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 2492 | - If 1, fp16 overflow that is the | 
|  | 2493 | result of an +/-INF input value | 
|  | 2494 | or divide by 0 produces a +/-INF, | 
|  | 2495 | otherwise clamps computed | 
|  | 2496 | overflow to +/-MAX_FP16 as | 
|  | 2497 | appropriate. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2498 |  | 
|  | 2499 | Used by CP to set up | 
|  | 2500 | ``COMPUTE_PGM_RSRC1.FP16_OVFL``. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2501 | 28:27   2 bits                                  Reserved, must be 0. | 
|  | 2502 | 29      1 bit    WGP_MODE                       GFX6-GFX9 | 
|  | 2503 | Reserved, must be 0. | 
|  | 2504 | GFX10 | 
|  | 2505 | - If 0 execute work-groups in | 
|  | 2506 | CU wavefront execution mode. | 
|  | 2507 | - If 1 execute work-groups in |
|  | 2508 | WGP wavefront execution mode. |
|  | 2509 |  | 
|  | 2510 | See :ref:`amdgpu-amdhsa-memory-model`. | 
|  | 2511 |  | 
|  | 2512 | Used by CP to set up | 
|  | 2513 | ``COMPUTE_PGM_RSRC1.WGP_MODE``. | 
|  | 2514 | 30      1 bit    MEM_ORDERED                    GFX6-GFX9 |
|  | 2515 | Reserved, must be 0. | 
|  | 2516 | GFX10 | 
|  | 2517 | Controls the behavior of the | 
|  | 2518 | waitcnt's vmcnt and vscnt | 
|  | 2519 | counters. | 
|  | 2520 |  | 
|  | 2521 | - If 0 vmcnt reports completion | 
|  | 2522 | of load and atomic with return | 
|  | 2523 | out of order with sample | 
|  | 2524 | instructions, and the vscnt | 
|  | 2525 | reports the completion of | 
|  | 2526 | store and atomic without | 
|  | 2527 | return in order. | 
|  | 2528 | - If 1 vmcnt reports completion | 
|  | 2529 | of load, atomic with return | 
|  | 2530 | and sample instructions in | 
|  | 2531 | order, and the vscnt reports | 
|  | 2532 | the completion of store and | 
|  | 2533 | atomic without return in order. | 
|  | 2534 |  | 
|  | 2535 | Used by CP to set up | 
|  | 2536 | ``COMPUTE_PGM_RSRC1.MEM_ORDERED``. | 
|  | 2537 | 31      1 bit    FWD_PROGRESS                   GFX6-GFX9 |
|  | 2538 | Reserved, must be 0. | 
|  | 2539 | GFX10 | 
|  | 2540 | - If 0 execute SIMD wavefronts | 
|  | 2541 | using oldest first policy. | 
|  | 2542 | - If 1 execute SIMD wavefronts to | 
|  | 2543 | ensure wavefronts will make some | 
|  | 2544 | forward progress. | 
|  | 2545 |  | 
|  | 2546 | Used by CP to set up | 
|  | 2547 | ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2548 | 32      **Total size 4 bytes.** |
| Tony Tye | 3b34061 | 2017-06-07 00:46:08 +0000 | [diff] [blame] | 2549 | ======= =================================================================================================================== | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2550 |  | 
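The granulated register count formulas in the ``compute_pgm_rsrc1`` table can
be worked through as follows. This is a minimal sketch, not the assembler's
implementation: the function names and the ``family`` strings are hypothetical,
and ``ceil`` is taken to be integer ceiling division.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def granulated_vgpr_count(vgprs_used, granule=4):
    """GRANULATED_WORKITEM_VGPR_COUNT: max(0, ceil(vgprs_used / granule) - 1).

    Per the table, granule is 4 for GFX6-GFX9 and GFX10 wavefront size 64,
    and 8 for GFX10 wavefront size 32.
    """
    return max(0, ceil_div(vgprs_used, granule) - 1)


def granulated_sgpr_count(sgprs_used, family):
    """GRANULATED_WAVEFRONT_SGPR_COUNT for an illustrative family string."""
    if family in ("gfx6", "gfx7", "gfx8"):
        return max(0, ceil_div(sgprs_used, 8) - 1)
    if family == "gfx9":
        return 2 * max(0, ceil_div(sgprs_used, 16) - 1)
    if family == "gfx10":
        return 0  # reserved, must be 0 (128 SGPRs always allocated)
    raise ValueError(f"unknown family: {family}")


def pack_register_counts(vgpr_blocks, sgpr_blocks):
    """Place the two counts in bits 5:0 and 9:6 of compute_pgm_rsrc1."""
    if not 0 <= vgpr_blocks < 64 or not 0 <= sgpr_blocks < 16:
        raise ValueError("granulated count does not fit its bit field")
    return vgpr_blocks | (sgpr_blocks << 6)


vgpr_blocks = granulated_vgpr_count(24)          # 24 VGPRs, granule of 4
sgpr_blocks = granulated_sgpr_count(20, "gfx8")  # 20 SGPRs on GFX8
print(vgpr_blocks, sgpr_blocks, hex(pack_register_counts(vgpr_blocks, sgpr_blocks)))
```

Note that ``sgprs_used`` is expected to already include the additional special
SGPRs described in the table; the sketch does not add them.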
|  | 2551 | .. | 
|  | 2552 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2553 | .. table:: compute_pgm_rsrc2 for GFX6-GFX10 | 
|  | 2554 | :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2555 |  | 
| Tony Tye | 3b34061 | 2017-06-07 00:46:08 +0000 | [diff] [blame] | 2556 | ======= ======= =============================== =========================================================================== | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2557 | Bits    Size    Field Name                      Description | 
| Tony Tye | 3b34061 | 2017-06-07 00:46:08 +0000 | [diff] [blame] | 2558 | ======= ======= =============================== =========================================================================== | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2559 | 0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 2560 | _WAVEFRONT_OFFSET               SGPR wavefront scratch offset | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2561 | system register (see | 
|  | 2562 | :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). | 
|  | 2563 |  | 
|  | 2564 | Used by CP to set up | 
|  | 2565 | ``COMPUTE_PGM_RSRC2.SCRATCH_EN``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2566 | 5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2567 | user data registers | 
|  | 2568 | requested. This number must | 
|  | 2569 | match the number of user | 
|  | 2570 | data registers enabled. | 
|  | 2571 |  | 
|  | 2572 | Used by CP to set up | 
|  | 2573 | ``COMPUTE_PGM_RSRC2.USER_SGPR``. | 
| Konstantin Zhuravlyov | 2ca6b1f | 2018-05-29 19:09:13 +0000 | [diff] [blame] | 2574 | 6       1 bit   ENABLE_TRAP_HANDLER             Must be 0. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2575 |  | 
| Konstantin Zhuravlyov | 2ca6b1f | 2018-05-29 19:09:13 +0000 | [diff] [blame] | 2576 | This bit represents | 
|  | 2577 | ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``, | 
|  | 2578 | which is set by the CP if | 
|  | 2579 | the runtime has installed a | 
|  | 2580 | trap handler. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2581 | 7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2582 | system SGPR register for | 
|  | 2583 | the work-group id in the X | 
|  | 2584 | dimension (see | 
|  | 2585 | :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). | 
|  | 2586 |  | 
|  | 2587 | Used by CP to set up | 
|  | 2588 | ``COMPUTE_PGM_RSRC2.TGID_X_EN``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2589 | 8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2590 | system SGPR register for | 
|  | 2591 | the work-group id in the Y | 
|  | 2592 | dimension (see | 
|  | 2593 | :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). | 
|  | 2594 |  | 
|  | 2595 | Used by CP to set up | 
|  | 2596 | ``COMPUTE_PGM_RSRC2.TGID_Y_EN``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2597 | 9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2598 | system SGPR register for | 
|  | 2599 | the work-group id in the Z | 
|  | 2600 | dimension (see | 
|  | 2601 | :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). | 
|  | 2602 |  | 
|  | 2603 | Used by CP to set up | 
|  | 2604 | ``COMPUTE_PGM_RSRC2.TGID_Z_EN``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2605 | 10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2606 | system SGPR register for | 
|  | 2607 | work-group information (see | 
|  | 2608 | :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). | 
|  | 2609 |  | 
|  | 2610 | Used by CP to set up | 
|  | 2611 | ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2612 | 12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2613 | VGPR system registers used | 
|  | 2614 | for the work-item ID. | 
|  | 2615 | :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table` | 
|  | 2616 | defines the values. | 
|  | 2617 |  | 
|  | 2618 | Used by CP to set up | 
|  | 2619 | ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2620 | 13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2621 |  | 
|  | 2622 | Wavefront starts execution | 
|  | 2623 | with address watch | 
|  | 2624 | exceptions enabled which | 
|  | 2625 | are generated when L1 has | 
|  | 2626 | witnessed a thread access | 
|  | 2627 | an *address of | 
|  | 2628 | interest*. | 
|  | 2629 |  | 
|  | 2630 | CP is responsible for | 
|  | 2631 | filling in the address | 
|  | 2632 | watch bit in | 
|  | 2633 | ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` | 
|  | 2634 | according to what the | 
|  | 2635 | runtime requests. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2636 | 14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2637 |  | 
|  | 2638 | Wavefront starts execution | 
|  | 2639 | with memory violation | 
|  | 2640 | exceptions |
|  | 2641 | enabled which are generated | 
|  | 2642 | when a memory violation has | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 2643 | occurred for this wavefront from | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2644 | L1 or LDS | 
|  | 2645 | (write-to-read-only-memory, | 
|  | 2646 | mis-aligned atomic, LDS | 
|  | 2647 | address out of range, | 
|  | 2648 | illegal address, etc.). | 
|  | 2649 |  | 
|  | 2650 | CP sets the memory | 
|  | 2651 | violation bit in | 
|  | 2652 | ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` | 
|  | 2653 | according to what the | 
|  | 2654 | runtime requests. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2655 | 23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2656 |  | 
|  | 2657 | CP uses the rounded value | 
|  | 2658 | from the dispatch packet, | 
|  | 2659 | not this value, as the | 
|  | 2660 | dispatch may contain | 
|  | 2661 | dynamically allocated group | 
|  | 2662 | segment memory. CP writes | 
|  | 2663 | directly to | 
|  | 2664 | ``COMPUTE_PGM_RSRC2.LDS_SIZE``. | 
|  | 2665 |  | 
|  | 2666 | Amount of group segment | 
|  | 2667 | (LDS) to allocate for each | 
|  | 2668 | work-group. Granularity is | 
|  | 2669 | device specific: | 
|  | 2670 |  | 
|  | 2671 | GFX6: | 
|  | 2672 | roundup(lds-size / (64 * 4)) | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2673 | GFX7-GFX10: | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2674 | roundup(lds-size / (128 * 4)) | 
|  | 2675 |  | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2676 | 24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution | 
|  | 2677 | _INVALID_OPERATION              with specified exceptions | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2678 | enabled. | 
|  | 2679 |  | 
|  | 2680 | Used by CP to set up | 
|  | 2681 | ``COMPUTE_PGM_RSRC2.EXCP_EN`` | 
|  | 2682 | (set from bits 0..6). | 
|  | 2683 |  | 
|  | 2684 | IEEE 754 FP Invalid | 
|  | 2685 | Operation | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2686 | 25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more |
|  | 2687 | _SOURCE                         input operands is a | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2688 | denormal number | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2689 | 26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by | 
|  | 2690 | _DIVISION_BY_ZERO               Zero | 
|  | 2691 | 27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow |
|  | 2692 | _OVERFLOW | 
|  | 2693 | 28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow | 
|  | 2694 | _UNDERFLOW | 
|  | 2695 | 29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact | 
|  | 2696 | _INEXACT | 
|  | 2697 | 30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero | 
|  | 2698 | _ZERO                           (rcp_iflag_f32 instruction | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2699 | only) | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 2700 | 31      1 bit                                   Reserved, must be 0. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2701 | 32      **Total size 4 bytes.** | 
| Tony Tye | 3b34061 | 2017-06-07 00:46:08 +0000 | [diff] [blame] | 2702 | ======= =================================================================================================================== | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2703 |  | 
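The bit assignments in the ``compute_pgm_rsrc2`` table can be turned into a
small decoder, shown here as a sketch. The dictionary keys are lower-cased
field names chosen for illustration, the seven exception enable bits (30:24)
are grouped into one entry, and ``granulated_lds_size`` assumes that
``lds-size`` is given in bytes and that ``roundup`` means ceiling.

```python
# (lowest bit, width) for each field of compute_pgm_rsrc2.
RSRC2_FIELDS = {
    "enable_sgpr_private_segment_wavefront_offset": (0, 1),
    "user_sgpr_count": (1, 5),
    "enable_trap_handler": (6, 1),
    "enable_sgpr_workgroup_id_x": (7, 1),
    "enable_sgpr_workgroup_id_y": (8, 1),
    "enable_sgpr_workgroup_id_z": (9, 1),
    "enable_sgpr_workgroup_info": (10, 1),
    "enable_vgpr_workitem_id": (11, 2),
    "enable_exception_address_watch": (13, 1),
    "enable_exception_memory": (14, 1),
    "granulated_lds_size": (15, 9),
    "exception_enables": (24, 7),  # bits 30:24, one bit per exception
}


def decode_rsrc2(value):
    """Split a 32-bit compute_pgm_rsrc2 value into its named fields."""
    return {
        name: (value >> low) & ((1 << width) - 1)
        for name, (low, width) in RSRC2_FIELDS.items()
    }


def granulated_lds_size(lds_bytes, family):
    """roundup(lds-size / (64 * 4)) on GFX6, roundup(lds-size / (128 * 4)) later."""
    granule = 64 * 4 if family == "gfx6" else 128 * 4
    return -(-lds_bytes // granule)


# Scratch offset enabled, 6 user SGPRs, work-group id X, work-item id value 2.
fields = decode_rsrc2(1 | (6 << 1) | (1 << 7) | (2 << 11))
print(fields["user_sgpr_count"], fields["enable_vgpr_workitem_id"])
```

As the table states, the kernel descriptor itself must carry 0 in
``GRANULATED_LDS_SIZE``; the helper only illustrates the granularity that CP
applies to the size taken from the dispatch packet.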
|  | 2704 | .. | 
|  | 2705 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2706 | .. table:: compute_pgm_rsrc3 for GFX10 | 
|  | 2707 | :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table | 
|  | 2708 |  | 
|  | 2709 | ======= ======= =============================== =========================================================================== | 
|  | 2710 | Bits    Size    Field Name                      Description | 
|  | 2711 | ======= ======= =============================== =========================================================================== | 
|  | 2712 | 3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120. | 
|  | 2713 | compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64. | 
|  | 2714 | 31:4    28                                      Reserved, must be 0. | 
|  | 2715 | bits | 
|  | 2716 | 32      **Total size 4 bytes.** | 
|  | 2717 | ======= =================================================================================================================== | 
|  | 2718 |  | 
|  | 2719 | .. | 
|  | 2720 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2721 | .. table:: Floating Point Rounding Mode Enumeration Values | 
|  | 2722 | :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table | 
|  | 2723 |  | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2724 | ====================================== ===== ============================== | 
|  | 2725 | Enumeration Name                       Value Description | 
|  | 2726 | ====================================== ===== ============================== | 
| Konstantin Zhuravlyov | 00f2cb1 | 2018-06-12 18:02:46 +0000 | [diff] [blame] | 2727 | FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even | 
|  | 2728 | FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity | 
|  | 2729 | FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity | 
|  | 2730 | FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0 | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2731 | ====================================== ===== ============================== | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2732 |  | 
|  | 2733 | .. | 
|  | 2734 |  | 
|  | 2735 | .. table:: Floating Point Denorm Mode Enumeration Values | 
|  | 2736 | :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table | 
|  | 2737 |  | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2738 | ====================================== ===== ============================== | 
|  | 2739 | Enumeration Name                       Value Description | 
|  | 2740 | ====================================== ===== ============================== | 
| Konstantin Zhuravlyov | 00f2cb1 | 2018-06-12 18:02:46 +0000 | [diff] [blame] | 2741 | FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2742 | Denorms | 
| Konstantin Zhuravlyov | 00f2cb1 | 2018-06-12 18:02:46 +0000 | [diff] [blame] | 2743 | FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms | 
|  | 2744 | FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms | 
|  | 2745 | FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2746 | ====================================== ===== ============================== | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2747 |  | 
|  | 2748 | .. | 
|  | 2749 |  | 
|  | 2750 | .. table:: System VGPR Work-Item ID Enumeration Values | 
|  | 2751 | :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table | 
|  | 2752 |  | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2753 | ======================================== ===== ============================ | 
|  | 2754 | Enumeration Name                         Value Description | 
|  | 2755 | ======================================== ===== ============================ | 
| Konstantin Zhuravlyov | 00f2cb1 | 2018-06-12 18:02:46 +0000 | [diff] [blame] | 2756 | SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2757 | ID. | 
| Konstantin Zhuravlyov | 00f2cb1 | 2018-06-12 18:02:46 +0000 | [diff] [blame] | 2758 | SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2759 | dimensions ID. | 
| Konstantin Zhuravlyov | 00f2cb1 | 2018-06-12 18:02:46 +0000 | [diff] [blame] | 2760 | SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2761 | dimensions ID. | 
| Konstantin Zhuravlyov | 00f2cb1 | 2018-06-12 18:02:46 +0000 | [diff] [blame] | 2762 | SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined. | 
| Konstantin Zhuravlyov | 13376a4 | 2017-10-14 19:17:08 +0000 | [diff] [blame] | 2763 | ======================================== ===== ============================ | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2764 |  | 
|  | 2765 | .. _amdgpu-amdhsa-initial-kernel-execution-state: | 
|  | 2766 |  | 
|  | 2767 | Initial Kernel Execution State | 
|  | 2768 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  | 2769 |  | 
|  | 2770 | This section defines the register state that will be set up by the packet | 
|  | 2771 | processor prior to the start of execution of every wavefront. This is limited by | 
|  | 2772 | the constraints of the hardware controllers of CP/ADC/SPI. | 
|  | 2773 |  | 
|  | 2774 | The order of the SGPR registers is defined, but the compiler can specify which | 
|  | 2775 | ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit | 
|  | 2776 | fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used | 
|  | 2777 | for enabled registers are dense starting at SGPR0: the first enabled register is | 
|  | 2778 | SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have | 
|  | 2779 | an SGPR number. | 
|  | 2780 |  | 
|  | 2781 | The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 2782 | all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2783 | the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually | 
|  | 2784 | initialized. These are then immediately followed by the System SGPRs that are | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 2785 | set up by ADC/SPI and can have different values for each wavefront of the grid | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2786 | dispatch. | 
|  | 2787 |  | 
|  | 2788 | SGPR register initial state is defined in | 
|  | 2789 | :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. | 
|  | 2790 |  | 
|  | 2791 | .. table:: SGPR Register Set Up Order | 
|  | 2792 | :name: amdgpu-amdhsa-sgpr-register-set-up-order-table | 
|  | 2793 |  | 
|  | 2794 | ========== ========================== ====== ============================== | 
|  | 2795 | SGPR Order Name                       Number Description | 
|  | 2796 | (kernel descriptor enable  of | 
|  | 2797 | field)                     SGPRs | 
|  | 2798 | ========== ========================== ====== ============================== | 
|  | 2799 | First      Private Segment Buffer     4      V# that can be used, together | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 2800 | (enable_sgpr_private              with Scratch Wavefront Offset | 
|  | 2801 | _segment_buffer)                  as an offset, to access the | 
|  | 2802 | private memory space using a | 
|  | 2803 | segment address. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2804 |  | 
|  | 2805 | CP uses the value provided by | 
|  | 2806 | the runtime. | 
|  | 2807 | then       Dispatch Ptr               2      64 bit address of AQL dispatch | 
|  | 2808 | (enable_sgpr_dispatch_ptr)        packet for kernel dispatch | 
|  | 2809 | actually executing. | 
|  | 2810 | then       Queue Ptr                  2      64 bit address of amd_queue_t | 
|  | 2811 | (enable_sgpr_queue_ptr)           object for AQL queue on which | 
|  | 2812 | the dispatch packet was | 
|  | 2813 | queued. | 
|  | 2814 | then       Kernarg Segment Ptr        2      64 bit address of Kernarg | 
|  | 2815 | (enable_sgpr_kernarg              segment. This is directly | 
|  | 2816 | _segment_ptr)                     copied from the | 
|  | 2817 | kernarg_address in the kernel | 
|  | 2818 | dispatch packet. | 
|  | 2819 |  | 
|  | 2820 | Having CP load it once avoids | 
|  | 2821 | loading it at the beginning of | 
|  | 2822 | every wavefront. | 
|  | 2823 | then       Dispatch Id                2      64 bit Dispatch ID of the | 
|  | 2824 | (enable_sgpr_dispatch_id)         dispatch packet being | 
|  | 2825 | executed. | 
|  | 2826 | then       Flat Scratch Init          2      This is 2 SGPRs: | 
|  | 2827 | (enable_sgpr_flat_scratch | 
|  | 2828 | _init)                            GFX6 | 
|  | 2829 | Not supported. | 
|  | 2830 | GFX7-GFX8 | 
|  | 2831 | The first SGPR is a 32 bit | 
|  | 2832 | byte offset from | 
|  | 2833 | ``SH_HIDDEN_PRIVATE_BASE_VIMID`` | 
|  | 2834 | to per SPI base of memory | 
|  | 2835 | for scratch for the queue | 
|  | 2836 | executing the kernel | 
|  | 2837 | dispatch. CP obtains this | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 2838 | from the runtime. (The | 
|  | 2839 | Scratch Segment Buffer base | 
|  | 2840 | address is | 
|  | 2841 | ``SH_HIDDEN_PRIVATE_BASE_VIMID`` | 
|  | 2842 | plus this offset.) The value | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 2843 | of Scratch Wavefront Offset must | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 2844 | be added to this offset by | 
|  | 2845 | the kernel machine code, | 
|  | 2846 | right shifted by 8, and | 
|  | 2847 | moved to the FLAT_SCRATCH_HI | 
|  | 2848 | SGPR register. | 
|  | 2849 | FLAT_SCRATCH_HI corresponds | 
|  | 2850 | to SGPRn-4 on GFX7, and | 
|  | 2851 | SGPRn-6 on GFX8 (where SGPRn | 
|  | 2852 | is the highest numbered SGPR | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 2853 | allocated to the wavefront). | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 2854 | FLAT_SCRATCH_HI is | 
|  | 2855 | multiplied by 256 (as it is | 
|  | 2856 | in units of 256 bytes) and | 
|  | 2857 | added to | 
|  | 2858 | ``SH_HIDDEN_PRIVATE_BASE_VIMID`` | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 2859 | to calculate the per wavefront | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 2860 | FLAT SCRATCH BASE in flat | 
|  | 2861 | memory instructions that | 
|  | 2862 | access the scratch | 
|  | 2863 | aperture. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2864 |  | 
|  | 2865 | The second SGPR is 32 bit | 
|  | 2866 | byte size of a single | 
| Konstantin Zhuravlyov | ea35e46 | 2017-10-19 17:12:55 +0000 | [diff] [blame] | 2867 | work-item's scratch memory | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 2868 | usage. CP obtains this from | 
|  | 2869 | the runtime, and it is | 
|  | 2870 | always a multiple of DWORD. | 
|  | 2871 | CP checks that the value in | 
|  | 2872 | the kernel dispatch packet | 
|  | 2873 | Private Segment Byte Size is | 
|  | 2874 | not larger, and requests the | 
|  | 2875 | runtime to increase the | 
|  | 2876 | queue's scratch size if | 
|  | 2877 | necessary. The kernel code | 
|  | 2878 | must move it to | 
|  | 2879 | FLAT_SCRATCH_LO which is | 
|  | 2880 | SGPRn-3 on GFX7 and SGPRn-5 | 
|  | 2881 | on GFX8. FLAT_SCRATCH_LO is | 
|  | 2882 | used as the FLAT SCRATCH | 
|  | 2883 | SIZE in flat memory | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2884 | instructions. Having CP load | 
|  | 2885 | it once avoids loading it at | 
|  | 2886 | the beginning of every | 
| Tony Tye | f59d071 | 2017-11-10 20:51:43 +0000 | [diff] [blame] | 2887 | wavefront. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2888 | GFX9-GFX10 | 
| Tony Tye | f59d071 | 2017-11-10 20:51:43 +0000 | [diff] [blame] | 2889 | This is the | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 2890 | 64 bit base address of the | 
|  | 2891 | per SPI scratch backing | 
|  | 2892 | memory managed by SPI for | 
|  | 2893 | the queue executing the | 
|  | 2894 | kernel dispatch. CP obtains | 
|  | 2895 | this from the runtime (and | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2896 | divides it if there are | 
|  | 2897 | multiple Shader Arrays each | 
|  | 2898 | with its own SPI). The value | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 2899 | of Scratch Wavefront Offset must | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2900 | be added by the kernel | 
| Tony Tye | 46d3576 | 2017-08-15 20:47:41 +0000 | [diff] [blame] | 2901 | machine code and the result | 
|  | 2902 | moved to the FLAT_SCRATCH | 
|  | 2903 | SGPR which is SGPRn-6 and | 
|  | 2904 | SGPRn-5. It is used as the | 
|  | 2905 | FLAT SCRATCH BASE in flat | 
| Tony Tye | f59d071 | 2017-11-10 20:51:43 +0000 | [diff] [blame] | 2906 | memory instructions. | 
|  | 2907 | then       Private Segment Size       1      The 32 bit byte size of a | 
|  | 2908 | (enable_sgpr_private              single | 
|  | 2909 | _segment_size)                    work-item's | 
|  | 2910 | scratch memory | 
|  | 2911 | allocation. This is the | 
|  | 2912 | value from the kernel | 
|  | 2913 | dispatch packet Private | 
|  | 2914 | Segment Byte Size rounded up | 
|  | 2915 | by CP to a multiple of | 
|  | 2916 | DWORD. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2917 |  | 
|  | 2918 | Having CP load it once avoids | 
|  | 2919 | loading it at the beginning of | 
|  | 2920 | every wavefront. | 
|  | 2921 |  | 
|  | 2922 | This is not used for | 
|  | 2923 | GFX7-GFX8 since it is the same | 
|  | 2924 | value as the second SGPR of | 
|  | 2925 | Flat Scratch Init. However, it | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 2926 | may be needed for GFX9-GFX10 which | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2927 | changes the meaning of the | 
|  | 2928 | Flat Scratch Init value. | 
|  | 2929 | then       Grid Work-Group Count X    1      32 bit count of the number of | 
|  | 2930 | (enable_sgpr_grid                 work-groups in the X dimension | 
|  | 2931 | _workgroup_count_X)               for the grid being | 
|  | 2932 | executed. Computed from the | 
|  | 2933 | fields in the kernel dispatch | 
|  | 2934 | packet as ((grid_size.x + | 
|  | 2935 | workgroup_size.x - 1) / | 
|  | 2936 | workgroup_size.x). | 
|  | 2937 | then       Grid Work-Group Count Y    1      32 bit count of the number of | 
|  | 2938 | (enable_sgpr_grid                 work-groups in the Y dimension | 
|  | 2939 | _workgroup_count_Y &&             for the grid being | 
|  | 2940 | less than 16 previous             executed. Computed from the | 
|  | 2941 | SGPRs)                            fields in the kernel dispatch | 
|  | 2942 | packet as ((grid_size.y + | 
|  | 2943 | workgroup_size.y - 1) / | 
|  | 2944 | workgroup_size.y). | 
|  | 2945 |  | 
|  | 2946 | Only initialized if <16 | 
|  | 2947 | previous SGPRs initialized. | 
|  | 2948 | then       Grid Work-Group Count Z    1      32 bit count of the number of | 
|  | 2949 | (enable_sgpr_grid                 work-groups in the Z dimension | 
|  | 2950 | _workgroup_count_Z &&             for the grid being | 
|  | 2951 | less than 16 previous             executed. Computed from the | 
|  | 2952 | SGPRs)                            fields in the kernel dispatch | 
|  | 2953 | packet as ((grid_size.z + | 
|  | 2954 | workgroup_size.z - 1) / | 
|  | 2955 | workgroup_size.z). | 
|  | 2956 |  | 
|  | 2957 | Only initialized if <16 | 
|  | 2958 | previous SGPRs initialized. | 
|  | 2959 | then       Work-Group Id X            1      32 bit work-group id in X | 
|  | 2960 | (enable_sgpr_workgroup_id         dimension of grid for | 
|  | 2961 | _X)                               wavefront. | 
|  | 2962 | then       Work-Group Id Y            1      32 bit work-group id in Y | 
|  | 2963 | (enable_sgpr_workgroup_id         dimension of grid for | 
|  | 2964 | _Y)                               wavefront. | 
|  | 2965 | then       Work-Group Id Z            1      32 bit work-group id in Z | 
|  | 2966 | (enable_sgpr_workgroup_id         dimension of grid for | 
|  | 2967 | _Z)                               wavefront. | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 2968 | then       Work-Group Info            1      {first_wavefront, 14'b0000, | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2969 | (enable_sgpr_workgroup            ordered_append_term[10:0], | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 2970 | _info)                            threadgroup_size_in_wavefronts[5:0]} | 
|  | 2971 | then       Scratch Wavefront Offset   1      32 bit byte offset from base | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2972 | (enable_sgpr_private              of scratch base of queue | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 2973 | _segment_wavefront_offset)        executing the kernel | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 2974 | dispatch. Must be used as an | 
|  | 2975 | offset with Private | 
|  | 2976 | segment address when using | 
|  | 2977 | Scratch Segment Buffer. It | 
|  | 2978 | must be used to set up FLAT | 
|  | 2979 | SCRATCH for flat addressing | 
|  | 2980 | (see | 
|  | 2981 | :ref:`amdgpu-amdhsa-flat-scratch`). | 
|  | 2982 | ========== ========================== ====== ============================== | 
|  | 2983 |  | 
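The dense numbering described above can be sketched in Python as follows. This is an illustrative model only: the field names mirror the ``enable_sgpr_*`` entries of the table with the prefix dropped, the function names are hypothetical, and the handling of the 16 User SGPR limit (a user field is skipped if it would not fit in the first 16 registers) is an assumption drawn from the "only initialized if <16 previous SGPRs" notes.

```python
# Illustrative model of the SGPR set-up order; names are hypothetical and
# derived from the table rows, not from an LLVM API.
USER_SGPRS = [
    ("private_segment_buffer", 4),
    ("dispatch_ptr", 2),
    ("queue_ptr", 2),
    ("kernarg_segment_ptr", 2),
    ("dispatch_id", 2),
    ("flat_scratch_init", 2),
    ("private_segment_size", 1),
    ("grid_workgroup_count_x", 1),
    ("grid_workgroup_count_y", 1),
    ("grid_workgroup_count_z", 1),
]
SYSTEM_SGPRS = [
    ("workgroup_id_x", 1),
    ("workgroup_id_y", 1),
    ("workgroup_id_z", 1),
    ("workgroup_info", 1),
    ("private_segment_wavefront_offset", 1),
]
MAX_USER_SGPRS = 16


def assign_sgprs(enabled):
    """Map each enabled field to (first SGPR number, number of SGPRs)."""
    layout = {}
    next_reg = 0
    for name, size in USER_SGPRS:
        # Assumption: a user field that would exceed 16 User SGPRs is not
        # initialized and therefore gets no register number.
        if name in enabled and next_reg + size <= MAX_USER_SGPRS:
            layout[name] = (next_reg, size)
            next_reg += size
    # System SGPRs immediately follow the initialized User SGPRs.
    for name, size in SYSTEM_SGPRS:
        if name in enabled:
            layout[name] = (next_reg, size)
            next_reg += size
    return layout


def workgroup_count(grid_size, workgroup_size):
    """Round-up division used for the Grid Work-Group Count fields."""
    return (grid_size + workgroup_size - 1) // workgroup_size
```

For instance, enabling only the private segment buffer, the kernarg segment pointer, work-group id X and the scratch wavefront offset places them at SGPR0-3, SGPR4-5, SGPR6 and SGPR7 respectively under this model.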
|  | 2984 | The order of the VGPR registers is defined, but the compiler can specify which | 
|  | 2985 | ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit | 
|  | 2986 | fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used | 
|  | 2987 | for enabled registers are dense starting at VGPR0: the first enabled register is | 
|  | 2988 | VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a | 
|  | 2989 | VGPR number. | 
|  | 2990 |  | 
|  | 2991 | VGPR register initial state is defined in | 
|  | 2992 | :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`. | 
|  | 2993 |  | 
|  | 2994 | .. table:: VGPR Register Set Up Order | 
|  | 2995 | :name: amdgpu-amdhsa-vgpr-register-set-up-order-table | 
|  | 2996 |  | 
|  | 2997 | ========== ========================== ====== ============================== | 
|  | 2998 | VGPR Order Name                       Number Description | 
|  | 2999 | (kernel descriptor enable  of | 
|  | 3000 | field)                     VGPRs | 
|  | 3001 | ========== ========================== ====== ============================== | 
|  | 3002 | First      Work-Item Id X             1      32 bit work item id in X | 
|  | 3003 | (Always initialized)              dimension of work-group for | 
|  | 3004 | wavefront lane. | 
|  | 3005 | then       Work-Item Id Y             1      32 bit work item id in Y | 
|  | 3006 | (enable_vgpr_workitem_id          dimension of work-group for | 
|  | 3007 | > 0)                              wavefront lane. | 
|  | 3008 | then       Work-Item Id Z             1      32 bit work item id in Z | 
|  | 3009 | (enable_vgpr_workitem_id          dimension of work-group for | 
|  | 3010 | > 1)                              wavefront lane. | 
|  | 3011 | ========== ========================== ====== ============================== | 
|  | 3012 |  | 
| Hiroshi Inoue | bcadfee | 2018-04-12 05:53:20 +0000 | [diff] [blame] | 3013 | The setting of registers is done by GPU CP/ADC/SPI hardware as follows: | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3014 |  | 
|  | 3015 | 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data | 
|  | 3016 | registers. | 
|  | 3017 | 2. Work-group Id registers X, Y, Z are set by ADC which supports any | 
|  | 3018 | combination including none. | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 3019 | 3. Scratch Wavefront Offset is set by SPI on a per wavefront basis, which is why | 
|  | 3020 | its value cannot be included with the flat scratch init value, which is per queue. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3021 | 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y) | 
|  | 3022 | or (X, Y, Z). | 
|  | 3023 |  | 
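The VGPR side of the set-up can be modeled the same way. The sketch below, written in Python with a hypothetical function name, maps the ``enable_vgpr_workitem_id`` value to the VGPR numbers that hold the work-item ids, following the (X), (X, Y) or (X, Y, Z) combinations listed above. Rejecting the value 3 is an assumption of this example, since the enumeration table describes it as undefined.

```python
def workitem_id_vgprs(enable_vgpr_workitem_id):
    """Return {work-item id name: VGPR number} for an enable field value."""
    # Assumption: only 0, 1 and 2 are meaningful; 3 is listed as undefined.
    if enable_vgpr_workitem_id not in (0, 1, 2):
        raise ValueError("enable_vgpr_workitem_id must be 0, 1 or 2")
    names = ["workitem_id_x"]  # X is always initialized.
    if enable_vgpr_workitem_id > 0:
        names.append("workitem_id_y")
    if enable_vgpr_workitem_id > 1:
        names.append("workitem_id_z")
    # Enabled registers are numbered densely starting at VGPR0.
    return {name: number for number, name in enumerate(names)}
```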
|  | 3024 | The Flat Scratch register pair are adjacent SGPRs so they can be moved as a 64 bit | 
|  | 3025 | value to the hardware required SGPRn-3 and SGPRn-4 respectively. | 
|  | 3026 |  | 
|  | 3027 | The global segment can be accessed either using buffer instructions (GFX6 which | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3028 | has V# 64 bit address support), flat instructions (GFX7-GFX10), or global | 
|  | 3029 | instructions (GFX9-GFX10). | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3030 |  | 
|  | 3031 | If buffer operations are used then the compiler can generate a V# with the | 
|  | 3032 | following properties: | 
|  | 3033 |  | 
|  | 3034 | * base address of 0 | 
|  | 3035 | * no swizzle | 
|  | 3036 | * ATC: 1 if IOMMU present (such as APU) | 
|  | 3037 | * ptr64: 1 | 
|  | 3038 | * MTYPE set to support memory coherence that matches the runtime (such as CC for | 
|  | 3039 | APU and NC for dGPU). | 
|  | 3040 |  | 
|  | 3041 | .. _amdgpu-amdhsa-kernel-prolog: | 
|  | 3042 |  | 
|  | 3043 | Kernel Prolog | 
|  | 3044 | ~~~~~~~~~~~~~ | 
|  | 3045 |  | 
|  | 3046 | .. _amdgpu-amdhsa-m0: | 
|  | 3047 |  | 
|  | 3048 | M0 | 
|  | 3049 | ++ | 
|  | 3050 |  | 
|  | 3051 | GFX6-GFX8 | 
|  | 3052 | The M0 register must be initialized with a value at least the total LDS size | 
|  | 3053 | if the kernel may access LDS via DS or flat operations. Total LDS size is | 
|  | 3054 | available in the dispatch packet. For M0, it is also possible to use the maximum | 
|  | 3055 | possible value of LDS for the given target (0x7FFF for GFX6 and 0xFFFF for | 
|  | 3056 | GFX7-GFX8). | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3057 | GFX9-GFX10 | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3058 | The M0 register is not used for range checking LDS accesses and so does not | 
|  | 3059 | need to be initialized in the prolog. | 
|  | 3060 |  | 
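A minimal Python sketch of the M0 choice described above is given here for illustration. The function ``m0_init_value`` and its parameters are hypothetical. It returns the maximum value for the target when no LDS size is supplied, and ``None`` for GFX9-GFX10 where no initialization is needed.

```python
def m0_init_value(gfx_major, total_lds_size=None):
    """Pick a value for M0 in the prolog, or None if M0 need not be set."""
    if gfx_major not in (6, 7, 8, 9, 10):
        raise ValueError("expected a GFX major version from 6 to 10")
    if gfx_major >= 9:
        # GFX9-GFX10: M0 is not used for LDS range checking.
        return None
    max_value = 0x7FFF if gfx_major == 6 else 0xFFFF
    if total_lds_size is None:
        # Using the maximum possible value is always acceptable.
        return max_value
    if not 0 <= total_lds_size <= max_value:
        raise ValueError("LDS size out of range for this target")
    # Any value at least the total LDS size is sufficient.
    return total_lds_size
```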
|  | 3061 | .. _amdgpu-amdhsa-flat-scratch: | 
|  | 3062 |  | 
|  | 3063 | Flat Scratch | 
|  | 3064 | ++++++++++++ | 
|  | 3065 |  | 
|  | 3066 | If the kernel may use flat operations to access scratch memory, the prolog code | 
|  | 3067 | must set up the FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 3068 | are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3069 | Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`): | 
|  | 3070 |  | 
|  | 3071 | GFX6 | 
|  | 3072 | Flat scratch is not supported. | 
|  | 3073 |  | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 3074 | GFX7-GFX8 | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3075 | 1. The low word of Flat Scratch Init is a 32 bit byte offset from | 
|  | 3076 | ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory | 
|  | 3077 | being managed by SPI for the queue executing the kernel dispatch. This is | 
|  | 3078 | the same value used in the Scratch Segment Buffer V# base address. The | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 3079 | prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3080 | scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since | 
|  | 3081 | FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right shifted | 
|  | 3082 | by 8 before moving into FLAT_SCRATCH_HI. | 
|  | 3083 | 2. The second word of Flat Scratch Init is the 32 bit byte size of a single | 
|  | 3084 | work-item's scratch memory usage. This is directly loaded from the kernel | 
|  | 3085 | dispatch packet Private Segment Byte Size and rounded up to a multiple of | 
|  | 3086 | DWORD. Having CP load it once avoids loading it at the beginning of every | 
|  | 3087 | wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH | 
|  | 3088 | SIZE. | 
| Tony Tye | f59d071 | 2017-11-10 20:51:43 +0000 | [diff] [blame] | 3089 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3090 | GFX9-GFX10 | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3091 | The Flat Scratch Init is the 64 bit address of the base of scratch backing | 
|  | 3092 | memory being managed by SPI for the queue executing the kernel dispatch. The | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 3093 | prolog must add the value of Scratch Wavefront Offset and move the result to the FLAT_SCRATCH | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3094 | pair for use as the flat scratch base in flat memory instructions. | 
|  | 3095 |  | 
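The arithmetic performed by the prolog can be sketched in Python as below. This is a model of the computation, not the machine code the backend emits. The function and parameter names are hypothetical. For GFX7-GFX8 it follows the register assignment given in the SGPR Register Set Up Order table, where the shifted offset goes to FLAT_SCRATCH_HI and the size goes to FLAT_SCRATCH_LO. Wrapping the additions to 32 and 64 bits is an assumption of this example.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def flat_scratch_gfx7_gfx8(init_first, init_second, wavefront_offset):
    """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for GFX7-GFX8.

    init_first:  first SGPR of Flat Scratch Init (byte offset from
                 SH_HIDDEN_PRIVATE_BASE_VIMID).
    init_second: second SGPR of Flat Scratch Init (per work-item byte size).
    """
    # Add the per wavefront offset, then convert bytes to 256 byte units.
    offset = (init_first + wavefront_offset) & MASK32  # assumed 32 bit wrap
    flat_scratch_hi = offset >> 8
    # The size is moved unchanged for use as FLAT SCRATCH SIZE.
    flat_scratch_lo = init_second & MASK32
    return flat_scratch_lo, flat_scratch_hi


def flat_scratch_gfx9_gfx10(init_first, init_second, wavefront_offset):
    """Return the FLAT_SCRATCH pair as (low word, high word) for GFX9-GFX10."""
    # Flat Scratch Init is a 64 bit base address held in two adjacent SGPRs.
    base = ((init_second & MASK32) << 32) | (init_first & MASK32)
    base = (base + wavefront_offset) & MASK64  # assumed 64 bit wrap
    return base & MASK32, base >> 32
```

With a GFX7-GFX8 offset of 0x10000, a work-item size of 64 and a wavefront offset of 0x300, the model yields FLAT_SCRATCH_HI = 0x103 and FLAT_SCRATCH_LO = 64.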
|  | 3096 | .. _amdgpu-amdhsa-memory-model: | 
|  | 3097 |  | 
|  | 3098 | Memory Model | 
|  | 3099 | ~~~~~~~~~~~~ | 
|  | 3100 |  | 
|  | 3101 | This section describes the mapping of the LLVM memory model onto AMDGPU machine code | 
|  | 3102 | (see :ref:`memmodel`). *The implementation is WIP.* | 
|  | 3103 |  | 
|  | 3104 | .. TODO | 
|  | 3105 | Update when implementation complete. | 
|  | 3106 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3107 | The AMDGPU backend supports the memory synchronization scopes specified in | 
|  | 3108 | :ref:`amdgpu-memory-scopes`. | 
|  | 3109 |  | 
|  | 3110 | The code sequences used to implement the memory model are defined in table | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3111 | :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table`. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3112 |  | 
|  | 3113 | The sequences specify the order of instructions that a single thread must | 
|  | 3114 | execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect | 
|  | 3115 | to other memory instructions executed by the same thread. This allows them to be | 
|  | 3116 | moved earlier or later which can allow them to be combined with other instances | 
|  | 3117 | of the same instruction, or hoisted/sunk out of loops to improve | 
|  | 3118 | performance. Only the instructions related to the memory model are given; | 
|  | 3119 | additional ``s_waitcnt`` instructions are required to ensure registers are | 
|  | 3120 | defined before being used. These may be able to be combined with the memory | 
|  | 3121 | model ``s_waitcnt`` instructions as described above. | 
|  | 3122 |  | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3123 | The AMDGPU backend supports the following memory models: | 
|  | 3124 |  | 
|  | 3125 | HSA Memory Model [HSA]_ | 
|  | 3126 | The HSA memory model uses a single happens-before relation for all address | 
|  | 3127 | spaces (see :ref:`amdgpu-address-spaces`). | 
|  | 3128 | OpenCL Memory Model [OpenCL]_ | 
|  | 3129 | The OpenCL memory model has separate happens-before relations for the | 
|  | 3130 | global and local address spaces. Only a fence specifying both global and | 
|  | 3131 | local address space, and seq_cst instructions join the relationships. Since | 
|  | 3132 | the LLVM ``memfence`` instruction does not allow an address space to be | 
|  | 3133 | specified the OpenCL fence has to conservatively assume both local and | 
|  | 3134 | global address space was specified. However, optimizations can often be | 
|  | 3135 | done to eliminate the additional ``s_waitcnt`` instructions when there are | 
|  | 3136 | no intervening memory instructions which access the corresponding address | 
|  | 3137 | space. The code sequences in the table indicate what can be omitted for the | 
|  | 3138 | OpenCL memory model. The target triple environment is used to determine if the | 
|  | 3139 | source language is OpenCL (see :ref:`amdgpu-opencl`). | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3140 |  | 
|  | 3141 | ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS | 
|  | 3142 | operations. | 
|  | 3143 |  | 
|  | 3144 | ``buffer/global/flat_load/store/atomic`` instructions to global memory are | 
|  | 3145 | termed vector memory operations. | 
|  | 3146 |  | 
|  | 3147 | For GFX6-GFX9: | 
|  | 3148 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3149 | * Each agent has multiple shader arrays (SA). | 
|  | 3150 | * Each SA has multiple compute units (CU). | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3151 | * Each CU has multiple SIMDs that execute wavefronts. | 
|  | 3152 | * The wavefronts for a single work-group are executed in the same CU but may be | 
|  | 3153 | executed by different SIMDs. | 
|  | 3154 | * Each CU has a single LDS memory shared by the wavefronts of the work-groups | 
|  | 3155 | executing on it. | 
|  | 3156 | * All LDS operations of a CU are performed as wavefront wide operations in a | 
|  | 3157 | global order and involve no caching. Completion is reported to a wavefront in | 
|  | 3158 | execution order. | 
|  | 3159 | * The LDS memory has multiple request queues shared by the SIMDs of a | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 3160 | CU. Therefore, the LDS operations performed by different wavefronts of a work-group | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3161 | can be reordered relative to each other, which can result in reordering the | 
|  | 3162 | visibility of vector memory operations with respect to LDS operations of other | 
|  | 3163 | wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to | 
| Sylvestre Ledru | e3fdbae | 2017-06-26 02:45:39 +0000 | [diff] [blame] | 3164 | ensure synchronization between LDS operations and vector memory operations | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 3165 | between wavefronts of a work-group, but not between operations performed by the | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3166 | same wavefront. | 
|  | 3167 | * The vector memory operations are performed as wavefront wide operations and | 
|  | 3168 | completion is reported to a wavefront in execution order. The exception is | 
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 3169 | that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3170 | vector memory order if they access LDS memory, and out of LDS operation order | 
|  | 3171 | if they access global memory. | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3172 | * The vector memory operations access a single vector L1 cache shared by all | 
|  | 3173 | SIMDs of a CU. Therefore, no special action is required for coherence between the | 
|  | 3174 | lanes of a single wavefront, or for coherence between wavefronts in the same | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 3175 | work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3176 | executing in different work-groups as they may be executing on different CUs. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3177 | * The scalar memory operations access a scalar L1 cache shared by all wavefronts | 
|  | 3178 | on a group of CUs. The scalar and vector L1 caches are not coherent. However, | 
|  | 3179 | scalar operations are used in a restricted way so do not impact the memory | 
|  | 3180 | model. See :ref:`amdgpu-amdhsa-memory-spaces`. | 
|  | 3181 | * The vector and scalar memory operations use an L2 cache shared by all CUs on | 
|  | 3182 | the same agent. | 
|  | 3183 | * The L2 cache has independent channels to service disjoint ranges of virtual | 
|  | 3184 | addresses. | 
|  | 3185 | * Each CU has a separate request queue per channel. Therefore, the vector and | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 3186 | scalar memory operations performed by wavefronts executing in different work-groups | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3187 | (which may be executing on different CUs) of an agent can be reordered | 
|  | 3188 | relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure | 
| Sylvestre Ledru | e3fdbae | 2017-06-26 02:45:39 +0000 | [diff] [blame] | 3189 | synchronization between vector memory operations of different CUs. It ensures a | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3190 | previous vector memory operation has completed before executing a subsequent | 
|  | 3191 | vector memory or LDS operation and so can be used to meet the requirements of | 
|  | 3192 | acquire and release. | 
|  | 3193 | * The L2 cache can be kept coherent with other agents on some targets, or ranges | 
|  | 3194 | of virtual addresses can be set up to bypass it to ensure system coherence. | 
|  | 3195 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3196 | For GFX10: | 
|  | 3197 |  | 
|  | 3198 | * Each agent has multiple shader arrays (SA). | 
|  | 3199 | * Each SA has multiple work-group processors (WGP). | 
|  | 3200 | * Each WGP has multiple compute units (CU). | 
|  | 3201 | * Each CU has multiple SIMDs that execute wavefronts. | 
|  | 3202 | * The wavefronts for a single work-group are executed in the same | 
|  | 3203 | WGP. In CU wavefront execution mode the wavefronts may be executed by | 
|  | 3204 | different SIMDs in the same CU. In WGP wavefront execution mode the | 
|  | 3205 | wavefronts may be executed by different SIMDs in different CUs in the same | 
|  | 3206 | WGP. | 
|  | 3207 | * Each WGP has a single LDS memory shared by the wavefronts of the work-groups | 
|  | 3208 | executing on it. | 
|  | 3209 | * All LDS operations of a WGP are performed as wavefront wide operations in a | 
|  | 3210 | global order and involve no caching. Completion is reported to a wavefront in | 
|  | 3211 | execution order. | 
|  | 3212 | * The LDS memory has multiple request queues shared by the SIMDs of a | 
|  | 3213 | WGP. Therefore, the LDS operations performed by different wavefronts of a work-group | 
|  | 3214 | can be reordered relative to each other, which can result in reordering the | 
|  | 3215 | visibility of vector memory operations with respect to LDS operations of other | 
|  | 3216 | wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to | 
|  | 3217 | ensure synchronization between LDS operations and vector memory operations | 
|  | 3218 | between wavefronts of a work-group, but not between operations performed by the | 
|  | 3219 | same wavefront. | 
|  | 3220 | * The vector memory operations are performed as wavefront wide operations. | 
|  | 3221 | Completion of load/store/sample operations is reported to a wavefront in | 
|  | 3222 | execution order of other load/store/sample operations performed by that | 
|  | 3223 | wavefront. | 
|  | 3224 | * The vector memory operations access a vector L0 cache. There is a single L0 | 
|  | 3225 | cache per CU. Each SIMD of a CU accesses the same L0 cache. | 
|  | 3226 | Therefore, no special action is required for coherence between the lanes of a | 
|  | 3227 | single wavefront. However, a ``BUFFER_GL0_INV`` is required for coherence | 
|  | 3228 | between wavefronts executing in the same work-group as they may be executing on | 
|  | 3229 | SIMDs of different CUs that access different L0s. A ``BUFFER_GL0_INV`` is also | 
|  | 3230 | required for coherence between wavefronts executing in different work-groups as | 
|  | 3231 | they may be executing on different WGPs. | 
|  | 3232 | * The scalar memory operations access a scalar L0 cache shared by all wavefronts | 
|  | 3233 | on a WGP. The scalar and vector L0 caches are not coherent. However, scalar | 
|  | 3234 | operations are used in a restricted way so do not impact the memory model. See | 
|  | 3235 | :ref:`amdgpu-amdhsa-memory-spaces`. | 
|  | 3236 | * The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on | 
|  | 3237 | the same SA. Therefore, no special action is required for coherence between | 
|  | 3238 | the wavefronts of a single work-group. However, a ``BUFFER_GL1_INV`` is | 
|  | 3239 | required for coherence between wavefronts executing in different work-groups as | 
|  | 3240 | they may be executing on different SAs that access different L1s. | 
|  | 3241 | * The L1 caches have independent quadrants to service disjoint ranges of virtual | 
|  | 3242 | addresses. | 
|  | 3243 | * Each L0 cache has a separate request queue per L1 quadrant. Therefore, the | 
|  | 3244 | vector and scalar memory operations performed by different wavefronts, whether | 
|  | 3245 | executing in the same or different work-groups (which may be executing on | 
|  | 3246 | different CUs accessing different L0s), can be reordered relative to each | 
|  | 3247 | other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure synchronization | 
|  | 3248 | between vector memory operations of different wavefronts. It ensures a previous | 
|  | 3249 | vector memory operation has completed before executing a subsequent vector | 
|  | 3250 | memory or LDS operation and so can be used to meet the requirements of acquire, | 
|  | 3251 | release and sequential consistency. | 
|  | 3252 | * The L1 caches use an L2 cache shared by all SAs on the same agent. | 
|  | 3253 | * The L2 cache has independent channels to service disjoint ranges of virtual | 
|  | 3254 | addresses. | 
|  | 3255 | * Each L1 quadrant of a single SA accesses a different L2 channel. Each L1 | 
|  | 3256 | quadrant has a separate request queue per L2 channel. Therefore, the vector | 
|  | 3257 | and scalar memory operations performed by wavefronts executing in different | 
|  | 3258 | work-groups (which may be executing on different SAs) of an agent can be | 
|  | 3259 | reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is | 
|  | 3260 | required to ensure synchronization between vector memory operations of | 
|  | 3261 | different SAs. It ensures a previous vector memory operation has completed | 
|  | 3262 | before executing a subsequent vector memory operation and so can be used to meet the | 
|  | 3263 | requirements of acquire, release and sequential consistency. | 
|  | 3264 | * The L2 cache can be kept coherent with other agents on some targets, or ranges | 
|  | 3265 | of virtual addresses can be set up to bypass it to ensure system coherence. | 
|  | 3266 |  | 
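The cache hierarchy described in the two lists above determines which vector cache invalidates an acquire needs at each synchronization scope. The following is a minimal sketch of that decision, not the backend's implementation: the function name ``acquire_invalidates`` and its parameters are hypothetical, and only the instruction mnemonics come from this document.

```python
# Illustrative model only. The function and parameter names are hypothetical
# and are not part of LLVM; only the mnemonics appear in this document.
NARROW_SCOPES = ("singlethread", "wavefront")


def acquire_invalidates(gfx10, scope, wgp_mode=False):
    """Return the vector cache invalidates an acquire needs for global memory.

    gfx10    -- True for GFX10, False for GFX6-GFX9.
    scope    -- LLVM synchronization scope name.
    wgp_mode -- True for WGP wavefront execution mode (GFX10 only).
    """
    if scope in NARROW_SCOPES:
        # The lanes of one wavefront always share the same cache.
        return []
    if scope == "workgroup":
        # GFX6-GFX9: a work-group runs on one CU with a single vector L1.
        # GFX10: only WGP mode spreads a work-group across two L0 caches.
        return ["buffer_gl0_inv"] if (gfx10 and wgp_mode) else []
    if scope in ("agent", "system"):
        if gfx10:
            # Other work-groups may sit behind a different L0 and L1.
            return ["buffer_gl0_inv", "buffer_gl1_inv"]
        return ["buffer_wbinvl1_vol"]
    raise ValueError(f"unknown scope: {scope}")
```

For example, ``acquire_invalidates(True, "workgroup", wgp_mode=True)`` yields only the L0 invalidate, because both CUs of a WGP share the same L1.
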
| Tony Tye | 07d9f10 | 2017-11-10 01:00:54 +0000 | [diff] [blame] | 3267 | Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8), | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3268 | or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread is accessing the | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3269 | memory, atomic memory orderings are not meaningful and all accesses are treated | 
|  | 3270 | as non-atomic. | 
|  | 3271 |  | 
|  | 3272 | Constant address space uses ``buffer/global_load`` instructions (or equivalent | 
|  | 3273 | scalar memory instructions). Since the constant address space contents do not | 
|  | 3274 | change during the execution of a kernel dispatch it is not legal to perform | 
|  | 3275 | stores, and atomic memory orderings are not meaningful and all accesses are | 
|  | 3276 | treated as non-atomic. | 
|  | 3277 |  | 
|  | 3278 | A memory synchronization scope wider than work-group is not meaningful for the | 
|  | 3279 | group (LDS) address space and is treated as work-group. | 
|  | 3280 |  | 
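The scope rule for the group (LDS) address space can be sketched as a simple clamp. This is an illustration under stated assumptions: the helper ``effective_scope`` and the string names are hypothetical, chosen to mirror the scope and address space names used in the code sequence table.

```python
# Hypothetical helper; scopes are listed from narrowest to widest.
SCOPES = ["singlethread", "wavefront", "workgroup", "agent", "system"]


def effective_scope(address_space, scope):
    """Clamp the synchronization scope for local (LDS) accesses."""
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope}")
    wider_than_workgroup = SCOPES.index(scope) > SCOPES.index("workgroup")
    if address_space == "local" and wider_than_workgroup:
        # LDS is only visible within a work-group, so a wider scope
        # adds nothing and is treated as work-group.
        return "workgroup"
    return scope
```
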
|  | 3281 | The memory model does not support the region address space which is treated as | 
|  | 3282 | non-atomic. | 
|  | 3283 |  | 
|  | 3284 | Acquire memory ordering is not meaningful on store atomic instructions and is | 
|  | 3285 | treated as non-atomic. | 
|  | 3286 |  | 
|  | 3287 | Release memory ordering is not meaningful on load atomic instructions and is | 
|  | 3288 | treated as non-atomic. | 
|  | 3289 |  | 
|  | 3290 | Acquire-release memory ordering is not meaningful on load or store atomic | 
|  | 3291 | instructions and is treated as acquire and release respectively. | 
|  | 3292 |  | 
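The three ordering rules above amount to a small normalization step applied before a code sequence is selected. The sketch below restates them; ``effective_ordering`` is a hypothetical name, and the ordering strings follow the LLVM IR spelling (``acq_rel`` for acquire-release).

```python
# Hypothetical helper that restates the three rules above.
def effective_ordering(instr, ordering):
    """Map a requested memory ordering to the one actually honored."""
    if instr == "load atomic":
        if ordering == "release":
            return "non-atomic"  # release is not meaningful on a load
        if ordering == "acq_rel":
            return "acquire"     # only the acquire half applies
    elif instr == "store atomic":
        if ordering == "acquire":
            return "non-atomic"  # acquire is not meaningful on a store
        if ordering == "acq_rel":
            return "release"     # only the release half applies
    return ordering
```

An ``atomicrmw`` both reads and writes, so every ordering passes through unchanged for it.
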
|  | 3293 | AMDGPU backend only uses scalar memory operations to access memory that is | 
|  | 3294 | proven to not change during the execution of the kernel dispatch. This includes | 
|  | 3295 | constant address space and global address space for program scope const | 
|  | 3296 | variables. Therefore the kernel machine code does not have to maintain the | 
|  | 3297 | scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar | 
|  | 3298 | and vector L1 caches are invalidated between kernel dispatches by CP since | 
|  | 3299 | constant address space data may change between kernel dispatch executions. See | 
|  | 3300 | :ref:`amdgpu-amdhsa-memory-spaces`. | 
|  | 3301 |  | 
| Sylvestre Ledru | e3fdbae | 2017-06-26 02:45:39 +0000 | [diff] [blame] | 3302 | The one exception is if scalar writes are used to spill SGPR registers. In this | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3303 | case the AMDGPU backend ensures the memory location used to spill is never | 
|  | 3304 | accessed by vector memory operations at the same time. If scalar writes are used | 
|  | 3305 | then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function | 
|  | 3306 | return since the locations may be used for vector memory instructions by a | 
| Tony Tye | 5bbcca6 | 2018-03-08 05:46:01 +0000 | [diff] [blame] | 3307 | future wavefront that uses the same scratch area, or a function call that creates a | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3308 | frame at the same address, respectively. There is no need for a ``s_dcache_inv`` | 
|  | 3309 | as all scalar writes are write-before-read in the same thread. | 
|  | 3310 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3311 | For GFX6-GFX9, scratch backing memory (which is used for the private address space) | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3312 | is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private | 
|  | 3313 | address space is only accessed by a single thread, and is always | 
|  | 3314 | write-before-read, there is never a need to invalidate these entries from the L1 | 
|  | 3315 | cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the | 
|  | 3316 | volatile cache lines. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3317 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3318 | For GFX10, scratch backing memory (which is used for the private address space) | 
|  | 3319 | is accessed with MTYPE NC (non-coherent). Since the private address space is | 
|  | 3320 | only accessed by a single thread, and is always write-before-read, there is | 
|  | 3321 | never a need to invalidate these entries from the L0 or L1 caches. | 
|  | 3322 |  | 
|  | 3323 | For GFX10, wavefronts are executed in native mode with in-order reporting of loads | 
|  | 3324 | and sample instructions. In this mode vmcnt reports completion of load, atomic | 
|  | 3325 | with return and sample instructions in order, and the vscnt reports the | 
|  | 3326 | completion of store and atomic without return in order. See ``MEM_ORDERED`` field | 
|  | 3327 | in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. | 
|  | 3328 |  | 
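The split between the two GFX10 counters can be summarized as a lookup from operation kind to the counter that tracks it. This is a sketch only: ``gfx10_counter`` and its arguments are hypothetical names, while the counter names are those used in the paragraph above.

```python
# Hypothetical helper: which GFX10 counter tracks a vector memory operation.
def gfx10_counter(op, returns_value=True):
    """Return "vmcnt" or "vscnt" for a GFX10 vector memory operation.

    op            -- one of "load", "sample", "store", "atomic".
    returns_value -- only consulted for atomics.
    """
    if op in ("load", "sample"):
        return "vmcnt"
    if op == "store":
        return "vscnt"
    if op == "atomic":
        # Atomics with a return value complete like loads; atomics without
        # a return value complete like stores.
        return "vmcnt" if returns_value else "vscnt"
    raise ValueError(f"unknown operation: {op}")
```

This is why some GFX10 sequences in the table below wait on ``vmcnt`` or ``vscnt`` depending on whether the atomic returns a value.
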
|  | 3329 | In GFX10, wavefronts can be executed in WGP or CU wavefront execution mode: | 
|  | 3330 |  | 
|  | 3331 | * In WGP wavefront execution mode the wavefronts of a work-group are executed | 
|  | 3332 | on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per | 
|  | 3333 | CU L0 caches is required for work-group synchronization. Also accesses to L1 at | 
|  | 3334 | work-group scope need to be explicitly ordered as the accesses from different | 
|  | 3335 | CUs are not ordered. | 
|  | 3336 | * In CU wavefront execution mode the wavefronts of a work-group are executed on | 
|  | 3337 | the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by | 
|  | 3338 | the work-group access the same L0 which in turn ensures L1 accesses are | 
|  | 3339 | ordered and so do not require explicit management of the caches for | 
|  | 3340 | work-group synchronization. | 
|  | 3341 |  | 
|  | 3342 | See ``WGP_MODE`` field in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` | 
|  | 3343 | and :ref:`amdgpu-target-features`. | 
|  | 3344 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3345 | On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3346 | to invalidate the L2 cache. For GFX6-GFX9, this also causes it to be treated as | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3347 | non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3348 | (cache coherent) and so the L2 cache will be coherent with the CPU and other | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3349 | agents. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3350 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3351 | .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX10 | 
|  | 3352 | :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3353 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3354 | ============ ============ ============== ========== =============================== ================================== | 
|  | 3355 | LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code             AMDGPU Machine Code | 
|  | 3356 | Ordering     Sync Scope     Address    GFX6-9                          GFX10 | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3357 | Space | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3358 | ============ ============ ============== ========== =============================== ================================== | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3359 | **Non-Atomic** | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3360 | ---------------------------------------------------------------------------------------------------------------------- | 
|  | 3361 | load         *none*       *none*         - global   - !volatile & !nontemporal      - !volatile & !nontemporal | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3362 | - generic | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3363 | - private    1. buffer/global/flat_load      1. buffer/global/flat_load | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3364 | - constant | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3365 | - volatile & !nontemporal       - volatile & !nontemporal | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3366 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3367 | 1. buffer/global/flat_load      1. buffer/global/flat_load | 
|  | 3368 | glc=1                           glc=1 dlc=1 | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3369 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3370 | - nontemporal                   - nontemporal | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3371 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3372 | 1. buffer/global/flat_load      1. buffer/global/flat_load | 
|  | 3373 | glc=1 slc=1                     slc=1 | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3374 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3375 | load         *none*       *none*         - local    1. ds_load                      1. ds_load | 
|  | 3376 | store        *none*       *none*         - global   - !nontemporal                  - !nontemporal | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3377 | - generic | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3378 | - private    1. buffer/global/flat_store     1. buffer/global/flat_store | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3379 | - constant | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3380 | - nontemporal                   - nontemporal | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3381 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3382 | 1. buffer/global/flat_store      1. buffer/global/flat_store | 
|  | 3383 | glc=1 slc=1                      slc=1 | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3384 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3385 | store        *none*       *none*         - local    1. ds_store                     1. ds_store | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3386 | **Unordered Atomic** | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3387 | ---------------------------------------------------------------------------------------------------------------------- | 
|  | 3388 | load atomic  unordered    *any*          *any*      *Same as non-atomic*.           *Same as non-atomic*. | 
|  | 3389 | store atomic unordered    *any*          *any*      *Same as non-atomic*.           *Same as non-atomic*. | 
|  | 3390 | atomicrmw    unordered    *any*          *any*      *Same as monotonic              *Same as monotonic | 
|  | 3391 | atomic*.                        atomic*. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3392 | **Monotonic Atomic** | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3393 | ---------------------------------------------------------------------------------------------------------------------- | 
|  | 3394 | load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load      1. buffer/global/flat_load | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3395 | - wavefront    - generic | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3396 | load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load      1. buffer/global/flat_load | 
|  | 3397 | - generic                                     glc=1 | 
|  | 3398 |  | 
|  | 3399 | - If CU wavefront execution mode, omit glc=1. | 
|  | 3400 |  | 
|  | 3401 | load atomic  monotonic    - singlethread - local    1. ds_load                      1. ds_load | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3402 | - wavefront | 
|  | 3403 | - workgroup | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3404 | load atomic  monotonic    - agent        - global   1. buffer/global/flat_load      1. buffer/global/flat_load | 
|  | 3405 | - system       - generic     glc=1                           glc=1 dlc=1 | 
|  | 3406 | store atomic monotonic    - singlethread - global   1. buffer/global/flat_store     1. buffer/global/flat_store | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3407 | - wavefront    - generic | 
|  | 3408 | - workgroup | 
|  | 3409 | - agent | 
|  | 3410 | - system | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3411 | store atomic monotonic    - singlethread - local    1. ds_store                     1. ds_store | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3412 | - wavefront | 
|  | 3413 | - workgroup | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3414 | atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic    1. buffer/global/flat_atomic | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3415 | - wavefront    - generic | 
|  | 3416 | - workgroup | 
|  | 3417 | - agent | 
|  | 3418 | - system | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3419 | atomicrmw    monotonic    - singlethread - local    1. ds_atomic                    1. ds_atomic | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3420 | - wavefront | 
|  | 3421 | - workgroup | 
|  | 3422 | **Acquire Atomic** | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3423 | ---------------------------------------------------------------------------------------------------------------------- | 
|  | 3424 | load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load   1. buffer/global/ds/flat_load | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3425 | - wavefront    - local | 
|  | 3426 | - generic | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3427 | load atomic  acquire      - workgroup    - global   1. buffer/global/flat_load      1. buffer/global_load glc=1 | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3428 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3429 | - If CU wavefront execution mode, omit glc=1. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3430 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3431 | 2. s_waitcnt vmcnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3432 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3433 | - If CU wavefront execution mode, omit. | 
|  | 3434 | - Must happen before | 
|  | 3435 | the following buffer_gl0_inv | 
|  | 3436 | and before any following | 
|  | 3437 | global/generic | 
|  | 3438 | load/load | 
|  | 3439 | atomic/store/store | 
|  | 3440 | atomic/atomicrmw. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3441 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3442 | 3. buffer_gl0_inv | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3443 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3444 | - If CU wavefront execution mode, omit. | 
|  | 3445 | - Ensures that | 
|  | 3446 | following | 
|  | 3447 | loads will not see | 
|  | 3448 | stale data. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3449 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3450 | load atomic  acquire      - workgroup    - local    1. ds_load                      1. ds_load | 
|  | 3451 | 2. s_waitcnt lgkmcnt(0)         2. s_waitcnt lgkmcnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3452 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3453 | - If OpenCL, omit.              - If OpenCL, omit. | 
|  | 3454 | - Must happen before            - Must happen before | 
|  | 3455 | any following                   the following buffer_gl0_inv | 
|  | 3456 | global/generic                  and before any following | 
|  | 3457 | load/load                       global/generic load/load | 
|  | 3458 | atomic/store/store              atomic/store/store | 
|  | 3459 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 3460 | - Ensures any                   - Ensures any | 
|  | 3461 | following global                following global | 
|  | 3462 | data read is no                 data read is no | 
|  | 3463 | older than the load             older than the load | 
|  | 3464 | atomic value being              atomic value being | 
|  | 3465 | acquired.                       acquired. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3466 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3467 | 3. buffer_gl0_inv | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3468 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3469 | - If CU wavefront execution mode, omit. | 
|  | 3470 | - If OpenCL, omit. | 
|  | 3471 | - Ensures that | 
|  | 3472 | following | 
|  | 3473 | loads will not see | 
|  | 3474 | stale data. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3475 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3476 | load atomic  acquire      - workgroup    - generic  1. flat_load                    1. flat_load glc=1 | 
|  | 3477 |  | 
|  | 3478 | - If CU wavefront execution mode, omit glc=1. | 
|  | 3479 |  | 
|  | 3480 | 2. s_waitcnt lgkmcnt(0)         2. s_waitcnt lgkmcnt(0) & | 
|  | 3481 | vmcnt(0) | 
|  | 3482 |  | 
|  | 3483 | - If CU wavefront execution mode, omit vmcnt. | 
|  | 3484 | - If OpenCL, omit.              - If OpenCL, omit | 
|  | 3485 | lgkmcnt(0). | 
|  | 3486 | - Must happen before            - Must happen before | 
|  | 3487 | any following                   the following | 
|  | 3488 | global/generic                  buffer_gl0_inv and any | 
|  | 3489 | load/load                       following global/generic | 
|  | 3490 | atomic/store/store              load/load | 
|  | 3491 | atomic/atomicrmw.               atomic/store/store | 
|  | 3492 | atomic/atomicrmw. | 
|  | 3493 | - Ensures any                   - Ensures any | 
|  | 3494 | following global                following global | 
|  | 3495 | data read is no                 data read is no | 
|  | 3496 | older than the load             older than the load | 
|  | 3497 | atomic value being              atomic value being | 
|  | 3498 | acquired.                       acquired. | 
|  | 3499 |  | 
|  | 3500 | 3. buffer_gl0_inv | 
|  | 3501 |  | 
|  | 3502 | - If CU wavefront execution mode, omit. | 
|  | 3503 | - Ensures that | 
|  | 3504 | following | 
|  | 3505 | loads will not see | 
|  | 3506 | stale data. | 
|  | 3507 |  | 
|  | 3508 | load atomic  acquire      - agent        - global   1. buffer/global/flat_load      1. buffer/global_load | 
|  | 3509 | - system                     glc=1                           glc=1 dlc=1 | 
|  | 3510 | 2. s_waitcnt vmcnt(0)           2. s_waitcnt vmcnt(0) | 
|  | 3511 |  | 
|  | 3512 | - Must happen before            - Must happen before | 
|  | 3513 | following                       following | 
|  | 3514 | buffer_wbinvl1_vol.             buffer_gl*_inv. | 
|  | 3515 | - Ensures the load              - Ensures the load | 
|  | 3516 | has completed                   has completed | 
|  | 3517 | before invalidating             before invalidating | 
|  | 3518 | the cache.                      the caches. | 
|  | 3519 |  | 
|  | 3520 | 3. buffer_wbinvl1_vol           3. buffer_gl0_inv; | 
|  | 3521 | buffer_gl1_inv | 
|  | 3522 |  | 
|  | 3523 | - Must happen before            - Must happen before | 
|  | 3524 | any following                   any following | 
|  | 3525 | global/generic                  global/generic | 
|  | 3526 | load/load                       load/load | 
|  | 3527 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 3528 | - Ensures that                  - Ensures that | 
|  | 3529 | following                       following | 
|  | 3530 | loads will not see              loads will not see | 
|  | 3531 | stale global data.              stale global data. | 
|  | 3532 |  | 
|  | 3533 | load atomic  acquire      - agent        - generic  1. flat_load glc=1              1. flat_load glc=1 dlc=1 | 
|  | 3534 | - system                  2. s_waitcnt vmcnt(0) &         2. s_waitcnt vmcnt(0) & | 
|  | 3535 | lgkmcnt(0)                      lgkmcnt(0) | 
|  | 3536 |  | 
|  | 3537 | - If OpenCL omit                - If OpenCL omit | 
|  | 3538 | lgkmcnt(0).                     lgkmcnt(0). | 
|  | 3539 | - Must happen before            - Must happen before | 
|  | 3540 | following                       following | 
|  | 3541 | buffer_wbinvl1_vol.             buffer_gl*_inv. | 
|  | 3542 | - Ensures the flat_load         - Ensures the flat_load | 
|  | 3543 | has completed                   has completed | 
|  | 3544 | before invalidating             before invalidating | 
|  | 3545 | the cache.                      the caches. | 
|  | 3546 |  | 
|  | 3547 | 3. buffer_wbinvl1_vol           3. buffer_gl0_inv; | 
|  | 3548 | buffer_gl1_inv | 
|  | 3549 |  | 
|  | 3550 | - Must happen before            - Must happen before | 
|  | 3551 | any following                   any following | 
|  | 3552 | global/generic                  global/generic | 
|  | 3553 | load/load                       load/load | 
|  | 3554 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 3555 | - Ensures that                  - Ensures that | 
|  | 3556 | following loads                 following loads | 
|  | 3557 | will not see stale              will not see stale | 
|  | 3558 | global data.                    global data. | 
|  | 3559 |  | 
|  | 3560 | atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3561 | - wavefront    - local | 
|  | 3562 | - generic | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3563 | atomicrmw    acquire      - workgroup    - global   1. buffer/global/flat_atomic    1. buffer/global_atomic | 
|  | 3564 | 2. s_waitcnt vm/vscnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3565 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3566 | - If CU wavefront execution mode, omit. | 
|  | 3567 | - Use vmcnt if atomic with | 
|  | 3568 | return and vscnt if atomic | 
|  | 3569 | with no-return. | 
|  | 3570 | - Must happen before | 
|  | 3571 | the following buffer_gl0_inv | 
|  | 3572 | and before any following | 
|  | 3573 | global/generic | 
|  | 3574 | load/load | 
|  | 3575 | atomic/store/store |
|  | 3576 | atomic/atomicrmw. | 
|  | 3577 |  | 
|  | 3578 | 3. buffer_gl0_inv | 
|  | 3579 |  | 
|  | 3580 | - If CU wavefront execution mode, omit. | 
|  | 3581 | - Ensures that | 
|  | 3582 | following | 
|  | 3583 | loads will not see | 
|  | 3584 | stale data. | 
|  | 3585 |  | 
|  | 3586 | atomicrmw    acquire      - workgroup    - local    1. ds_atomic                    1. ds_atomic | 
|  | 3587 | 2. s_waitcnt lgkmcnt(0)         2. s_waitcnt lgkmcnt(0) |
|  | 3588 |  | 
|  | 3589 | - If OpenCL, omit.              - If OpenCL, omit. | 
|  | 3590 | - Must happen before            - Must happen before | 
|  | 3591 | any following                   the following | 
|  | 3592 | global/generic                  buffer_gl0_inv. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3593 | load/load | 
|  | 3594 | atomic/store/store | 
|  | 3595 | atomic/atomicrmw. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3596 | - Ensures any                   - Ensures any | 
|  | 3597 | following global                following global | 
|  | 3598 | data read is no                 data read is no | 
|  | 3599 | older than the                  older than the | 
|  | 3600 | atomicrmw value                 atomicrmw value | 
|  | 3601 | being acquired.                 being acquired. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3602 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3603 | 3. buffer_gl0_inv | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3604 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3605 | - If OpenCL, omit. |
|  | 3606 | - Ensures that | 
|  | 3607 | following | 
|  | 3608 | loads will not see | 
|  | 3609 | stale data. | 
|  | 3610 |  | 
|  | 3611 | atomicrmw    acquire      - workgroup    - generic  1. flat_atomic                  1. flat_atomic | 
|  | 3612 | 2. s_waitcnt lgkmcnt(0)         2. s_waitcnt lgkmcnt(0) & |
|  | 3613 | vm/vscnt(0) | 
|  | 3614 |  | 
|  | 3615 | - If CU wavefront execution mode, omit vm/vscnt. | 
|  | 3616 | - If OpenCL, omit.              - If OpenCL, omit | 
|  | 3617 | s_waitcnt lgkmcnt(0). |
|  | 3618 | - Use vmcnt if atomic with | 
|  | 3619 | return and vscnt if atomic | 
|  | 3620 | with no-return. | 
|  | 3622 | - Must happen before            - Must happen before | 
|  | 3623 | any following                   the following | 
|  | 3624 | global/generic                  buffer_gl0_inv. | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3625 | load/load | 
|  | 3626 | atomic/store/store | 
|  | 3627 | atomic/atomicrmw. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3628 | - Ensures any                   - Ensures any | 
|  | 3629 | following global                following global | 
|  | 3630 | data read is no                 data read is no | 
|  | 3631 | older than the                  older than the | 
|  | 3632 | atomicrmw value                 atomicrmw value | 
|  | 3633 | being acquired.                 being acquired. | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3634 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3635 | 3. buffer_gl0_inv | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3636 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3637 | - If CU wavefront execution mode, omit. | 
|  | 3638 | - Ensures that | 
|  | 3639 | following | 
|  | 3640 | loads will not see | 
|  | 3641 | stale data. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3642 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3643 | atomicrmw    acquire      - agent        - global   1. buffer/global/flat_atomic    1. buffer/global_atomic | 
|  | 3644 | - system                  2. s_waitcnt vmcnt(0)           2. s_waitcnt vm/vscnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3645 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3646 | - Use vmcnt if atomic with | 
|  | 3647 | return and vscnt if atomic | 
|  | 3648 | with no-return. | 
|  | 3650 | - Must happen before            - Must happen before | 
|  | 3651 | following                       following | 
|  | 3652 | buffer_wbinvl1_vol.             buffer_gl*_inv. | 
|  | 3653 | - Ensures the                   - Ensures the | 
|  | 3654 | atomicrmw has                   atomicrmw has | 
|  | 3655 | completed before                completed before | 
|  | 3656 | invalidating the                invalidating the | 
|  | 3657 | cache.                          caches. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3658 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3659 | 3. buffer_wbinvl1_vol           3. buffer_gl0_inv; | 
|  | 3660 | buffer_gl1_inv | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3661 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3662 | - Must happen before            - Must happen before | 
|  | 3663 | any following                   any following | 
|  | 3664 | global/generic                  global/generic | 
|  | 3665 | load/load                       load/load | 
|  | 3666 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 3667 | - Ensures that                  - Ensures that | 
|  | 3668 | following loads                 following loads | 
|  | 3669 | will not see stale              will not see stale | 
|  | 3670 | global data.                    global data. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3671 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3672 | atomicrmw    acquire      - agent        - generic  1. flat_atomic                  1. flat_atomic | 
|  | 3673 | - system                  2. s_waitcnt vmcnt(0) &         2. s_waitcnt vm/vscnt(0) & | 
|  | 3674 | lgkmcnt(0)                      lgkmcnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3675 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3676 | - If OpenCL, omit               - If OpenCL, omit | 
|  | 3677 | lgkmcnt(0).                     lgkmcnt(0). | 
|  | 3678 | - Use vmcnt if atomic with | 
|  | 3679 | return and vscnt if atomic | 
|  | 3680 | with no-return. | 
|  | 3681 | - Must happen before            - Must happen before | 
|  | 3682 | following                       following | 
|  | 3683 | buffer_wbinvl1_vol.             buffer_gl*_inv. | 
|  | 3684 | - Ensures the                   - Ensures the | 
|  | 3685 | atomicrmw has                   atomicrmw has | 
|  | 3686 | completed before                completed before | 
|  | 3687 | invalidating the                invalidating the | 
|  | 3688 | cache.                          caches. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3689 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3690 | 3. buffer_wbinvl1_vol           3. buffer_gl0_inv; | 
|  | 3691 | buffer_gl1_inv | 
|  | 3692 |  | 
|  | 3693 | - Must happen before            - Must happen before | 
|  | 3694 | any following                   any following | 
|  | 3695 | global/generic                  global/generic | 
|  | 3696 | load/load                       load/load | 
|  | 3697 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 3698 | - Ensures that                  - Ensures that | 
|  | 3699 | following loads                 following loads | 
|  | 3700 | will not see stale              will not see stale | 
|  | 3701 | global data.                    global data. | 
|  | 3702 |  | 
|  | 3703 | fence        acquire      - singlethread *none*     *none*                          *none* | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3704 | - wavefront | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3705 | fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) & | 
|  | 3706 | vmcnt(0) & vscnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3707 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3708 | - If CU wavefront execution mode, omit vmcnt and | 
|  | 3709 | vscnt. | 
|  | 3710 | - If OpenCL and                 - If OpenCL and | 
|  | 3711 | address space is                address space is | 
|  | 3712 | not generic, omit.              not generic, omit | 
|  | 3713 | lgkmcnt(0). | 
|  | 3714 | - If OpenCL and | 
|  | 3715 | address space is | 
|  | 3716 | local, omit | 
|  | 3717 | vmcnt(0) and vscnt(0). | 
|  | 3718 | - However, since LLVM           - However, since LLVM | 
|  | 3719 | currently has no                currently has no | 
|  | 3720 | address space on                address space on | 
|  | 3721 | the fence, need to              the fence, need to |
|  | 3722 | conservatively                  conservatively | 
|  | 3723 | always generate. If             always generate. If | 
|  | 3724 | fence had an                    fence had an | 
|  | 3725 | address space then              address space then | 
|  | 3726 | set to address                  set to address | 
|  | 3727 | space of OpenCL                 space of OpenCL | 
|  | 3728 | fence flag, or to               fence flag, or to | 
|  | 3729 | generic if both                 generic if both | 
|  | 3730 | local and global                local and global | 
|  | 3731 | flags are                       flags are | 
|  | 3732 | specified.                      specified. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3733 | - Must happen after | 
|  | 3734 | any preceding | 
|  | 3735 | local/generic load | 
|  | 3736 | atomic/atomicrmw | 
|  | 3737 | with an equal or | 
|  | 3738 | wider sync scope | 
|  | 3739 | and memory ordering | 
|  | 3740 | stronger than | 
|  | 3741 | unordered (this is | 
|  | 3742 | termed the | 
|  | 3743 | fence-paired-atomic). | 
|  | 3744 | - Must happen before | 
|  | 3745 | any following | 
|  | 3746 | global/generic | 
|  | 3747 | load/load | 
|  | 3748 | atomic/store/store | 
|  | 3749 | atomic/atomicrmw. | 
|  | 3750 | - Ensures any | 
|  | 3751 | following global | 
|  | 3752 | data read is no | 
|  | 3753 | older than the | 
|  | 3754 | value read by the | 
|  | 3755 | fence-paired-atomic. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3756 | - Could be split into | 
|  | 3757 | separate s_waitcnt | 
|  | 3758 | vmcnt(0), s_waitcnt | 
|  | 3759 | vscnt(0) and s_waitcnt | 
|  | 3760 | lgkmcnt(0) to allow | 
|  | 3761 | them to be | 
|  | 3762 | independently moved | 
|  | 3763 | according to the | 
|  | 3764 | following rules. | 
|  | 3765 | - s_waitcnt vmcnt(0) | 
|  | 3766 | must happen after | 
|  | 3767 | any preceding | 
|  | 3768 | global/generic load | 
|  | 3769 | atomic/ | 
|  | 3770 | atomicrmw-with-return-value | 
|  | 3771 | with an equal or | 
|  | 3772 | wider sync scope | 
|  | 3773 | and memory ordering | 
|  | 3774 | stronger than | 
|  | 3775 | unordered (this is | 
|  | 3776 | termed the | 
|  | 3777 | fence-paired-atomic). | 
|  | 3778 | - s_waitcnt vscnt(0) | 
|  | 3779 | must happen after | 
|  | 3780 | any preceding | 
|  | 3781 | global/generic | 
|  | 3782 | atomicrmw-no-return-value | 
|  | 3783 | with an equal or | 
|  | 3784 | wider sync scope | 
|  | 3785 | and memory ordering | 
|  | 3786 | stronger than | 
|  | 3787 | unordered (this is | 
|  | 3788 | termed the | 
|  | 3789 | fence-paired-atomic). | 
|  | 3790 | - s_waitcnt lgkmcnt(0) | 
|  | 3791 | must happen after | 
|  | 3792 | any preceding | 
|  | 3793 | local/generic load | 
|  | 3794 | atomic/atomicrmw | 
|  | 3795 | with an equal or | 
|  | 3796 | wider sync scope | 
|  | 3797 | and memory ordering | 
|  | 3798 | stronger than | 
|  | 3799 | unordered (this is | 
|  | 3800 | termed the | 
|  | 3801 | fence-paired-atomic). | 
|  | 3802 | - Must happen before | 
|  | 3803 | the following | 
|  | 3804 | buffer_gl0_inv. | 
|  | 3805 | - Ensures that the | 
|  | 3806 | fence-paired-atomic |
|  | 3807 | has completed | 
|  | 3808 | before invalidating | 
|  | 3809 | the | 
|  | 3810 | cache. Therefore | 
|  | 3811 | any following | 
|  | 3812 | locations read must | 
|  | 3813 | be no older than | 
|  | 3814 | the value read by | 
|  | 3815 | the | 
|  | 3816 | fence-paired-atomic. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3817 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3818 | 2. buffer_gl0_inv |
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3819 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3820 | - If CU wavefront execution mode, omit. | 
|  | 3821 | - Ensures that | 
|  | 3822 | following | 
|  | 3823 | loads will not see | 
|  | 3824 | stale data. | 
|  | 3825 |  | 
|  | 3826 | fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) & | 
|  | 3827 | - system                     vmcnt(0)                        vmcnt(0) & vscnt(0) | 
|  | 3828 |  | 
|  | 3829 | - If OpenCL and                 - If OpenCL and | 
|  | 3830 | address space is                address space is | 
|  | 3831 | not generic, omit               not generic, omit | 
|  | 3832 | lgkmcnt(0).                     lgkmcnt(0). | 
|  | 3833 | - If OpenCL and | 
|  | 3834 | address space is | 
|  | 3835 | local, omit | 
|  | 3836 | vmcnt(0) and vscnt(0). | 
|  | 3837 | - However, since LLVM           - However, since LLVM | 
|  | 3838 | currently has no                currently has no | 
|  | 3839 | address space on                address space on | 
|  | 3840 | the fence, need to              the fence, need to |
|  | 3841 | conservatively                  conservatively | 
|  | 3842 | always generate                 always generate | 
|  | 3843 | (see comment for                (see comment for | 
|  | 3844 | previous fence).                previous fence). | 
| Tony Tye | d9c251f | 2017-06-07 00:08:35 +0000 | [diff] [blame] | 3845 | - Could be split into | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3846 | separate s_waitcnt | 
|  | 3847 | vmcnt(0) and | 
|  | 3848 | s_waitcnt | 
|  | 3849 | lgkmcnt(0) to allow | 
|  | 3850 | them to be | 
|  | 3851 | independently moved | 
|  | 3852 | according to the | 
|  | 3853 | following rules. | 
|  | 3854 | - s_waitcnt vmcnt(0) | 
|  | 3855 | must happen after | 
|  | 3856 | any preceding | 
|  | 3857 | global/generic load | 
|  | 3858 | atomic/atomicrmw | 
|  | 3859 | with an equal or | 
|  | 3860 | wider sync scope | 
|  | 3861 | and memory ordering | 
|  | 3862 | stronger than | 
|  | 3863 | unordered (this is | 
|  | 3864 | termed the | 
|  | 3865 | fence-paired-atomic). | 
|  | 3866 | - s_waitcnt lgkmcnt(0) | 
|  | 3867 | must happen after | 
|  | 3868 | any preceding | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3869 | local/generic load | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3870 | atomic/atomicrmw | 
|  | 3871 | with an equal or | 
|  | 3872 | wider sync scope | 
|  | 3873 | and memory ordering | 
|  | 3874 | stronger than | 
|  | 3875 | unordered (this is | 
|  | 3876 | termed the | 
|  | 3877 | fence-paired-atomic). | 
|  | 3878 | - Must happen before | 
|  | 3879 | the following | 
|  | 3880 | buffer_wbinvl1_vol. | 
|  | 3881 | - Ensures that the | 
|  | 3882 | fence-paired-atomic |
|  | 3883 | has completed | 
|  | 3884 | before invalidating | 
|  | 3885 | the | 
|  | 3886 | cache. Therefore | 
|  | 3887 | any following | 
|  | 3888 | locations read must | 
|  | 3889 | be no older than | 
|  | 3890 | the value read by | 
|  | 3891 | the | 
|  | 3892 | fence-paired-atomic. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3893 | - Could be split into | 
|  | 3894 | separate s_waitcnt | 
|  | 3895 | vmcnt(0), s_waitcnt | 
|  | 3896 | vscnt(0) and s_waitcnt | 
|  | 3897 | lgkmcnt(0) to allow | 
|  | 3898 | them to be | 
|  | 3899 | independently moved | 
|  | 3900 | according to the | 
|  | 3901 | following rules. | 
|  | 3902 | - s_waitcnt vmcnt(0) | 
|  | 3903 | must happen after | 
|  | 3904 | any preceding | 
|  | 3905 | global/generic load | 
|  | 3906 | atomic/ | 
|  | 3907 | atomicrmw-with-return-value | 
|  | 3908 | with an equal or | 
|  | 3909 | wider sync scope | 
|  | 3910 | and memory ordering | 
|  | 3911 | stronger than | 
|  | 3912 | unordered (this is | 
|  | 3913 | termed the | 
|  | 3914 | fence-paired-atomic). | 
|  | 3915 | - s_waitcnt vscnt(0) | 
|  | 3916 | must happen after | 
|  | 3917 | any preceding | 
|  | 3918 | global/generic | 
|  | 3919 | atomicrmw-no-return-value | 
|  | 3920 | with an equal or | 
|  | 3921 | wider sync scope | 
|  | 3922 | and memory ordering | 
|  | 3923 | stronger than | 
|  | 3924 | unordered (this is | 
|  | 3925 | termed the | 
|  | 3926 | fence-paired-atomic). | 
|  | 3927 | - s_waitcnt lgkmcnt(0) | 
|  | 3928 | must happen after | 
|  | 3929 | any preceding | 
|  | 3930 | local/generic load | 
|  | 3931 | atomic/atomicrmw | 
|  | 3932 | with an equal or | 
|  | 3933 | wider sync scope | 
|  | 3934 | and memory ordering | 
|  | 3935 | stronger than | 
|  | 3936 | unordered (this is | 
|  | 3937 | termed the | 
|  | 3938 | fence-paired-atomic). | 
|  | 3939 | - Must happen before | 
|  | 3940 | the following | 
|  | 3941 | buffer_gl*_inv. | 
|  | 3942 | - Ensures that the | 
|  | 3943 | fence-paired-atomic |
|  | 3944 | has completed | 
|  | 3945 | before invalidating | 
|  | 3946 | the | 
|  | 3947 | caches. Therefore | 
|  | 3948 | any following | 
|  | 3949 | locations read must | 
|  | 3950 | be no older than | 
|  | 3951 | the value read by | 
|  | 3952 | the | 
|  | 3953 | fence-paired-atomic. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3954 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3955 | 2. buffer_wbinvl1_vol           2. buffer_gl0_inv; | 
|  | 3956 | buffer_gl1_inv | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3957 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3958 | - Must happen before any        - Must happen before any | 
|  | 3959 | following global/generic        following global/generic | 
|  | 3960 | load/load                       load/load | 
|  | 3961 | atomic/store/store              atomic/store/store | 
|  | 3962 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 3963 | - Ensures that                  - Ensures that | 
|  | 3964 | following loads                 following loads | 
|  | 3965 | will not see stale              will not see stale | 
|  | 3966 | global data.                    global data. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3967 |  | 
|  | 3968 | **Release Atomic** | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3969 | ---------------------------------------------------------------------------------------------------------------------- | 
|  | 3970 | store atomic release      - singlethread - global   1. buffer/global/ds/flat_store  1. buffer/global/ds/flat_store | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3971 | - wavefront    - local | 
|  | 3972 | - generic | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3973 | store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) & | 
|  | 3974 | vmcnt(0) & vscnt(0) | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 3975 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3976 | - If CU wavefront execution mode, omit vmcnt and | 
|  | 3977 | vscnt. | 
|  | 3978 | - If OpenCL, omit.              - If OpenCL, omit | 
|  | 3979 | lgkmcnt(0). | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 3980 | - Must happen after | 
|  | 3981 | any preceding | 
|  | 3982 | local/generic | 
|  | 3983 | load/store/load | 
|  | 3984 | atomic/store | 
|  | 3985 | atomic/atomicrmw. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 3986 | - Could be split into | 
|  | 3987 | separate s_waitcnt | 
|  | 3988 | vmcnt(0), s_waitcnt | 
|  | 3989 | vscnt(0) and s_waitcnt | 
|  | 3990 | lgkmcnt(0) to allow | 
|  | 3991 | them to be | 
|  | 3992 | independently moved | 
|  | 3993 | according to the | 
|  | 3994 | following rules. | 
|  | 3995 | - s_waitcnt vmcnt(0) | 
|  | 3996 | must happen after | 
|  | 3997 | any preceding | 
|  | 3998 | global/generic load/load | 
|  | 3999 | atomic/ | 
|  | 4000 | atomicrmw-with-return-value. | 
|  | 4001 | - s_waitcnt vscnt(0) | 
|  | 4002 | must happen after | 
|  | 4003 | any preceding | 
|  | 4004 | global/generic | 
|  | 4005 | store/store | 
|  | 4006 | atomic/ | 
|  | 4007 | atomicrmw-no-return-value. | 
|  | 4008 | - s_waitcnt lgkmcnt(0) | 
|  | 4009 | must happen after | 
|  | 4010 | any preceding | 
|  | 4011 | local/generic | 
|  | 4012 | load/store/load | 
|  | 4013 | atomic/store | 
|  | 4014 | atomic/atomicrmw. | 
|  | 4015 | - Must happen before            - Must happen before | 
|  | 4016 | the following                   the following | 
|  | 4017 | store.                          store. | 
|  | 4018 | - Ensures that all              - Ensures that all | 
|  | 4019 | memory operations               memory operations | 
|  | 4020 | to local have                   have | 
|  | 4021 | completed before                completed before | 
|  | 4022 | performing the                  performing the | 
|  | 4023 | store that is being             store that is being | 
|  | 4024 | released.                       released. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4025 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4026 | 2. buffer/global/flat_store     2. buffer/global_store | 
|  | 4027 | store atomic release      - workgroup    - local                                    1. s_waitcnt vmcnt(0) & vscnt(0) |
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 4028 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4029 | - If CU wavefront execution mode, omit. | 
|  | 4030 | - If OpenCL, omit. | 
|  | 4031 | - Could be split into | 
|  | 4032 | separate s_waitcnt | 
|  | 4033 | vmcnt(0) and s_waitcnt | 
|  | 4034 | vscnt(0) to allow | 
|  | 4035 | them to be | 
|  | 4036 | independently moved | 
|  | 4037 | according to the | 
|  | 4038 | following rules. | 
|  | 4039 | - s_waitcnt vmcnt(0) | 
|  | 4040 | must happen after | 
|  | 4041 | any preceding | 
|  | 4042 | global/generic load/load | 
|  | 4043 | atomic/ | 
|  | 4044 | atomicrmw-with-return-value. | 
|  | 4045 | - s_waitcnt vscnt(0) | 
|  | 4046 | must happen after | 
|  | 4047 | any preceding | 
|  | 4048 | global/generic | 
|  | 4049 | store/store atomic/ | 
|  | 4050 | atomicrmw-no-return-value. | 
|  | 4051 | - Must happen before | 
|  | 4052 | the following | 
|  | 4053 | store. | 
|  | 4054 | - Ensures that all | 
|  | 4055 | global memory | 
|  | 4056 | operations have | 
|  | 4057 | completed before | 
|  | 4058 | performing the | 
|  | 4059 | store that is being | 
|  | 4060 | released. | 
|  | 4061 |  | 
|  | 4062 | 1. ds_store                     2. ds_store | 
|  | 4063 | store atomic release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) & | 
|  | 4064 | vmcnt(0) & vscnt(0) | 
|  | 4065 |  | 
|  | 4066 | - If CU wavefront execution mode, omit vmcnt and | 
|  | 4067 | vscnt. | 
|  | 4068 | - If OpenCL, omit.              - If OpenCL, omit | 
|  | 4069 | lgkmcnt(0). | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 4070 | - Must happen after | 
|  | 4071 | any preceding | 
|  | 4072 | local/generic | 
|  | 4073 | load/store/load | 
|  | 4074 | atomic/store | 
|  | 4075 | atomic/atomicrmw. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4076 | - Could be split into | 
|  | 4077 | separate s_waitcnt | 
|  | 4078 | vmcnt(0), s_waitcnt | 
|  | 4079 | vscnt(0) and s_waitcnt | 
|  | 4080 | lgkmcnt(0) to allow | 
|  | 4081 | them to be | 
|  | 4082 | independently moved | 
|  | 4083 | according to the | 
|  | 4084 | following rules. | 
|  | 4085 | - s_waitcnt vmcnt(0) | 
|  | 4086 | must happen after | 
|  | 4087 | any preceding | 
|  | 4088 | global/generic load/load | 
|  | 4089 | atomic/ | 
|  | 4090 | atomicrmw-with-return-value. | 
|  | 4091 | - s_waitcnt vscnt(0) | 
|  | 4092 | must happen after | 
|  | 4093 | any preceding | 
|  | 4094 | global/generic | 
|  | 4095 | store/store | 
|  | 4096 | atomic/ | 
|  | 4097 | atomicrmw-no-return-value. | 
|  | 4098 | - s_waitcnt lgkmcnt(0) | 
|  | 4099 | must happen after | 
|  | 4100 | any preceding | 
|  | 4101 | local/generic load/store/load | 
|  | 4102 | atomic/store atomic/atomicrmw. | 
|  | 4103 | - Must happen before            - Must happen before | 
|  | 4104 | the following                   the following | 
|  | 4105 | store.                          store. | 
|  | 4106 | - Ensures that all              - Ensures that all | 
|  | 4107 | memory operations               memory operations | 
|  | 4108 | to local have                   have | 
|  | 4109 | completed before                completed before | 
|  | 4110 | performing the                  performing the | 
|  | 4111 | store that is being             store that is being | 
|  | 4112 | released.                       released. | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 4113 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4114 | 2. flat_store                   2. flat_store | 
|  | 4115 | store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &         1. s_waitcnt lgkmcnt(0) & | 
|  | 4116 | - system       - generic     vmcnt(0)                          vmcnt(0) & vscnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4117 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4118 | - If OpenCL, omit               - If OpenCL, omit | 
|  | 4119 | lgkmcnt(0).                     lgkmcnt(0). | 
|  | 4120 | - Could be split into           - Could be split into | 
|  | 4121 | separate s_waitcnt              separate s_waitcnt | 
|  | 4122 | vmcnt(0) and                    vmcnt(0), s_waitcnt vscnt(0) | 
|  | 4123 | s_waitcnt                       and s_waitcnt | 
|  | 4124 | lgkmcnt(0) to allow             lgkmcnt(0) to allow | 
|  | 4125 | them to be                      them to be | 
|  | 4126 | independently moved             independently moved | 
|  | 4127 | according to the                according to the | 
|  | 4128 | following rules.                following rules. | 
|  | 4129 | - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0) | 
|  | 4130 | must happen after               must happen after | 
|  | 4131 | any preceding                   any preceding | 
|  | 4132 | global/generic                  global/generic | 
|  | 4133 | load/store/load                 load/load | 
|  | 4134 | atomic/store                    atomic/ | 
|  | 4135 | atomic/atomicrmw.               atomicrmw-with-return-value. | 
|  | 4136 | - s_waitcnt vscnt(0) | 
|  | 4137 | must happen after | 
|  | 4138 | any preceding | 
|  | 4139 | global/generic | 
|  | 4140 | store/store atomic/ | 
|  | 4141 | atomicrmw-no-return-value. | 
|  | 4142 | - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0) | 
|  | 4143 | must happen after               must happen after | 
|  | 4144 | any preceding                   any preceding | 
|  | 4145 | local/generic                   local/generic | 
|  | 4146 | load/store/load                 load/store/load | 
|  | 4147 | atomic/store                    atomic/store | 
|  | 4148 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 4149 | - Must happen before            - Must happen before | 
|  | 4150 | the following                   the following | 
|  | 4151 | store.                          store. | 
|  | 4152 | - Ensures that all              - Ensures that all | 
|  | 4153 | memory operations               memory operations | 
|  | 4154 | to memory have                  to memory have | 
|  | 4155 | completed before                completed before | 
|  | 4156 | performing the                  performing the | 
|  | 4157 | store that is being             store that is being | 
|  | 4158 | released.                       released. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4159 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4160 | 2. buffer/global/ds/flat_store  2. buffer/global/ds/flat_store | 
|  | 4161 | atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4162 | - wavefront    - local | 
|  | 4163 | - generic | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4164 | atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) & | 
|  | 4165 | vmcnt(0) & vscnt(0) | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 4166 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4167 | - If CU wavefront execution mode, omit vmcnt and | 
|  | 4168 | vscnt. | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 4169 | - If OpenCL, omit. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4170 |  | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4171 | - Must happen after | 
|  | 4172 | any preceding | 
|  | 4173 | local/generic | 
|  | 4174 | load/store/load | 
|  | 4175 | atomic/store | 
|  | 4176 | atomic/atomicrmw. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4177 | - Could be split into | 
|  | 4178 | separate s_waitcnt | 
|  | 4179 | vmcnt(0), s_waitcnt | 
|  | 4180 | vscnt(0) and s_waitcnt | 
|  | 4181 | lgkmcnt(0) to allow | 
|  | 4182 | them to be | 
|  | 4183 | independently moved | 
|  | 4184 | according to the | 
|  | 4185 | following rules. | 
|  | 4186 | - s_waitcnt vmcnt(0) | 
|  | 4187 | must happen after | 
|  | 4188 | any preceding | 
|  | 4189 | global/generic load/load | 
|  | 4190 | atomic/ | 
|  | 4191 | atomicrmw-with-return-value. | 
|  | 4192 | - s_waitcnt vscnt(0) | 
|  | 4193 | must happen after | 
|  | 4194 | any preceding | 
|  | 4195 | global/generic | 
|  | 4196 | store/store | 
|  | 4197 | atomic/ | 
|  | 4198 | atomicrmw-no-return-value. | 
|  | 4199 | - s_waitcnt lgkmcnt(0) | 
|  | 4200 | must happen after | 
|  | 4201 | any preceding | 
|  | 4202 | local/generic | 
|  | 4203 | load/store/load | 
|  | 4204 | atomic/store | 
|  | 4205 | atomic/atomicrmw. | 
|  | 4206 | - Must happen before            - Must happen before | 
|  | 4207 | the following                   the following | 
|  | 4208 | atomicrmw.                      atomicrmw. | 
|  | 4209 | - Ensures that all              - Ensures that all | 
|  | 4210 | memory operations               memory operations | 
|  | 4211 | to local have                   have | 
|  | 4212 | completed before                completed before | 
|  | 4213 | performing the                  performing the | 
|  | 4214 | atomicrmw that is               atomicrmw that is | 
|  | 4215 | being released.                 being released. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4216 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4217 | 2. buffer/global/flat_atomic    2. buffer/global_atomic | 
|  | 4218 | atomicrmw    release      - workgroup    - local                                    1. s_waitcnt vmcnt(0) & vscnt(0) |
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 4219 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4220 | - If CU wavefront execution mode, omit. | 
|  | 4221 | - If OpenCL, omit. | 
|  | 4222 | - Could be split into | 
|  | 4223 | separate s_waitcnt | 
|  | 4224 | vmcnt(0) and s_waitcnt | 
|  | 4225 | vscnt(0) to allow | 
|  | 4226 | them to be | 
|  | 4227 | independently moved | 
|  | 4228 | according to the | 
|  | 4229 | following rules. | 
|  | 4230 | - s_waitcnt vmcnt(0) | 
|  | 4231 | must happen after | 
|  | 4232 | any preceding | 
|  | 4233 | global/generic load/load | 
|  | 4234 | atomic/ | 
|  | 4235 | atomicrmw-with-return-value. | 
|  | 4236 | - s_waitcnt vscnt(0) | 
|  | 4237 | must happen after | 
|  | 4238 | any preceding | 
|  | 4239 | global/generic | 
|  | 4240 | store/store atomic/ | 
|  | 4241 | atomicrmw-no-return-value. | 
|  | 4242 | - Must happen before |
|  | 4243 | the following |
|  | 4244 | atomicrmw. |
|  | 4245 | - Ensures that all |
|  | 4246 | global memory |
|  | 4247 | operations have |
|  | 4248 | completed before |
|  | 4249 | performing the |
|  | 4250 | atomicrmw that is being |
|  | 4251 | released. |
|  | 4252 |  | 
|  | 4253 | 1. ds_atomic                    2. ds_atomic | 
|  | 4254 | atomicrmw    release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) & | 
|  | 4255 | vmcnt(0) & vscnt(0) | 
|  | 4256 |  | 
|  | 4257 | - If CU wavefront execution mode, omit vmcnt and | 
|  | 4258 | vscnt. | 
|  | 4259 | - If OpenCL, omit.              - If OpenCL, omit | 
|  | 4260 | s_waitcnt lgkmcnt(0). |
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 4261 | - Must happen after | 
|  | 4262 | any preceding | 
|  | 4263 | local/generic | 
|  | 4264 | load/store/load | 
|  | 4265 | atomic/store | 
|  | 4266 | atomic/atomicrmw. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4267 | - Could be split into | 
|  | 4268 | separate s_waitcnt | 
|  | 4269 | vmcnt(0), s_waitcnt | 
|  | 4270 | vscnt(0) and s_waitcnt | 
|  | 4271 | lgkmcnt(0) to allow | 
|  | 4272 | them to be | 
|  | 4273 | independently moved | 
|  | 4274 | according to the | 
|  | 4275 | following rules. | 
|  | 4276 | - s_waitcnt vmcnt(0) | 
|  | 4277 | must happen after | 
|  | 4278 | any preceding | 
|  | 4279 | global/generic load/load | 
|  | 4280 | atomic/ | 
|  | 4281 | atomicrmw-with-return-value. | 
|  | 4282 | - s_waitcnt vscnt(0) | 
|  | 4283 | must happen after | 
|  | 4284 | any preceding | 
|  | 4285 | global/generic | 
|  | 4286 | store/store | 
|  | 4287 | atomic/ | 
|  | 4288 | atomicrmw-no-return-value. | 
|  | 4289 | - s_waitcnt lgkmcnt(0) | 
|  | 4290 | must happen after | 
|  | 4291 | any preceding | 
|  | 4292 | local/generic load/store/load | 
|  | 4293 | atomic/store atomic/atomicrmw. | 
|  | 4294 | - Must happen before            - Must happen before | 
|  | 4295 | the following                   the following | 
|  | 4296 | atomicrmw.                      atomicrmw. | 
|  | 4297 | - Ensures that all              - Ensures that all | 
|  | 4298 | memory operations               memory operations | 
|  | 4299 | to local have                   have | 
|  | 4300 | completed before                completed before | 
|  | 4301 | performing the                  performing the | 
|  | 4302 | atomicrmw that is               atomicrmw that is | 
|  | 4303 | being released.                 being released. | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 4304 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4305 | 2. flat_atomic                  2. flat_atomic | 
|  | 4306 | atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) & |
|  | 4307 | - system       - generic     vmcnt(0)                         vmcnt(0) & vscnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4308 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4309 | - If OpenCL, omit               - If OpenCL, omit | 
|  | 4310 | lgkmcnt(0).                     lgkmcnt(0). | 
|  | 4311 | - Could be split into           - Could be split into | 
|  | 4312 | separate s_waitcnt              separate s_waitcnt | 
|  | 4313 | vmcnt(0) and                    vmcnt(0), s_waitcnt | 
|  | 4314 | s_waitcnt                       vscnt(0) and s_waitcnt | 
|  | 4315 | lgkmcnt(0) to allow             lgkmcnt(0) to allow | 
|  | 4316 | them to be                      them to be | 
|  | 4317 | independently moved             independently moved | 
|  | 4318 | according to the                according to the | 
|  | 4319 | following rules.                following rules. | 
|  | 4320 | - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0) | 
|  | 4321 | must happen after               must happen after | 
|  | 4322 | any preceding                   any preceding | 
|  | 4323 | global/generic                  global/generic | 
|  | 4324 | load/store/load                 load/load atomic/ | 
|  | 4325 | atomic/store                    atomicrmw-with-return-value. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4326 | atomic/atomicrmw. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4327 | - s_waitcnt vscnt(0) | 
|  | 4328 | must happen after | 
|  | 4329 | any preceding | 
|  | 4330 | global/generic | 
|  | 4331 | store/store atomic/ | 
|  | 4332 | atomicrmw-no-return-value. | 
|  | 4333 | - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0) | 
|  | 4334 | must happen after               must happen after | 
|  | 4335 | any preceding                   any preceding | 
|  | 4336 | local/generic                   local/generic | 
|  | 4337 | load/store/load                 load/store/load | 
|  | 4338 | atomic/store                    atomic/store | 
|  | 4339 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 4340 | - Must happen before            - Must happen before | 
|  | 4341 | the following                   the following | 
|  | 4342 | atomicrmw.                      atomicrmw. | 
|  | 4343 | - Ensures that all              - Ensures that all | 
|  | 4344 | memory operations               memory operations | 
|  | 4345 | to global and local             to global and local | 
|  | 4346 | have completed                  have completed | 
|  | 4347 | before performing               before performing | 
|  | 4348 | the atomicrmw that              the atomicrmw that | 
|  | 4349 | is being released.              is being released. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4350 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4351 | 2. buffer/global/ds/flat_atomic 2. buffer/global/ds/flat_atomic | 
|  | 4352 | fence        release      - singlethread *none*     *none*                          *none* | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4353 | - wavefront | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4354 | fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) & | 
|  | 4355 | vmcnt(0) & vscnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4356 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4357 | - If CU wavefront execution mode, omit vmcnt and | 
|  | 4358 | vscnt. | 
|  | 4359 | - If OpenCL and                 - If OpenCL and | 
|  | 4360 | address space is                address space is | 
|  | 4361 | not generic, omit.              not generic, omit | 
|  | 4362 | lgkmcnt(0). | 
|  | 4363 | - If OpenCL and | 
|  | 4364 | address space is | 
|  | 4365 | local, omit | 
|  | 4366 | vmcnt(0) and vscnt(0). | 
|  | 4367 | - However, since LLVM           - However, since LLVM | 
|  | 4368 | currently has no                currently has no | 
|  | 4369 | address space on                address space on | 
|  | 4370 | the fence, it must              the fence, it must |
|  | 4371 | conservatively                  conservatively |
|  | 4372 | always be generated. If         always be generated. If |
|  | 4373 | fence had an                    fence had an | 
|  | 4374 | address space then              address space then | 
|  | 4375 | set to address                  set to address | 
|  | 4376 | space of OpenCL                 space of OpenCL | 
|  | 4377 | fence flag, or to               fence flag, or to | 
|  | 4378 | generic if both                 generic if both | 
|  | 4379 | local and global                local and global | 
|  | 4380 | flags are                       flags are | 
|  | 4381 | specified.                      specified. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4382 | - Must happen after | 
|  | 4383 | any preceding | 
|  | 4384 | local/generic | 
|  | 4385 | load/load | 
|  | 4386 | atomic/store/store | 
|  | 4387 | atomic/atomicrmw. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4388 | - Could be split into | 
|  | 4389 | separate s_waitcnt | 
|  | 4390 | vmcnt(0), s_waitcnt | 
|  | 4391 | vscnt(0) and s_waitcnt | 
|  | 4392 | lgkmcnt(0) to allow | 
|  | 4393 | them to be | 
|  | 4394 | independently moved | 
|  | 4395 | according to the | 
|  | 4396 | following rules. | 
|  | 4397 | - s_waitcnt vmcnt(0) | 
|  | 4398 | must happen after | 
|  | 4399 | any preceding | 
|  | 4400 | global/generic | 
|  | 4401 | load/load | 
|  | 4402 | atomic/ | 
|  | 4403 | atomicrmw-with-return-value. | 
|  | 4404 | - s_waitcnt vscnt(0) | 
|  | 4405 | must happen after | 
|  | 4406 | any preceding | 
|  | 4407 | global/generic | 
|  | 4408 | store/store atomic/ | 
|  | 4409 | atomicrmw-no-return-value. | 
|  | 4410 | - s_waitcnt lgkmcnt(0) | 
|  | 4411 | must happen after | 
|  | 4412 | any preceding | 
|  | 4413 | local/generic | 
|  | 4414 | load/store/load | 
|  | 4415 | atomic/store atomic/ | 
|  | 4416 | atomicrmw. | 
|  | 4417 | - Must happen before            - Must happen before | 
|  | 4418 | any following store             any following store | 
|  | 4419 | atomic/atomicrmw                atomic/atomicrmw | 
|  | 4420 | with an equal or                with an equal or | 
|  | 4421 | wider sync scope                wider sync scope | 
|  | 4422 | and memory ordering             and memory ordering | 
|  | 4423 | stronger than                   stronger than | 
|  | 4424 | unordered (this is              unordered (this is | 
|  | 4425 | termed the                      termed the | 
|  | 4426 | fence-paired-atomic).           fence-paired-atomic). | 
|  | 4427 | - Ensures that all              - Ensures that all | 
|  | 4428 | memory operations               memory operations | 
|  | 4429 | to local have                   have | 
|  | 4430 | completed before                completed before | 
|  | 4431 | performing the                  performing the | 
|  | 4432 | following                       following | 
|  | 4433 | fence-paired-atomic.            fence-paired-atomic. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4434 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4435 | fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) & | 
|  | 4436 | - system                     vmcnt(0)                        vmcnt(0) & vscnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4437 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4438 | - If OpenCL and                 - If OpenCL and | 
|  | 4439 | address space is                address space is | 
|  | 4440 | not generic, omit               not generic, omit | 
|  | 4441 | lgkmcnt(0).                     lgkmcnt(0). | 
|  | 4442 | - If OpenCL and                 - If OpenCL and | 
|  | 4443 | address space is                address space is | 
|  | 4444 | local, omit                     local, omit | 
|  | 4445 | vmcnt(0).                       vmcnt(0) and vscnt(0). | 
|  | 4446 | - However, since LLVM           - However, since LLVM | 
|  | 4447 | currently has no                currently has no | 
|  | 4448 | address space on                address space on | 
|  | 4449 | the fence, it must              the fence, it must |
|  | 4450 | conservatively                  conservatively |
|  | 4451 | always be generated. If         always be generated. If |
|  | 4452 | fence had an                    fence had an | 
|  | 4453 | address space then              address space then | 
|  | 4454 | set to address                  set to address | 
|  | 4455 | space of OpenCL                 space of OpenCL | 
|  | 4456 | fence flag, or to               fence flag, or to | 
|  | 4457 | generic if both                 generic if both | 
|  | 4458 | local and global                local and global | 
|  | 4459 | flags are                       flags are | 
|  | 4460 | specified.                      specified. | 
|  | 4461 | - Could be split into           - Could be split into | 
|  | 4462 | separate s_waitcnt              separate s_waitcnt | 
|  | 4463 | vmcnt(0) and                    vmcnt(0), s_waitcnt | 
|  | 4464 | s_waitcnt                       vscnt(0) and s_waitcnt | 
|  | 4465 | lgkmcnt(0) to allow             lgkmcnt(0) to allow | 
|  | 4466 | them to be                      them to be | 
|  | 4467 | independently moved             independently moved | 
|  | 4468 | according to the                according to the | 
|  | 4469 | following rules.                following rules. | 
|  | 4470 | - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0) | 
|  | 4471 | must happen after               must happen after | 
|  | 4472 | any preceding                   any preceding | 
|  | 4473 | global/generic                  global/generic | 
|  | 4474 | load/store/load                 load/load atomic/ | 
|  | 4475 | atomic/store                    atomicrmw-with-return-value. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4476 | atomic/atomicrmw. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4477 | - s_waitcnt vscnt(0) | 
|  | 4478 | must happen after | 
|  | 4479 | any preceding | 
|  | 4480 | global/generic | 
|  | 4481 | store/store atomic/ | 
|  | 4482 | atomicrmw-no-return-value. | 
|  | 4483 | - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0) | 
|  | 4484 | must happen after               must happen after | 
|  | 4485 | any preceding                   any preceding | 
|  | 4486 | local/generic                   local/generic | 
|  | 4487 | load/store/load                 load/store/load | 
|  | 4488 | atomic/store                    atomic/store | 
|  | 4489 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 4490 | - Must happen before            - Must happen before | 
|  | 4491 | any following store             any following store | 
|  | 4492 | atomic/atomicrmw                atomic/atomicrmw | 
|  | 4493 | with an equal or                with an equal or | 
|  | 4494 | wider sync scope                wider sync scope | 
|  | 4495 | and memory ordering             and memory ordering | 
|  | 4496 | stronger than                   stronger than | 
|  | 4497 | unordered (this is              unordered (this is | 
|  | 4498 | termed the                      termed the | 
|  | 4499 | fence-paired-atomic).           fence-paired-atomic). | 
|  | 4500 | - Ensures that all              - Ensures that all | 
|  | 4501 | memory operations               memory operations | 
|  | 4502 | have                            have | 
|  | 4503 | completed before                completed before | 
|  | 4504 | performing the                  performing the | 
|  | 4505 | following                       following | 
|  | 4506 | fence-paired-atomic.            fence-paired-atomic. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4507 |  | 
|  | 4508 | **Acquire-Release Atomic** | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4509 | ---------------------------------------------------------------------------------------------------------------------- | 
|  | 4510 | atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4511 | - wavefront    - local | 
|  | 4512 | - generic | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4513 | atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) & | 
|  | 4514 | vmcnt(0) & vscnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4515 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4516 | - If CU wavefront execution mode, omit vmcnt and | 
|  | 4517 | vscnt. | 
|  | 4518 | - If OpenCL, omit.              - If OpenCL, omit | 
|  | 4519 | s_waitcnt lgkmcnt(0). | 
|  | 4520 | - Must happen after             - Must happen after | 
|  | 4521 | any preceding                   any preceding | 
|  | 4522 | local/generic                   local/generic | 
|  | 4523 | load/store/load                 load/store/load | 
|  | 4524 | atomic/store                    atomic/store | 
|  | 4525 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 4526 | - Could be split into | 
|  | 4527 | separate s_waitcnt | 
|  | 4528 | vmcnt(0), s_waitcnt | 
|  | 4529 | vscnt(0) and s_waitcnt | 
|  | 4530 | lgkmcnt(0) to allow | 
|  | 4531 | them to be | 
|  | 4532 | independently moved | 
|  | 4533 | according to the | 
|  | 4534 | following rules. | 
|  | 4535 | - s_waitcnt vmcnt(0) | 
|  | 4536 | must happen after | 
|  | 4537 | any preceding | 
|  | 4538 | global/generic load/load | 
|  | 4539 | atomic/ | 
|  | 4540 | atomicrmw-with-return-value. | 
|  | 4541 | - s_waitcnt vscnt(0) | 
|  | 4542 | must happen after | 
|  | 4543 | any preceding | 
|  | 4544 | global/generic | 
|  | 4545 | store/store | 
|  | 4546 | atomic/ | 
|  | 4547 | atomicrmw-no-return-value. | 
|  | 4548 | - s_waitcnt lgkmcnt(0) | 
|  | 4549 | must happen after | 
|  | 4550 | any preceding | 
|  | 4551 | local/generic load/store/load | 
|  | 4552 | atomic/store atomic/atomicrmw. | 
|  | 4553 | - Must happen before            - Must happen before | 
|  | 4554 | the following                   the following | 
|  | 4555 | atomicrmw.                      atomicrmw. | 
|  | 4556 | - Ensures that all              - Ensures that all | 
|  | 4557 | memory operations               memory operations | 
|  | 4558 | to local have                   have | 
|  | 4559 | completed before                completed before | 
|  | 4560 | performing the                  performing the | 
|  | 4561 | atomicrmw that is               atomicrmw that is | 
|  | 4562 | being released.                 being released. | 
|  | 4563 |  | 
|  | 4564 | 2. buffer/global/flat_atomic    2. buffer/global_atomic | 
|  | 4565 | 3. s_waitcnt vm/vscnt(0) | 
|  | 4566 |  | 
|  | 4567 | - If CU wavefront execution mode, omit vm/vscnt. | 
|  | 4568 | - Use vmcnt if atomic with | 
|  | 4569 | return and vscnt if atomic | 
|  | 4570 | with no-return. | 
|  | 4572 | - Must happen before | 
|  | 4573 | the following | 
|  | 4574 | buffer_gl0_inv. | 
|  | 4575 | - Ensures any | 
|  | 4576 | following global | 
|  | 4577 | data read is no | 
|  | 4578 | older than the | 
|  | 4579 | atomicrmw value | 
|  | 4580 | being acquired. | 
|  | 4581 |  | 
|  | 4582 | 4. buffer_gl0_inv | 
|  | 4583 |  | 
|  | 4584 | - If CU wavefront execution mode, omit. | 
|  | 4585 | - Ensures that | 
|  | 4586 | following | 
|  | 4587 | loads will not see | 
|  | 4588 | stale data. | 
|  | 4589 |  | 
|  | 4590 | atomicrmw    acq_rel      - workgroup    - local                                    1. s_waitcnt vmcnt(0) & vscnt(0) |
|  | 4591 |  | 
|  | 4592 | - If CU wavefront execution mode, omit. | 
|  | 4593 | - If OpenCL, omit. | 
|  | 4594 | - Could be split into | 
|  | 4595 | separate s_waitcnt | 
|  | 4596 | vmcnt(0) and s_waitcnt | 
|  | 4597 | vscnt(0) to allow | 
|  | 4598 | them to be | 
|  | 4599 | independently moved | 
|  | 4600 | according to the | 
|  | 4601 | following rules. | 
|  | 4602 | - s_waitcnt vmcnt(0) | 
|  | 4603 | must happen after | 
|  | 4604 | any preceding | 
|  | 4605 | global/generic load/load | 
|  | 4606 | atomic/ | 
|  | 4607 | atomicrmw-with-return-value. | 
|  | 4608 | - s_waitcnt vscnt(0) | 
|  | 4609 | must happen after | 
|  | 4610 | any preceding | 
|  | 4611 | global/generic | 
|  | 4612 | store/store atomic/ | 
|  | 4613 | atomicrmw-no-return-value. | 
|  | 4614 | - Must happen before |
|  | 4615 | the following |
|  | 4616 | atomicrmw. |
|  | 4617 | - Ensures that all |
|  | 4618 | global memory |
|  | 4619 | operations have |
|  | 4620 | completed before |
|  | 4621 | performing the |
|  | 4622 | atomicrmw that is being |
|  | 4623 | released. |
|  | 4624 |  | 
|  | 4625 | 1. ds_atomic                    2. ds_atomic | 
|  | 4626 | 2. s_waitcnt lgkmcnt(0)         3. s_waitcnt lgkmcnt(0) | 
|  | 4627 |  | 
|  | 4628 | - If OpenCL, omit.              - If OpenCL, omit. | 
|  | 4629 | - Must happen before            - Must happen before | 
|  | 4630 | any following                   the following | 
|  | 4631 | global/generic                  buffer_gl0_inv. | 
|  | 4632 | load/load | 
|  | 4633 | atomic/store/store | 
|  | 4634 | atomic/atomicrmw. | 
|  | 4635 | - Ensures any                   - Ensures any | 
|  | 4636 | following global                following global | 
|  | 4637 | data read is no                 data read is no | 
|  | 4638 | older than the                  older than the |
|  | 4639 | atomicrmw value being           atomicrmw value being |
|  | 4640 | acquired.                       acquired. | 
|  | 4641 |  | 
|  | 4642 | 4. buffer_gl0_inv | 
|  | 4643 |  | 
|  | 4644 | - If CU wavefront execution mode, omit. | 
|  | 4645 | - If OpenCL, omit. |
|  | 4646 | - Ensures that | 
|  | 4647 | following | 
|  | 4648 | loads will not see | 
|  | 4649 | stale data. | 
|  | 4650 |  | 
|  | 4651 | atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) & | 
|  | 4652 | vmcnt(0) & vscnt(0) | 
|  | 4653 |  | 
|  | 4654 | - If CU wavefront execution mode, omit vmcnt and | 
|  | 4655 | vscnt. | 
|  | 4656 | - If OpenCL, omit.              - If OpenCL, omit | 
|  | 4657 | s_waitcnt lgkmcnt(0). |
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4658 | - Must happen after | 
|  | 4659 | any preceding | 
|  | 4660 | local/generic | 
|  | 4661 | load/store/load | 
|  | 4662 | atomic/store | 
|  | 4663 | atomic/atomicrmw. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4664 | - Could be split into | 
|  | 4665 | separate s_waitcnt | 
|  | 4666 | vmcnt(0), s_waitcnt | 
|  | 4667 | vscnt(0) and s_waitcnt | 
|  | 4668 | lgkmcnt(0) to allow | 
|  | 4669 | them to be | 
|  | 4670 | independently moved | 
|  | 4671 | according to the | 
|  | 4672 | following rules. | 
|  | 4673 | - s_waitcnt vmcnt(0) | 
|  | 4674 | must happen after | 
|  | 4675 | any preceding | 
|  | 4676 | global/generic load/load | 
|  | 4677 | atomic/ | 
|  | 4678 | atomicrmw-with-return-value. | 
|  | 4679 | - s_waitcnt vscnt(0) | 
|  | 4680 | must happen after | 
|  | 4681 | any preceding | 
|  | 4682 | global/generic | 
|  | 4683 | store/store | 
|  | 4684 | atomic/ | 
|  | 4685 | atomicrmw-no-return-value. | 
|  | 4686 | - s_waitcnt lgkmcnt(0) | 
|  | 4687 | must happen after | 
|  | 4688 | any preceding | 
|  | 4689 | local/generic load/store/load | 
|  | 4690 | atomic/store atomic/atomicrmw. | 
|  | 4691 | - Must happen before            - Must happen before | 
|  | 4692 | the following                   the following | 
|  | 4693 | atomicrmw.                      atomicrmw. | 
|  | 4694 | - Ensures that all              - Ensures that all | 
|  | 4695 | memory operations               memory operations | 
|  | 4696 | to local have                   have | 
|  | 4697 | completed before                completed before | 
|  | 4698 | performing the                  performing the | 
|  | 4699 | atomicrmw that is               atomicrmw that is | 
|  | 4700 | being released.                 being released. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4701 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4702 | 2. flat_atomic                  2. flat_atomic | 
|  | 4703 | 3. s_waitcnt lgkmcnt(0)         3. s_waitcnt lgkmcnt(0) & | 
|  | 4704 | vm/vscnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4705 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4706 | - If CU wavefront execution mode, omit vm/vscnt. | 
|  | 4707 | - If OpenCL, omit.              - If OpenCL, omit | 
|  | 4708 | s_waitcnt lgkmcnt(0). |
|  | 4709 | - Must happen before            - Must happen before | 
|  | 4710 | any following                   the following | 
|  | 4711 | global/generic                  buffer_gl0_inv. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4712 | load/load | 
|  | 4713 | atomic/store/store | 
|  | 4714 | atomic/atomicrmw. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4715 | - Ensures any                   - Ensures any | 
|  | 4716 | following global                following global | 
|  | 4717 | data read is no                 data read is no | 
|  | 4718 | older than the                  older than the |
|  | 4719 | atomicrmw value being           atomicrmw value being |
|  | 4720 | acquired.                       acquired. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4721 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4722 | 4. buffer_gl0_inv |
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4723 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4724 | - If CU wavefront execution mode, omit. | 
|  | 4725 | - Ensures that | 
|  | 4726 | following | 
|  | 4727 | loads will not see | 
|  | 4728 | stale data. | 
|  | 4729 |  | 
|  | 4730 | atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) & | 
|  | 4731 | - system                     vmcnt(0)                        vmcnt(0) & vscnt(0) | 
|  | 4732 |  | 
|  | 4733 | - If OpenCL, omit               - If OpenCL, omit | 
|  | 4734 | lgkmcnt(0).                     lgkmcnt(0). | 
|  | 4735 | - Could be split into           - Could be split into | 
|  | 4736 | separate s_waitcnt              separate s_waitcnt | 
|  | 4737 | vmcnt(0) and                    vmcnt(0), s_waitcnt | 
|  | 4738 | s_waitcnt                       vscnt(0) and s_waitcnt | 
|  | 4739 | lgkmcnt(0) to allow             lgkmcnt(0) to allow | 
|  | 4740 | them to be                      them to be | 
|  | 4741 | independently moved             independently moved | 
|  | 4742 | according to the                according to the | 
|  | 4743 | following rules.                following rules. | 
|  | 4744 | - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0) | 
|  | 4745 | must happen after               must happen after | 
|  | 4746 | any preceding                   any preceding | 
|  | 4747 | global/generic                  global/generic | 
|  | 4748 | load/store/load                 load/load atomic/ | 
|  | 4749 | atomic/store                    atomicrmw-with-return-value. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4750 | atomic/atomicrmw. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4751 | - s_waitcnt vscnt(0) | 
|  | 4752 | must happen after | 
|  | 4753 | any preceding | 
|  | 4754 | global/generic | 
|  | 4755 | store/store atomic/ | 
|  | 4756 | atomicrmw-no-return-value. | 
|  | 4757 | - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0) | 
|  | 4758 | must happen after               must happen after | 
|  | 4759 | any preceding                   any preceding | 
|  | 4760 | local/generic                   local/generic | 
|  | 4761 | load/store/load                 load/store/load | 
|  | 4762 | atomic/store                    atomic/store | 
|  | 4763 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 4764 | - Must happen before            - Must happen before | 
|  | 4765 | the following                   the following | 
|  | 4766 | atomicrmw.                      atomicrmw. | 
|  | 4767 | - Ensures that all              - Ensures that all | 
|  | 4768 | memory operations               memory operations | 
|  | 4769 | to global have                  to global have | 
|  | 4770 | completed before                completed before | 
|  | 4771 | performing the                  performing the | 
|  | 4772 | atomicrmw that is               atomicrmw that is | 
|  | 4773 | being released.                 being released. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4774 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4775 | 2. buffer/global/flat_atomic    2. buffer/global_atomic | 
|  | 4776 | 3. s_waitcnt vmcnt(0)           3. s_waitcnt vm/vscnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4777 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4778 | - Use vmcnt if atomic with | 
|  | 4779 | return and vscnt if atomic | 
|  | 4780 | with no-return. | 
|  | 4782 | - Must happen before            - Must happen before | 
|  | 4783 | following                       following | 
|  | 4784 | buffer_wbinvl1_vol.             buffer_gl*_inv. | 
|  | 4785 | - Ensures the                   - Ensures the | 
|  | 4786 | atomicrmw has                   atomicrmw has | 
|  | 4787 | completed before                completed before | 
|  | 4788 | invalidating the                invalidating the | 
|  | 4789 | cache.                          caches. | 
|  | 4790 |  | 
|  | 4791 | 4. buffer_wbinvl1_vol           4. buffer_gl0_inv; | 
|  | 4792 | buffer_gl1_inv | 
|  | 4793 |  | 
|  | 4794 | - Must happen before            - Must happen before | 
|  | 4795 | any following                   any following | 
|  | 4796 | global/generic                  global/generic | 
|  | 4797 | load/load                       load/load | 
|  | 4798 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 4799 | - Ensures that                  - Ensures that | 
|  | 4800 | following loads                 following loads | 
|  | 4801 | will not see stale              will not see stale | 
|  | 4802 | global data.                    global data. | 
|  | 4803 |  | 
|  | 4804 | atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) & | 
|  | 4805 | - system                     vmcnt(0)                        vmcnt(0) & vscnt(0) | 
|  | 4806 |  | 
|  | 4807 | - If OpenCL, omit               - If OpenCL, omit | 
|  | 4808 | lgkmcnt(0).                     lgkmcnt(0). | 
|  | 4809 | - Could be split into           - Could be split into | 
|  | 4810 | separate s_waitcnt              separate s_waitcnt | 
|  | 4811 | vmcnt(0) and                    vmcnt(0), s_waitcnt | 
|  | 4812 | s_waitcnt                       vscnt(0) and s_waitcnt | 
|  | 4813 | lgkmcnt(0) to allow             lgkmcnt(0) to allow | 
|  | 4814 | them to be                      them to be | 
|  | 4815 | independently moved             independently moved | 
|  | 4816 | according to the                according to the | 
|  | 4817 | following rules.                following rules. | 
|  | 4818 | - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0) | 
|  | 4819 | must happen after               must happen after | 
|  | 4820 | any preceding                   any preceding | 
|  | 4821 | global/generic                  global/generic | 
|  | 4822 | load/store/load                 load/load atomic/ | 
|  | 4823 | atomic/store                    atomicrmw-with-return-value. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4824 | atomic/atomicrmw. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4825 | - s_waitcnt vscnt(0) | 
|  | 4826 | must happen after | 
|  | 4827 | any preceding | 
|  | 4828 | global/generic | 
|  | 4829 | store/store atomic/ | 
|  | 4830 | atomicrmw-no-return-value. | 
|  | 4831 | - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0) | 
|  | 4832 | must happen after               must happen after | 
|  | 4833 | any preceding                   any preceding | 
|  | 4834 | local/generic                   local/generic | 
|  | 4835 | load/store/load                 load/store/load | 
|  | 4836 | atomic/store                    atomic/store | 
|  | 4837 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 4838 | - Must happen before            - Must happen before | 
|  | 4839 | the following                   the following | 
|  | 4840 | atomicrmw.                      atomicrmw. | 
|  | 4841 | - Ensures that all              - Ensures that all | 
|  | 4842 | memory operations               memory operations | 
|  | 4843 | to global have                  have | 
|  | 4844 | completed before                completed before | 
|  | 4845 | performing the                  performing the | 
|  | 4846 | atomicrmw that is               atomicrmw that is | 
|  | 4847 | being released.                 being released. | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 4848 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4849 | 2. flat_atomic                  2. flat_atomic | 
|  | 4850 | 3. s_waitcnt vmcnt(0) &         3. s_waitcnt vm/vscnt(0) & | 
|  | 4851 | lgkmcnt(0)                      lgkmcnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4852 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4853 | - If OpenCL, omit               - If OpenCL, omit | 
|  | 4854 | lgkmcnt(0).                     lgkmcnt(0). | 
|  | 4855 | - Use vmcnt if atomic with | 
|  | 4856 | return and vscnt if atomic | 
|  | 4857 | with no-return. | 
|  | 4858 | - Must happen before            - Must happen before | 
|  | 4859 | following                       following | 
|  | 4860 | buffer_wbinvl1_vol.             buffer_gl*_inv. | 
|  | 4861 | - Ensures the                   - Ensures the | 
|  | 4862 | atomicrmw has                   atomicrmw has | 
|  | 4863 | completed before                completed before | 
|  | 4864 | invalidating the                invalidating the | 
|  | 4865 | cache.                          caches. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4866 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4867 | 4. buffer_wbinvl1_vol           4. buffer_gl0_inv; | 
|  | 4868 | buffer_gl1_inv | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4869 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4870 | - Must happen before            - Must happen before | 
|  | 4871 | any following                   any following | 
|  | 4872 | global/generic                  global/generic | 
|  | 4873 | load/load                       load/load | 
|  | 4874 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 4875 | - Ensures that                  - Ensures that | 
|  | 4876 | following loads                 following loads | 
|  | 4877 | will not see stale              will not see stale | 
|  | 4878 | global data.                    global data. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4879 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4880 | fence        acq_rel      - singlethread *none*     *none*                          *none* | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4881 | - wavefront | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4882 | fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) & | 
|  | 4883 | vmcnt(0) & vscnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4884 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4885 | - If CU wavefront execution mode, omit vmcnt and | 
|  | 4886 | vscnt. | 
|  | 4887 | - If OpenCL and                 - If OpenCL and | 
|  | 4888 | address space is                address space is | 
|  | 4889 | not generic, omit.              not generic, omit | 
|  | 4890 | lgkmcnt(0). | 
|  | 4891 | - If OpenCL and | 
|  | 4892 | address space is | 
|  | 4893 | local, omit | 
|  | 4894 | vmcnt(0) and vscnt(0). | 
|  | 4895 | - However,                      - However, | 
|  | 4896 | since LLVM                      since LLVM | 
|  | 4897 | currently has no                currently has no | 
|  | 4898 | address space on                address space on | 
|  | 4899 | the fence, the waits            the fence, the waits | 
|  | 4900 | must conservatively             must conservatively | 
|  | 4901 | always be generated             always be generated | 
|  | 4902 | (see comment for                (see comment for | 
|  | 4903 | previous fence).                previous fence). | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 4904 | - Must happen after | 
|  | 4905 | any preceding | 
|  | 4906 | local/generic | 
|  | 4907 | load/load | 
|  | 4908 | atomic/store/store | 
|  | 4909 | atomic/atomicrmw. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 4910 | - Could be split into | 
|  | 4911 | separate s_waitcnt | 
|  | 4912 | vmcnt(0), s_waitcnt | 
|  | 4913 | vscnt(0) and s_waitcnt | 
|  | 4914 | lgkmcnt(0) to allow | 
|  | 4915 | them to be | 
|  | 4916 | independently moved | 
|  | 4917 | according to the | 
|  | 4918 | following rules. | 
|  | 4919 | - s_waitcnt vmcnt(0) | 
|  | 4920 | must happen after | 
|  | 4921 | any preceding | 
|  | 4922 | global/generic | 
|  | 4923 | load/load | 
|  | 4924 | atomic/ | 
|  | 4925 | atomicrmw-with-return-value. | 
|  | 4926 | - s_waitcnt vscnt(0) | 
|  | 4927 | must happen after | 
|  | 4928 | any preceding | 
|  | 4929 | global/generic | 
|  | 4930 | store/store atomic/ | 
|  | 4931 | atomicrmw-no-return-value. | 
|  | 4932 | - s_waitcnt lgkmcnt(0) | 
|  | 4933 | must happen after | 
|  | 4934 | any preceding | 
|  | 4935 | local/generic | 
|  | 4936 | load/store/load | 
|  | 4937 | atomic/store atomic/ | 
|  | 4938 | atomicrmw. | 
|  | 4939 | - Must happen before            - Must happen before | 
|  | 4940 | any following                   any following | 
|  | 4941 | global/generic                  global/generic | 
|  | 4942 | load/load                       load/load | 
|  | 4943 | atomic/store/store              atomic/store/store | 
|  | 4944 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 4945 | - Ensures that all              - Ensures that all | 
|  | 4946 | memory operations               memory operations | 
|  | 4947 | to local have                   have | 
|  | 4948 | completed before                completed before | 
|  | 4949 | performing any                  performing any | 
|  | 4950 | following global                following global | 
|  | 4951 | memory operations.              memory operations. | 
|  | 4952 | - Ensures that the              - Ensures that the | 
|  | 4953 | preceding                       preceding | 
|  | 4954 | local/generic load              local/generic load | 
|  | 4955 | atomic/atomicrmw                atomic/atomicrmw | 
|  | 4956 | with an equal or                with an equal or | 
|  | 4957 | wider sync scope                wider sync scope | 
|  | 4958 | and memory ordering             and memory ordering | 
|  | 4959 | stronger than                   stronger than | 
|  | 4960 | unordered (this is              unordered (this is | 
|  | 4961 | termed the                      termed the | 
|  | 4962 | acquire-fence-paired-atomic     acquire-fence-paired-atomic | 
|  | 4963 | ) has completed                 ) has completed | 
|  | 4964 | before following                before following | 
|  | 4965 | global memory                   global memory | 
|  | 4966 | operations. This                operations. This | 
|  | 4967 | satisfies the                   satisfies the | 
|  | 4968 | requirements of                 requirements of | 
|  | 4969 | acquire.                        acquire. | 
|  | 4970 | - Ensures that all              - Ensures that all | 
|  | 4971 | previous memory                 previous memory | 
|  | 4972 | operations have                 operations have | 
|  | 4973 | completed before a              completed before a | 
|  | 4974 | following                       following | 
|  | 4975 | local/generic store             local/generic store | 
|  | 4976 | atomic/atomicrmw                atomic/atomicrmw | 
|  | 4977 | with an equal or                with an equal or | 
|  | 4978 | wider sync scope                wider sync scope | 
|  | 4979 | and memory ordering             and memory ordering | 
|  | 4980 | stronger than                   stronger than | 
|  | 4981 | unordered (this is              unordered (this is | 
|  | 4982 | termed the                      termed the | 
|  | 4983 | release-fence-paired-atomic     release-fence-paired-atomic | 
|  | 4984 | ). This satisfies the           ). This satisfies the | 
|  | 4985 | requirements of                 requirements of | 
|  | 4986 | release.                        release. | 
|  | 4987 | - Must happen before | 
|  | 4988 | the following | 
|  | 4989 | buffer_gl0_inv. | 
|  | 4990 | - Ensures that the | 
|  | 4991 | acquire-fence-paired-atomic | 
|  | 4992 | has completed | 
|  | 4993 | before invalidating | 
|  | 4994 | the | 
|  | 4995 | cache. Therefore | 
|  | 4996 | any following | 
|  | 4997 | locations read must | 
|  | 4998 | be no older than | 
|  | 4999 | the value read by | 
|  | 5000 | the | 
|  | 5001 | acquire-fence-paired-atomic. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 5002 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 5003 | 2. buffer_gl0_inv | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 5004 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 5005 | - If CU wavefront execution mode, omit. | 
|  | 5006 | - Ensures that | 
|  | 5007 | following | 
|  | 5008 | loads will not see | 
|  | 5009 | stale data. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 5010 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 5011 | fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) & | 
|  | 5012 | - system                     vmcnt(0)                        vmcnt(0) & vscnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 5013 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 5014 | - If OpenCL and                 - If OpenCL and | 
|  | 5015 | address space is                address space is | 
|  | 5016 | not generic, omit               not generic, omit | 
|  | 5017 | lgkmcnt(0).                     lgkmcnt(0). | 
|  | 5018 | - If OpenCL and | 
|  | 5019 | address space is | 
|  | 5020 | local, omit | 
|  | 5021 | vmcnt(0) and vscnt(0). | 
|  | 5022 | - However, since LLVM           - However, since LLVM | 
|  | 5023 | currently has no                currently has no | 
|  | 5024 | address space on                address space on | 
|  | 5025 | the fence, the waits            the fence, the waits | 
|  | 5026 | must conservatively             must conservatively | 
|  | 5027 | always be generated             always be generated | 
|  | 5028 | (see comment for                (see comment for | 
|  | 5029 | previous fence).                previous fence). | 
|  | 5030 | - Could be split into           - Could be split into | 
|  | 5031 | separate s_waitcnt              separate s_waitcnt | 
|  | 5032 | vmcnt(0) and                    vmcnt(0), s_waitcnt | 
|  | 5033 | s_waitcnt                       vscnt(0) and s_waitcnt | 
|  | 5034 | lgkmcnt(0) to allow             lgkmcnt(0) to allow | 
|  | 5035 | them to be                      them to be | 
|  | 5036 | independently moved             independently moved | 
|  | 5037 | according to the                according to the | 
|  | 5038 | following rules.                following rules. | 
|  | 5039 | - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0) | 
|  | 5040 | must happen after               must happen after | 
|  | 5041 | any preceding                   any preceding | 
|  | 5042 | global/generic                  global/generic | 
|  | 5043 | load/store/load                 load/load | 
|  | 5044 | atomic/store                    atomic/ | 
|  | 5045 | atomic/atomicrmw.               atomicrmw-with-return-value. | 
|  | 5046 | - s_waitcnt vscnt(0) | 
|  | 5047 | must happen after | 
|  | 5048 | any preceding | 
|  | 5049 | global/generic | 
|  | 5050 | store/store atomic/ | 
|  | 5051 | atomicrmw-no-return-value. | 
|  | 5052 | - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0) | 
|  | 5053 | must happen after               must happen after | 
|  | 5054 | any preceding                   any preceding | 
|  | 5055 | local/generic                   local/generic | 
|  | 5056 | load/store/load                 load/store/load | 
|  | 5057 | atomic/store                    atomic/store | 
|  | 5058 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 5059 | - Must happen before            - Must happen before | 
|  | 5060 | the following                   the following | 
|  | 5061 | buffer_wbinvl1_vol.             buffer_gl*_inv. | 
|  | 5062 | - Ensures that the              - Ensures that the | 
|  | 5063 | preceding                       preceding | 
|  | 5064 | global/local/generic            global/local/generic | 
|  | 5065 | load                            load | 
|  | 5066 | atomic/atomicrmw                atomic/atomicrmw | 
|  | 5067 | with an equal or                with an equal or | 
|  | 5068 | wider sync scope                wider sync scope | 
|  | 5069 | and memory ordering             and memory ordering | 
|  | 5070 | stronger than                   stronger than | 
|  | 5071 | unordered (this is              unordered (this is | 
|  | 5072 | termed the                      termed the | 
|  | 5073 | acquire-fence-paired-atomic     acquire-fence-paired-atomic | 
|  | 5074 | ) has completed                 ) has completed | 
|  | 5075 | before invalidating             before invalidating | 
|  | 5076 | the cache. This                 the caches. This | 
|  | 5077 | satisfies the                   satisfies the | 
|  | 5078 | requirements of                 requirements of | 
|  | 5079 | acquire.                        acquire. | 
|  | 5080 | - Ensures that all              - Ensures that all | 
|  | 5081 | previous memory                 previous memory | 
|  | 5082 | operations have                 operations have | 
|  | 5083 | completed before a              completed before a | 
|  | 5084 | following                       following | 
|  | 5085 | global/local/generic            global/local/generic | 
|  | 5086 | store                           store | 
|  | 5087 | atomic/atomicrmw                atomic/atomicrmw | 
|  | 5088 | with an equal or                with an equal or | 
|  | 5089 | wider sync scope                wider sync scope | 
|  | 5090 | and memory ordering             and memory ordering | 
|  | 5091 | stronger than                   stronger than | 
|  | 5092 | unordered (this is              unordered (this is | 
|  | 5093 | termed the                      termed the | 
|  | 5094 | release-fence-paired-atomic     release-fence-paired-atomic | 
|  | 5095 | ). This satisfies the           ). This satisfies the | 
|  | 5096 | requirements of                 requirements of | 
|  | 5097 | release.                        release. | 
|  | 5098 |  | 
|  | 5099 | 2. buffer_wbinvl1_vol           2. buffer_gl0_inv; | 
|  | 5100 | buffer_gl1_inv | 
|  | 5101 |  | 
|  | 5102 | - Must happen before            - Must happen before | 
|  | 5103 | any following                   any following | 
|  | 5104 | global/generic                  global/generic | 
|  | 5105 | load/load                       load/load | 
|  | 5106 | atomic/store/store              atomic/store/store | 
|  | 5107 | atomic/atomicrmw.               atomic/atomicrmw. | 
|  | 5108 | - Ensures that                  - Ensures that | 
|  | 5109 | following loads                 following loads | 
|  | 5110 | will not see stale              will not see stale | 
|  | 5111 | global data. This               global data. This | 
|  | 5112 | satisfies the                   satisfies the | 
|  | 5113 | requirements of                 requirements of | 
|  | 5114 | acquire.                        acquire. | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 5115 |  | 
|  | 5116 | **Sequentially Consistent Atomic** | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 5117 | ---------------------------------------------------------------------------------------------------------------------- | 
|  | 5118 | load atomic  seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding | 
|  | 5119 | - wavefront    - local    load atomic acquire,            load atomic acquire, | 
|  | 5120 | - generic  except must generate            except must generate | 
|  | 5121 | all instructions even           all instructions even | 
|  | 5122 | for OpenCL.*                    for OpenCL.* | 
|  | 5123 | load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) & | 
|  | 5124 | - generic                                     vmcnt(0) & vscnt(0) | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 5125 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 5126 | - If CU wavefront execution mode, omit vmcnt and | 
|  | 5127 | vscnt. | 
|  | 5128 | - Could be split into | 
|  | 5129 | separate s_waitcnt | 
|  | 5130 | vmcnt(0), s_waitcnt | 
|  | 5131 | vscnt(0) and s_waitcnt | 
|  | 5132 | lgkmcnt(0) to allow | 
|  | 5133 | them to be | 
|  | 5134 | independently moved | 
|  | 5135 | according to the | 
|  | 5136 | following rules. | 
|  | 5137 | - Must                          - waitcnt lgkmcnt(0) must | 
|  | 5138 | happen after                    happen after | 
|  | 5139 | preceding                       preceding | 
|  | 5140 | global/generic load             local load | 
|  | 5141 | atomic/store                    atomic/store | 
|  | 5142 | atomic/atomicrmw                atomic/atomicrmw | 
|  | 5143 | with memory                     with memory | 
|  | 5144 | ordering of seq_cst             ordering of seq_cst | 
|  | 5145 | and with equal or               and with equal or | 
|  | 5146 | wider sync scope.               wider sync scope. | 
|  | 5147 | (Note that seq_cst              (Note that seq_cst | 
|  | 5148 | fences have their               fences have their | 
|  | 5149 | own s_waitcnt                   own s_waitcnt | 
|  | 5150 | lgkmcnt(0) and so do            lgkmcnt(0) and so do | 
|  | 5151 | not need to be                  not need to be | 
|  | 5152 | considered.)                    considered.) | 
|  | 5153 | - waitcnt vmcnt(0) | 
|  | 5154 | must happen after | 
|  | 5155 | preceding | 
|  | 5156 | global/generic load | 
|  | 5157 | atomic/ | 
|  | 5158 | atomicrmw-with-return-value | 
|  | 5159 | with memory | 
|  | 5160 | ordering of seq_cst | 
|  | 5161 | and with equal or | 
|  | 5162 | wider sync scope. | 
|  | 5163 | (Note that seq_cst | 
|  | 5164 | fences have their | 
|  | 5165 | own s_waitcnt | 
|  | 5166 | vmcnt(0) and so do | 
|  | 5167 | not need to be | 
|  | 5168 | considered.) | 
|  | 5169 | - waitcnt vscnt(0) | 
|  | 5170 | must happen after | 
|  | 5171 | preceding | 
|  | 5172 | global/generic store | 
|  | 5173 | atomic/ | 
|  | 5174 | atomicrmw-no-return-value | 
|  | 5175 | with memory | 
|  | 5176 | ordering of seq_cst | 
|  | 5177 | and with equal or | 
|  | 5178 | wider sync scope. | 
|  | 5179 | (Note that seq_cst | 
|  | 5180 | fences have their | 
|  | 5181 | own s_waitcnt | 
|  | 5182 | vscnt(0) and so do | 
|  | 5183 | not need to be | 
|  | 5184 | considered.) | 
|  | 5185 | - Ensures any                   - Ensures any | 
|  | 5186 | preceding                       preceding | 
|  | 5187 | sequentially                    sequentially | 
|  | 5188 | consistent local                consistent global/local | 
|  | 5189 | memory instructions             memory instructions | 
|  | 5190 | have completed                  have completed | 
|  | 5191 | before executing                before executing | 
|  | 5192 | this sequentially               this sequentially | 
|  | 5193 | consistent                      consistent | 
|  | 5194 | instruction. This               instruction. This | 
|  | 5195 | prevents reordering             prevents reordering | 
|  | 5196 | a seq_cst store                 a seq_cst store | 
|  | 5197 | followed by a                   followed by a | 
|  | 5198 | seq_cst load. (Note             seq_cst load. (Note | 
|  | 5199 | that seq_cst is                 that seq_cst is | 
|  | 5200 | stronger than                   stronger than | 
|  | 5201 | acquire/release as              acquire/release as | 
|  | 5202 | the reordering of               the reordering of | 
|  | 5203 | load acquire                    load acquire | 
|  | 5204 | followed by a store             followed by a store | 
|  | 5205 | release is                      release is | 
|  | 5206 | prevented by the                prevented by the | 
|  | 5207 | waitcnt of                      waitcnt of | 
|  | 5208 | the release, but                the release, but | 
|  | 5209 | there is nothing                there is nothing | 
|  | 5210 | preventing a store              preventing a store | 
|  | 5211 | release followed by             release followed by | 
|  | 5212 | load acquire from               load acquire from | 
|  | 5213 | completing out of               completing out of | 
|  | 5214 | order.)                         order.) | 
|  | 5215 |  | 
|  | 5216 | 2. *Following                   2. *Following | 
|  | 5217 | instructions same as            instructions same as | 
|  | 5218 | corresponding load              corresponding load | 
|  | 5219 | atomic acquire,                 atomic acquire, | 
|  | 5220 | except must generate            except must generate | 
|  | 5221 | all instructions even           all instructions even | 
|  | 5222 | for OpenCL.*                    for OpenCL.* | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 5223 | load atomic  seq_cst      - workgroup    - local    *Same as corresponding | 
|  | 5224 | load atomic acquire, | 
|  | 5225 | except must generate | 
|  | 5226 | all instructions even | 
|  | 5227 | for OpenCL.* | 
| Tony Tye | 6baa6d2 | 2017-10-18 22:16:55 +0000 | [diff] [blame] | 5228 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 5229 | 1. s_waitcnt vmcnt(0) & vscnt(0) | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 5230 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 5231 | - If CU wavefront execution mode, omit. | 
|  | 5232 | - Could be split into | 
|  | 5233 | separate s_waitcnt | 
|  | 5234 | vmcnt(0) and s_waitcnt | 
|  | 5235 | vscnt(0) to allow | 
|  | 5236 | them to be | 
|  | 5237 | independently moved | 
|  | 5238 | according to the | 
|  | 5239 | following rules. | 
|  | 5240 | - waitcnt vmcnt(0) | 
|  | 5241 | must happen after | 
|  | 5242 | preceding | 
|  | 5243 | global/generic load | 
|  | 5244 | atomic/ | 
|  | 5245 | atomicrmw-with-return-value | 
|  | 5246 | with memory | 
|  | 5247 | ordering of seq_cst | 
|  | 5248 | and with equal or | 
|  | 5249 | wider sync scope. | 
|  | 5250 | (Note that seq_cst | 
|  | 5251 | fences have their | 
|  | 5252 | own s_waitcnt | 
|  | 5253 | vmcnt(0) and so do | 
|  | 5254 | not need to be | 
|  | 5255 | considered.) | 
|  | 5256 | - waitcnt vscnt(0) | 
|  | 5257 | must happen after | 
|  | 5258 | preceding | 
|  | 5259 | global/generic store | 
|  | 5260 | atomic/ | 
|  | 5261 | atomicrmw-no-return-value | 
|  | 5262 | with memory | 
|  | 5263 | ordering of seq_cst | 
|  | 5264 | and with equal or | 
|  | 5265 | wider sync scope. | 
|  | 5266 | (Note that seq_cst | 
|  | 5267 | fences have their | 
|  | 5268 | own s_waitcnt | 
|  | 5269 | vscnt(0) and so do | 
|  | 5270 | not need to be | 
|  | 5271 | considered.) | 
|  | 5272 | - Ensures any | 
|  | 5273 | preceding | 
|  | 5274 | sequentially | 
|  | 5275 | consistent global | 
|  | 5276 | memory instructions | 
|  | 5277 | have completed | 
|  | 5278 | before executing | 
|  | 5279 | this sequentially | 
|  | 5280 | consistent | 
|  | 5281 | instruction. This | 
|  | 5282 | prevents reordering | 
|  | 5283 | a seq_cst store | 
|  | 5284 | followed by a | 
|  | 5285 | seq_cst load. (Note | 
|  | 5286 | that seq_cst is | 
|  | 5287 | stronger than | 
|  | 5288 | acquire/release as | 
|  | 5289 | the reordering of | 
|  | 5290 | load acquire | 
|  | 5291 | followed by a store | 
|  | 5292 | release is | 
|  | 5293 | prevented by the | 
|  | 5294 | waitcnt of | 
|  | 5295 | the release, but | 
|  | 5296 | there is nothing | 
|  | 5297 | preventing a store | 
|  | 5298 | release followed by | 
|  | 5299 | load acquire from | 
|  | 5300 | completing out of | 
|  | 5301 | order.) | 
|  | 5302 |  | 
|  | 5303 | 2. *Following | 
|  | 5304 | instructions same as | 
|  | 5305 | corresponding load | 
|  | 5306 | atomic acquire, | 
|  | 5307 | except must generate | 
|  | 5308 | all instructions even | 
|  | 5309 | for OpenCL.* | 
|  | 5310 |  | 
|  | 5311 | load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) & | 
|  | 5312 | - system       - generic     vmcnt(0)                        vmcnt(0) & vscnt(0) | 
|  | 5313 |  | 
|  | 5314 | - Could be split into           - Could be split into | 
|  | 5315 | separate s_waitcnt              separate s_waitcnt | 
|  | 5316 | vmcnt(0)                        vmcnt(0), s_waitcnt | 
|  | 5317 | and s_waitcnt                   vscnt(0) and s_waitcnt | 
|  | 5318 | lgkmcnt(0) to allow             lgkmcnt(0) to allow | 
|  | 5319 | them to be                      them to be | 
|  | 5320 | independently moved             independently moved | 
|  | 5321 | according to the                according to the | 
|  | 5322 | following rules.                following rules. | 
|  | 5323 | - waitcnt lgkmcnt(0)            - waitcnt lgkmcnt(0) | 
|  | 5324 | must happen after               must happen after | 
|  | 5325 | preceding                       preceding | 
|  | 5326 | global/generic load             local load | 
|  | 5327 | atomic/store                    atomic/store | 
|  | 5328 | atomic/atomicrmw                atomic/atomicrmw | 
|  | 5329 | with memory                     with memory | 
|  | 5330 | ordering of seq_cst             ordering of seq_cst | 
|  | 5331 | and with equal or               and with equal or | 
|  | 5332 | wider sync scope.               wider sync scope. | 
|  | 5333 | (Note that seq_cst              (Note that seq_cst | 
|  | 5334 | fences have their               fences have their | 
|  | 5335 | own s_waitcnt                   own s_waitcnt | 
|  | 5336 | lgkmcnt(0) and so do            lgkmcnt(0) and so do | 
|  | 5337 | not need to be                  not need to be | 
|  | 5338 | considered.)                    considered.) | 
|  | 5339 | - waitcnt vmcnt(0)              - waitcnt vmcnt(0) | 
|  | 5340 | must happen after               must happen after | 
|  | 5341 | preceding                       preceding | 
|  | 5342 | global/generic load             global/generic load | 
|  | 5343 | atomic/store                    atomic/ | 
|  | 5344 | atomic/atomicrmw                atomicrmw-with-return-value | 
|  | 5345 | with memory                     with memory | 
|  | 5346 | ordering of seq_cst             ordering of seq_cst | 
|  | 5347 | and with equal or               and with equal or | 
|  | 5348 | wider sync scope.               wider sync scope. | 
|  | 5349 | (Note that seq_cst              (Note that seq_cst | 
|  | 5350 | fences have their               fences have their | 
|  | 5351 | own s_waitcnt                   own s_waitcnt | 
|  | 5352 | vmcnt(0) and so do              vmcnt(0) and so do | 
|  | 5353 | not need to be                  not need to be | 
|  | 5354 | considered.)                    considered.) | 
|  | 5355 | - waitcnt vscnt(0) | 
|  | 5356 | must happen after | 
|  | 5357 | preceding | 
|  | 5358 | global/generic store | 
|  | 5359 | atomic/ | 
|  | 5360 | atomicrmw-no-return-value | 
|  | 5361 | with memory | 
|  | 5362 | ordering of seq_cst | 
|  | 5363 | and with equal or | 
|  | 5364 | wider sync scope. | 
|  | 5365 | (Note that seq_cst | 
|  | 5366 | fences have their | 
|  | 5367 | own s_waitcnt | 
|  | 5368 | vscnt(0) and so do | 
|  | 5369 | not need to be | 
|  | 5370 | considered.) | 
|  | 5371 | - Ensures any                   - Ensures any | 
|  | 5372 | preceding                       preceding | 
|  | 5373 | sequentially                    sequentially | 
|  | 5374 | consistent global               consistent global | 
|  | 5375 | memory instructions             memory instructions | 
|  | 5376 | have completed                  have completed | 
|  | 5377 | before executing                before executing | 
|  | 5378 | this sequentially               this sequentially | 
|  | 5379 | consistent                      consistent | 
|  | 5380 | instruction. This               instruction. This | 
|  | 5381 | prevents reordering             prevents reordering | 
|  | 5382 | a seq_cst store                 a seq_cst store | 
|  | 5383 | followed by a                   followed by a | 
|  | 5384 | seq_cst load. (Note             seq_cst load. (Note | 
|  | 5385 | that seq_cst is                 that seq_cst is | 
|  | 5386 | stronger than                   stronger than | 
|  | 5387 | acquire/release as              acquire/release as | 
|  | 5388 | the reordering of               the reordering of | 
|  | 5389 | load acquire                    load acquire | 
|  | 5390 | followed by a store             followed by a store | 
|  | 5391 | release is                      release is | 
|  | 5392 | prevented by the                prevented by the | 
|  | 5393 | waitcnt of                      waitcnt of | 
|  | 5394 | the release, but                the release, but | 
|  | 5395 | there is nothing                there is nothing | 
|  | 5396 | preventing a store              preventing a store | 
|  | 5397 | release followed by             release followed by | 
|  | 5398 | load acquire from               load acquire from | 
|  | 5399 | completing out of               completing out of |
|  | 5400 | order.)                         order.) | 
|  | 5401 |  | 
|  | 5402 | 2. *Following                   2. *Following | 
|  | 5403 | instructions same as            instructions same as | 
|  | 5404 | corresponding load              corresponding load | 
|  | 5405 | atomic acquire,                 atomic acquire, | 
|  | 5406 | except must generate            except must generate |
|  | 5407 | all instructions even           all instructions even | 
|  | 5408 | for OpenCL.*                    for OpenCL.* | 
|  | 5409 | store atomic seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding | 
|  | 5410 | - wavefront    - local    store atomic release,           store atomic release, | 
|  | 5411 | - workgroup    - generic  except must generate            except must generate |
|  | 5412 | all instructions even           all instructions even | 
|  | 5413 | for OpenCL.*                    for OpenCL.* | 
|  | 5414 | store atomic seq_cst      - agent        - global   *Same as corresponding          *Same as corresponding | 
|  | 5415 | - system       - generic  store atomic release,           store atomic release, | 
|  | 5416 | except must generate            except must generate |
|  | 5417 | all instructions even           all instructions even | 
|  | 5418 | for OpenCL.*                    for OpenCL.* | 
|  | 5419 | atomicrmw    seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding | 
|  | 5420 | - wavefront    - local    atomicrmw acq_rel,              atomicrmw acq_rel, | 
|  | 5421 | - workgroup    - generic  except must generate            except must generate |
|  | 5422 | all instructions even           all instructions even | 
|  | 5423 | for OpenCL.*                    for OpenCL.* | 
|  | 5424 | atomicrmw    seq_cst      - agent        - global   *Same as corresponding          *Same as corresponding | 
|  | 5425 | - system       - generic  atomicrmw acq_rel,              atomicrmw acq_rel, | 
|  | 5426 | except must generate            except must generate |
|  | 5427 | all instructions even           all instructions even | 
|  | 5428 | for OpenCL.*                    for OpenCL.* | 
|  | 5429 | fence        seq_cst      - singlethread *none*     *Same as corresponding          *Same as corresponding | 
|  | 5430 | - wavefront               fence acq_rel,                  fence acq_rel, | 
|  | 5431 | - workgroup               except must generate            except must generate |
|  | 5432 | - agent                   all instructions even           all instructions even | 
|  | 5433 | - system                  for OpenCL.*                    for OpenCL.* | 
|  | 5434 | ============ ============ ============== ========== =============================== ================================== | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 5435 |  | 
The memory order also adds the single thread optimization constraints defined in
table
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table`.

.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX10
   :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table

   ============ ==============================================================
   LLVM Memory  Optimization Constraints
   Ordering
   ============ ==============================================================
   unordered    *none*
   monotonic    *none*
   acquire      - If a load atomic/atomicrmw then no following load/load
                  atomic/store/store atomic/atomicrmw/fence instruction can
                  be moved before the acquire.
                - If a fence then same as load atomic, plus no preceding
                  associated fence-paired-atomic can be moved after the fence.
   release      - If a store atomic/atomicrmw then no preceding load/load
                  atomic/store/store atomic/atomicrmw/fence instruction can
                  be moved after the release.
                - If a fence then same as store atomic, plus no following
                  associated fence-paired-atomic can be moved before the
                  fence.
   acq_rel      Same constraints as both acquire and release.
   seq_cst      - If a load atomic then same constraints as acquire, plus no
                  preceding sequentially consistent load atomic/store
                  atomic/atomicrmw/fence instruction can be moved after the
                  seq_cst.
                - If a store atomic then the same constraints as release, plus
                  no following sequentially consistent load atomic/store
                  atomic/atomicrmw/fence instruction can be moved before the
                  seq_cst.
                - If an atomicrmw/fence then same constraints as acq_rel.
   ============ ==============================================================

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
runtimes (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that
supports the ``s_trap`` instruction with the following usage:

.. table:: AMDGPU Trap Handler for AMDHSA OS
   :name: amdgpu-trap-handler-for-amdhsa-os-table

   =================== =============== =============== =======================
   Usage               Code Sequence   Trap Handler    Description
                                       Inputs
   =================== =============== =============== =======================
   reserved            ``s_trap 0x00``                 Reserved by hardware.
   ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
                                         ``queue_ptr`` ``debugtrap``
                                       ``VGPR0``:      intrinsic (not
                                         ``arg``       implemented).
   ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
                                         ``queue_ptr`` terminated and its
                                                       associated queue put
                                                       into the error state.
   ``llvm.debugtrap``  ``s_trap 0x03``                 - If debugger not
                                                         installed then
                                                         behaves as a
                                                         no-operation. The
                                                         trap handler is
                                                         entered and
                                                         immediately returns
                                                         to continue
                                                         execution of the
                                                         wavefront.
                                                       - If the debugger is
                                                         installed, causes
                                                         the debug trap to be
                                                         reported by the
                                                         debugger and the
                                                         wavefront is put in
                                                         the halt state until
                                                         resumed by the
                                                         debugger.
   reserved            ``s_trap 0x04``                 Reserved.
   reserved            ``s_trap 0x05``                 Reserved.
   reserved            ``s_trap 0x06``                 Reserved.
   debugger breakpoint ``s_trap 0x07``                 Reserved for debugger
                                                       breakpoints.
   reserved            ``s_trap 0x08``                 Reserved.
   reserved            ``s_trap 0xfe``                 Reserved.
   reserved            ``s_trap 0xff``                 Reserved.
   =================== =============== =============== =======================

AMDPAL
------

This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
from the application/runtime to each invocation of a hardware shader. These
parameters include both generic, application-controlled parameters called
*user data* as well as system-generated parameters that are a product of the
draw or dispatch execution.

User Data
~~~~~~~~~

Each hardware stage has a set of 32-bit *user data registers* which can be
written from a command buffer and then loaded into SGPRs when waves are launched
via a subsequent dispatch or draw operation. This is the way most arguments are
passed from the application/runtime to a hardware shader.

Compute User Data
~~~~~~~~~~~~~~~~~

Compute shader user data mappings are simpler than those of graphics shaders,
and the mapping is fixed.

Note that there are always 10 available *user data entries* in registers;
entries beyond that limit must be fetched from memory (via the spill table
pointer) by the shader.

.. table:: PAL Compute Shader User Data Registers
   :name: pal-compute-user-data-registers

   ============= ================================
   User Register Description
   ============= ================================
   0             Global Internal Table (32-bit pointer)
   1             Per-Shader Internal Table (32-bit pointer)
   2 - 11        Application-Controlled User Data (10 32-bit values)
   12            Spill Table (32-bit pointer)
   13 - 14       Thread Group Count (64-bit pointer)
   15            GDS Range
   ============= ================================

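The fixed compute mapping can be illustrated with a small Python sketch that
reports where a given application user data entry would be found. The function
name ``locate_compute_user_data`` and its return shape are hypothetical, and the
sketch assumes that entry *i* (for *i* < 10) occupies user register 2 + *i*, as
the register table suggests. It is not taken from any PAL or LLVM API.

```python
# Illustrative only: names are hypothetical, numbers come from the
# "PAL Compute Shader User Data Registers" table.
FIRST_APP_USER_REGISTER = 2     # user registers 2 - 11 hold application data
NUM_APP_USER_REGISTERS = 10     # ten 32-bit values are kept in registers
SPILL_TABLE_USER_REGISTER = 12  # 32-bit pointer to the spill table


def locate_compute_user_data(entry):
    """Return ("register", n) or ("spill", n) for a user data entry index.

    For a spilled entry, n is the user register holding the spill table
    pointer that the shader must dereference.
    """
    if entry < 0:
        raise ValueError("user data entry must be non-negative")
    if entry < NUM_APP_USER_REGISTERS:
        # Assumed: entries are assigned to registers in order.
        return ("register", FIRST_APP_USER_REGISTER + entry)
    return ("spill", SPILL_TABLE_USER_REGISTER)
```

For example, entry 9 is the last one held in a register (user register 11),
while entry 10 has to be loaded through the spill table pointer.
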
Graphics User Data
~~~~~~~~~~~~~~~~~~

Graphics pipelines support a much more flexible user data mapping:

.. table:: PAL Graphics Shader User Data Registers
   :name: pal-graphics-user-data-registers

   ============= ================================
   User Register Description
   ============= ================================
   0             Global Internal Table (32-bit pointer)
   +             Per-Shader Internal Table (32-bit pointer)
   + 1-15        Application Controlled User Data
                 (1-15 Contiguous 32-bit Values in Registers)
   +             Spill Table (32-bit pointer)
   +             Draw Index (First Stage Only)
   +             Vertex Offset (First Stage Only)
   +             Instance Offset (First Stage Only)
   ============= ================================

The placement of the global internal table remains fixed in the first *user
data SGPR register*. Otherwise all parameters are optional, and can be mapped
to any desired *user data SGPR register*, with the following restrictions:

* Draw Index, Vertex Offset, and Instance Offset can only be used by the first
  active hardware stage in a graphics pipeline (i.e. where the API vertex
  shader runs).

* Application-controlled user data must be mapped into a contiguous range of
  user data registers.

* The application-controlled user data range supports compaction remapping, so
  only *entries* that are actually consumed by the shader must be assigned to
  corresponding *registers*. Note that in order to support an efficient runtime
  implementation, the remapping must pack *registers* in the same order as
  *entries*, with unused *entries* removed.

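The compaction rule in the last restriction can be sketched in Python as
follows. This is a minimal illustration under stated assumptions, not PAL code:
the function name ``compact_user_data``, its parameters, and the choice to raise
an error when more than 15 entries are consumed are all hypothetical.

```python
def compact_user_data(used_entries, first_register, max_registers=15):
    """Assign consumed user data entries to a contiguous register range.

    Registers are packed in the same order as the entries, with unused
    entries simply left out, as the restriction above requires.
    """
    ordered = sorted(set(used_entries))
    if len(ordered) > max_registers:
        # Assumed simplification: a real mapping would spill the excess.
        raise ValueError("too many entries for the register range")
    return {entry: first_register + slot for slot, entry in enumerate(ordered)}
```

With ``compact_user_data([7, 0, 3], first_register=2)`` the shader consumes
entries 0, 3 and 7, which are packed into registers 2, 3 and 4 in that order.
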
.. _pal_global_internal_table:

Global Internal Table
~~~~~~~~~~~~~~~~~~~~~

The global internal table is a table of *shader resource descriptors* (SRDs) that
define how certain engine-wide, runtime-managed resources should be accessed
from a shader. The majority of these resources have HW-defined formats, and it
is up to the compiler to write/read data as required by the target hardware.

The following table illustrates the required format:

.. table:: PAL Global Internal Table
   :name: pal-git-table

   ============= ================================
   Offset        Description
   ============= ================================
   0-3           Graphics Scratch SRD
   4-7           Compute Scratch SRD
   8-11          ES/GS Ring Output SRD
   12-15         ES/GS Ring Input SRD
   16-19         GS/VS Ring Output #0
   20-23         GS/VS Ring Output #1
   24-27         GS/VS Ring Output #2
   28-31         GS/VS Ring Output #3
   32-35         GS/VS Ring Input SRD
   36-39         Tessellation Factor Buffer SRD
   40-43         Off-Chip LDS Buffer SRD
   44-47         Off-Chip Param Cache Buffer SRD
   48-51         Sample Position Buffer SRD
   52            vaRange::ShadowDescriptorTable High Bits
   ============= ================================

The pointer to the global internal table passed to the shader as user data
is a 32-bit pointer. The top 32 bits should be assumed to be the same as
the top 32 bits of the pipeline, so the shader may use the program
counter's top 32 bits.

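The pointer extension described above amounts to simple bit arithmetic. The
Python sketch below shows it for illustration; the function name
``extend_git_pointer`` and its parameters are hypothetical, and in a real shader
the same operation would be done with scalar instructions rather than in Python.

```python
MASK32 = 0xFFFFFFFF


def extend_git_pointer(git_ptr32, program_counter):
    """Build a 64-bit table address from the 32-bit user data pointer.

    The high half is borrowed from the program counter, on the stated
    assumption that the table shares the pipeline's top 32 address bits.
    """
    pc_high = (program_counter >> 32) & MASK32
    return (pc_high << 32) | (git_ptr32 & MASK32)
```

For instance, a program counter of ``0x00007FFF12345678`` and a table pointer of
``0x1000`` combine to the 64-bit address ``0x00007FFF00001000``.
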
Unspecified OS
--------------

This section provides code conventions used when the target triple OS is
empty (see :ref:`amdgpu-target-triples`).

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` intrinsics are handled as follows:

.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
   :name: amdgpu-trap-handler-for-non-amdhsa-os-table

   =============== =============== ===========================================
   Usage           Code Sequence   Description
   =============== =============== ===========================================
   llvm.trap       s_endpgm        Causes wavefront to be terminated.
   llvm.debugtrap  *none*          Compiler warning given that there is no
                                   trap handler installed.
   =============== =============== ===========================================

Source Languages
================

.. _amdgpu-opencl:

OpenCL
------

When the language is OpenCL the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-code-object-metadata`).

.. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
   :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

   ======== ==== ========= ===========================================
   Position Byte Byte      Description
            Size Alignment
   ======== ==== ========= ===========================================
   1        8    8         OpenCL Global Offset X
   2        8    8         OpenCL Global Offset Y
   3        8    8         OpenCL Global Offset Z
   4        8    8         OpenCL address of printf buffer
   5        8    8         OpenCL address of virtual queue used by
                           enqueue_kernel.
   6        8    8         OpenCL address of AqlWrap struct used by
                           enqueue_kernel.
   ======== ==== ========= ===========================================

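To make the table concrete, the following Python sketch computes byte offsets
for the appended arguments. It assumes that the implicit arguments are laid out
immediately after the explicit ones, each rounded up to its listed alignment;
that layout rule, the function names, and the short argument labels are
illustrative assumptions rather than something this section specifies.

```python
# (label, byte size, byte alignment); labels are hypothetical shorthand
# for the rows of the implicit argument table.
IMPLICIT_ARGS = [
    ("global_offset_x", 8, 8),
    ("global_offset_y", 8, 8),
    ("global_offset_z", 8, 8),
    ("printf_buffer", 8, 8),
    ("virtual_queue", 8, 8),
    ("aqlwrap", 8, 8),
]


def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def implicit_arg_offsets(explicit_args_size):
    """Return a dict of label -> byte offset within the kernarg segment."""
    offsets = {}
    offset = explicit_args_size
    for label, size, alignment in IMPLICIT_ARGS:
        offset = align_up(offset, alignment)
        offsets[label] = offset
        offset += size
    return offsets
```

Under these assumptions, a kernel with 12 bytes of explicit arguments would see
the global offset X at byte 16 and the AqlWrap pointer at byte 56.
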
.. _amdgpu-hcc:

HCC
---

When the language is HCC the following differences occur:

1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).

.. _amdgpu-assembler:

Assembler
---------

The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX10.

This section describes the general syntax for instructions and operands.

Instructions
~~~~~~~~~~~~

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation

.. TODO
   AMDGPUAsmGFX10

An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:

``<``\ *opcode*\ ``>    <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``

:doc:`Operands<AMDGPUOperandSyntax>` are normally comma-separated while
:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.

The order of *operands* and *modifiers* is fixed.
Most *modifiers* are optional and may be omitted.

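The general shape of an instruction line can be illustrated with a deliberately
naive Python splitter. It is only a sketch of the opcode/operands/modifiers
layout described above, not how the LLVM-MC assembler parses input: the function
name ``split_instruction`` is hypothetical, and the sketch does not handle
modifiers that themselves contain commas (such as ``quad_perm:[0,2,1,1]``) or
expression-style operands.

```python
def split_instruction(text):
    """Split an instruction line into (opcode, operands, modifiers)."""
    parts = text.split(None, 1)
    opcode = parts[0]
    if len(parts) == 1:
        # Instructions such as buffer_wbinvl1 take no operands at all.
        return opcode, [], []
    # Operands are comma-separated; modifiers trail the last operand and
    # are separated by whitespace.
    pieces = [piece.strip() for piece in parts[1].split(",")]
    tail = pieces[-1].split()
    operands = pieces[:-1] + tail[:1]
    modifiers = tail[1:]
    return opcode, operands, modifiers
```

Applied to ``ds_add_u32 v2, v4 offset:16``, the sketch yields the opcode
``ds_add_u32``, the operands ``v2`` and ``v4``, and the single modifier
``offset:16``.
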
See the detailed instruction syntax description for :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`,
:doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`.

Note that features under development are not included in this description.

For more information about instructions, their semantics and supported combinations of
operands, refer to one of the instruction set architecture manuals
[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, [AMD-GCN-GFX9]_ and
[AMD-GCN-GFX10]_.

Operands
~~~~~~~~

A detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.

Modifiers
~~~~~~~~~

A detailed description of modifiers may be found :doc:`here<AMDGPUModifierSyntax>`.

Instruction Examples
~~~~~~~~~~~~~~~~~~~~

DS
++

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For the full list of supported instructions, refer to "LDS/GDS instructions" in the ISA Manual.

FLAT
++++

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For the full list of supported instructions, refer to "FLAT instructions" in the ISA Manual.

MUBUF
+++++

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For the full list of supported instructions, refer to "MUBUF Instructions" in the ISA Manual.

SMRD/SMEM
+++++++++

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For the full list of supported instructions, refer to "Scalar Memory Operations" in the ISA Manual.

SOP1
++++

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For the full list of supported instructions, refer to "SOP1 Instructions" in the ISA Manual.

SOP2
++++

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For the full list of supported instructions, refer to "SOP2 Instructions" in the ISA Manual.

SOPC
++++

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For the full list of supported instructions, refer to "SOPC Instructions" in the ISA Manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0                                  ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0)  ; Equivalent to above
  s_waitcnt vmcnt(1)                           ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For the full list of supported instructions, refer to "SOPP Instructions" in the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 5889 | VALU | 
|  | 5890 | ++++ | 
| Tom Stellard | 45bb48e | 2015-06-13 03:28:10 +0000 | [diff] [blame] | 5891 |  | 
| Nikolay Haustov | 96a56bd | 2016-09-20 09:04:51 +0000 | [diff] [blame] | 5892 | For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA), | 
|  | 5893 | the assembler will automatically use optimal encoding based on its operands. | 
|  | 5894 | To force specific encoding, one can add a suffix to the opcode of the instruction: | 
|  | 5895 |  | 
|  | 5896 | * ``_e32`` for 32-bit VOP1/VOP2/VOPC | 
|  | 5897 | * ``_e64`` for 64-bit VOP3 | 
|  | 5898 | * ``_dpp`` for VOP_DPP | 
|  | 5899 | * ``_sdwa`` for VOP_SDWA | 
|  | 5900 |  | 
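|  |  | As a hedged illustration, the sketch below writes one opcode with each suffix. The | 
|  |  | register numbers and modifier values are arbitrary choices made for this example, and | 
|  |  | the DPP and SDWA forms assume a target that supports those encodings: | 
|  |  |  | 
|  |  | .. code-block:: nasm | 
|  |  |  | 
|  |  |   v_add_f32 v1, v2, v3       ; no suffix: the assembler chooses the encoding | 
|  |  |   v_add_f32_e32 v1, v2, v3   ; force the 32-bit VOP2 encoding | 
|  |  |   v_add_f32_e64 v1, v2, v3   ; force the 64-bit VOP3 encoding | 
|  |  |   v_mov_b32_dpp v0, v0 quad_perm:[0,2,1,1] row_mask:0xf bank_mask:0xf | 
|  |  |   v_mov_b32_sdwa v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD | 
|  |  |  | 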
|  | 5901 | VOP1/VOP2/VOP3/VOPC examples: | 
| Tom Stellard | 45bb48e | 2015-06-13 03:28:10 +0000 | [diff] [blame] | 5902 |  | 
|  | 5903 | .. code-block:: nasm | 
|  | 5904 |  | 
| Nikolay Haustov | 96a56bd | 2016-09-20 09:04:51 +0000 | [diff] [blame] | 5905 | v_mov_b32 v1, v2 | 
|  | 5906 | v_mov_b32_e32 v1, v2 | 
|  | 5907 | v_nop | 
|  | 5908 | v_cvt_f64_i32_e32 v[1:2], v2 | 
|  | 5909 | v_floor_f32_e32 v1, v2 | 
|  | 5910 | v_bfrev_b32_e32 v1, v2 | 
|  | 5911 | v_add_f32_e32 v1, v2, v3 | 
|  | 5912 | v_mul_i32_i24_e64 v1, v2, 3 | 
|  | 5913 | v_mul_i32_i24_e32 v1, -3, v3 | 
|  | 5914 | v_mul_i32_i24_e32 v1, -100, v3 | 
|  | 5915 | v_addc_u32 v1, s[0:1], v2, v3, s[2:3] | 
|  | 5916 | v_max_f16_e32 v1, v2, v3 | 
| Tom Stellard | 45bb48e | 2015-06-13 03:28:10 +0000 | [diff] [blame] | 5917 |  | 
| Nikolay Haustov | 96a56bd | 2016-09-20 09:04:51 +0000 | [diff] [blame] | 5918 | VOP_DPP examples: | 
| Tom Stellard | 45bb48e | 2015-06-13 03:28:10 +0000 | [diff] [blame] | 5919 |  | 
|  | 5920 | .. code-block:: nasm | 
|  | 5921 |  | 
| Nikolay Haustov | 96a56bd | 2016-09-20 09:04:51 +0000 | [diff] [blame] | 5922 | v_mov_b32 v0, v0 quad_perm:[0,2,1,1] | 
|  | 5923 | v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 | 
|  | 5924 | v_mov_b32 v0, v0 wave_shl:1 | 
|  | 5925 | v_mov_b32 v0, v0 row_mirror | 
|  | 5926 | v_mov_b32 v0, v0 row_bcast:31 | 
|  | 5927 | v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0 | 
|  | 5928 | v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 | 
|  | 5929 | v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 | 
| Tom Stellard | 347ac79 | 2015-06-26 21:15:07 +0000 | [diff] [blame] | 5930 |  | 
| Nikolay Haustov | 96a56bd | 2016-09-20 09:04:51 +0000 | [diff] [blame] | 5931 | VOP_SDWA examples: | 
|  | 5932 |  | 
|  | 5933 | .. code-block:: nasm | 
|  | 5934 |  | 
|  | 5935 | v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD | 
|  | 5936 | v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD | 
|  | 5937 | v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1 | 
|  | 5938 | v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 | 
|  | 5939 | v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0 | 
|  | 5940 |  | 
|  | 5941 | For a full list of supported instructions, refer to "Vector ALU instructions" in the ISA Manual. | 
|  | 5942 |  | 
| Konstantin Zhuravlyov | dd6b05c | 2018-06-22 19:23:18 +0000 | [diff] [blame] | 5943 | .. TODO | 
|  | 5944 | Remove once we switch to code object v3 by default. | 
|  | 5945 |  | 
| Scott Linder | ac20b74 | 2019-03-28 15:08:52 +0000 | [diff] [blame] | 5946 | .. _amdgpu-amdhsa-assembler-predefined-symbols-v2: | 
|  | 5947 |  | 
|  | 5948 | Code Object V2 Predefined Symbols (-mattr=-code-object-v3) | 
|  | 5949 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  | 5950 |  | 
|  | 5951 | .. warning:: Code Object V2 is not the default code object version emitted by | 
|  | 5952 | this version of LLVM. For a description of the predefined symbols available | 
|  | 5953 | with the default configuration (Code Object V3) see | 
|  | 5954 | :ref:`amdgpu-amdhsa-assembler-predefined-symbols-v3`. | 
|  | 5955 |  | 
|  | 5956 | The AMDGPU assembler defines and updates some symbols automatically. These | 
|  | 5957 | symbols do not affect code generation. | 
|  | 5958 |  | 
|  | 5959 | .option.machine_version_major | 
|  | 5960 | +++++++++++++++++++++++++++++ | 
|  | 5961 |  | 
|  | 5962 | Set to the GFX major generation number of the target being assembled for. For | 
|  | 5963 | example, when assembling for a "GFX9" target this will be set to the integer | 
|  | 5964 | value "9". The possible GFX major generation numbers are presented in | 
|  | 5965 | :ref:`amdgpu-processors`. | 
|  | 5966 |  | 
|  | 5967 | .option.machine_version_minor | 
|  | 5968 | +++++++++++++++++++++++++++++ | 
|  | 5969 |  | 
|  | 5970 | Set to the GFX minor generation number of the target being assembled for. For | 
|  | 5971 | example, when assembling for a "GFX810" target this will be set to the integer | 
|  | 5972 | value "1". The possible GFX minor generation numbers are presented in | 
|  | 5973 | :ref:`amdgpu-processors`. | 
|  | 5974 |  | 
|  | 5975 | .option.machine_version_stepping | 
|  | 5976 | ++++++++++++++++++++++++++++++++ | 
|  | 5977 |  | 
|  | 5978 | Set to the GFX stepping generation number of the target being assembled for. | 
|  | 5979 | For example, when assembling for a "GFX704" target this will be set to the | 
|  | 5980 | integer value "4". The possible GFX stepping generation numbers are presented | 
|  | 5981 | in :ref:`amdgpu-processors`. | 
|  | 5982 |  | 
|  | 5983 | .kernel.vgpr_count | 
|  | 5984 | ++++++++++++++++++ | 
|  | 5985 |  | 
|  | 5986 | Set to zero each time a | 
|  | 5987 | :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is | 
|  | 5988 | encountered. At each instruction, if the current value of this symbol is less | 
|  | 5989 | than or equal to the maximum VGPR number explicitly referenced within that | 
|  | 5990 | instruction then the symbol value is updated to equal that VGPR number plus | 
|  | 5991 | one. | 
|  | 5992 |  | 
|  | 5993 | .kernel.sgpr_count | 
|  | 5994 | ++++++++++++++++++ | 
|  | 5995 |  | 
|  | 5996 | Set to zero each time a | 
|  | 5997 | :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is | 
|  | 5998 | encountered. At each instruction, if the current value of this symbol is less | 
|  | 5999 | than or equal to the maximum SGPR number explicitly referenced within that | 
|  | 6000 | instruction then the symbol value is updated to equal that SGPR number plus | 
|  | 6001 | one. | 
|  | 6002 |  | 
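|  |  | The update rule for ``.kernel.vgpr_count`` and ``.kernel.sgpr_count`` can be traced by | 
|  |  | hand. In the sketch below, ``my_kernel`` is a hypothetical label and the comments are | 
|  |  | annotations derived from the rule described above, not assembler output: | 
|  |  |  | 
|  |  | .. code-block:: nasm | 
|  |  |  | 
|  |  |   .amdgpu_hsa_kernel my_kernel   ; both counters are reset to 0 here | 
|  |  |   my_kernel: | 
|  |  |   v_mov_b32 v3, 0                ; highest VGPR is v3: .kernel.vgpr_count becomes 4 | 
|  |  |   s_mov_b32 s7, 0                ; highest SGPR is s7: .kernel.sgpr_count becomes 8 | 
|  |  |   v_mov_b32 v1, v3               ; counters unchanged, since 4 > 3 | 
|  |  |   v_cvt_f64_i32 v[4:5], v1       ; highest VGPR is v5: .kernel.vgpr_count becomes 6 | 
|  |  |  | 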
|  | 6003 | .. _amdgpu-amdhsa-assembler-directives-v2: | 
|  | 6004 |  | 
|  | 6005 | Code Object V2 Directives (-mattr=-code-object-v3) | 
|  | 6006 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  | 6007 |  | 
|  | 6008 | .. warning:: Code Object V2 is not the default code object version emitted by | 
|  | 6009 | this version of LLVM. For a description of the directives supported with | 
|  | 6010 | the default configuration (Code Object V3) see | 
|  | 6011 | :ref:`amdgpu-amdhsa-assembler-directives-v3`. | 
| Konstantin Zhuravlyov | dd6b05c | 2018-06-22 19:23:18 +0000 | [diff] [blame] | 6012 |  | 
|  | 6013 | The AMDGPU ABI defines auxiliary data in the output code object. In assembly source, | 
|  | 6014 | this data can be specified with assembler directives. | 
|  | 6015 |  | 
|  | 6016 | .hsa_code_object_version major, minor | 
|  | 6017 | +++++++++++++++++++++++++++++++++++++ | 
|  | 6018 |  | 
|  | 6019 | *major* and *minor* are integers that specify the version of the HSA code | 
|  | 6020 | object that will be generated by the assembler. | 
|  | 6021 |  | 
|  | 6022 | .hsa_code_object_isa [major, minor, stepping, vendor, arch] | 
|  | 6023 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | 
|  | 6024 |  | 
|  | 6025 |  | 
|  | 6026 | *major*, *minor*, and *stepping* are all integers that describe the instruction | 
|  | 6027 | set architecture (ISA) version of the assembly program. | 
|  | 6028 |  | 
|  | 6029 | *vendor* and *arch* are quoted strings.  *vendor* should always be equal to | 
|  | 6030 | "AMD" and *arch* should always be equal to "AMDGPU". | 
|  | 6031 |  | 
|  | 6032 | By default, the assembler will derive the ISA version, *vendor*, and *arch* | 
|  | 6033 | from the value of the -mcpu option that is passed to the assembler. | 
|  | 6034 |  | 
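|  |  | For illustration, a sketch of the explicit form of both directives follows. The version | 
|  |  | numbers are example values chosen here (an ISA version of 8.0.3 is assumed); they | 
|  |  | should agree with the target selected by the -mcpu option: | 
|  |  |  | 
|  |  | .. code-block:: nasm | 
|  |  |  | 
|  |  |   .hsa_code_object_version 2,1 | 
|  |  |   .hsa_code_object_isa 8,0,3,"AMD","AMDGPU" | 
|  |  |  | 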
| Scott Linder | ac20b74 | 2019-03-28 15:08:52 +0000 | [diff] [blame] | 6035 | .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel: | 
|  | 6036 |  | 
| Konstantin Zhuravlyov | dd6b05c | 2018-06-22 19:23:18 +0000 | [diff] [blame] | 6037 | .amdgpu_hsa_kernel (name) | 
|  | 6038 | +++++++++++++++++++++++++ | 
|  | 6039 |  | 
|  | 6040 | This directive specifies that the symbol with the given name is a kernel entry point | 
|  | 6041 | (label) and that the object should contain a corresponding symbol of type STT_AMDGPU_HSA_KERNEL. | 
|  | 6042 |  | 
|  | 6043 | .amd_kernel_code_t | 
|  | 6044 | ++++++++++++++++++ | 
|  | 6045 |  | 
|  | 6046 | This directive marks the beginning of a list of key / value pairs that are used | 
|  | 6047 | to specify the amd_kernel_code_t object that will be emitted by the assembler. | 
|  | 6048 | The list must be terminated by the *.end_amd_kernel_code_t* directive.  For | 
|  | 6049 | any amd_kernel_code_t values that are unspecified a default value will be | 
|  | 6050 | used.  The default value for all keys is 0, with the following exceptions: | 
|  | 6051 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 6052 | - *amd_code_version_major* defaults to 1. | 
|  | 6053 | - *amd_kernel_code_version_minor* defaults to 2. | 
|  | 6054 | - *amd_machine_kind* defaults to 1. | 
|  | 6055 | - *amd_machine_version_major*, *amd_machine_version_minor*, and | 
|  | 6056 | *amd_machine_version_stepping* are derived from the value of the -mcpu option | 
| Konstantin Zhuravlyov | dd6b05c | 2018-06-22 19:23:18 +0000 | [diff] [blame] | 6057 | that is passed to the assembler. | 
|  | 6058 | - *kernel_code_entry_byte_offset* defaults to 256. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 6059 | - *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10 onwards it | 
|  | 6060 | defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5. | 
|  | 6061 | Note that wavefront size is specified as a power of two, so a value of **n** | 
|  | 6062 | means a size of 2^ **n**. | 
|  | 6063 | - *call_convention* defaults to -1. | 
| Konstantin Zhuravlyov | dd6b05c | 2018-06-22 19:23:18 +0000 | [diff] [blame] | 6064 | - *kernarg_segment_alignment*, *group_segment_alignment*, and | 
|  | 6065 | *private_segment_alignment* default to 4. Note that alignments are specified | 
| Scott Linder | 8d5a36a | 2018-11-15 20:46:55 +0000 | [diff] [blame] | 6066 | as a power of 2, so a value of **n** means an alignment of 2^ **n**. | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 6067 | - *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for | 
|  | 6068 | GFX10 onwards. | 
|  | 6069 | - *enable_mem_ordered* defaults to 1 for GFX10 onwards. | 
| Konstantin Zhuravlyov | dd6b05c | 2018-06-22 19:23:18 +0000 | [diff] [blame] | 6070 |  | 
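|  |  | Because the size and alignment keys are stored as exponents, it can help to see them | 
|  |  | written out. The following sketch sets these keys explicitly to the default values | 
|  |  | listed above for a wavefront size of 64; the comments give the resulting quantities: | 
|  |  |  | 
|  |  | .. code-block:: nasm | 
|  |  |  | 
|  |  |   .amd_kernel_code_t | 
|  |  |   wavefront_size = 6              ; 2^6 = 64 work-items per wavefront | 
|  |  |   kernarg_segment_alignment = 4   ; 2^4 = 16-byte alignment | 
|  |  |   group_segment_alignment = 4     ; 2^4 = 16-byte alignment | 
|  |  |   private_segment_alignment = 4   ; 2^4 = 16-byte alignment | 
|  |  |   .end_amd_kernel_code_t | 
|  |  |  | 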
|  | 6071 | The *.amd_kernel_code_t* directive must be placed immediately after the | 
|  | 6072 | function label and before any instructions. | 
|  | 6073 |  | 
|  | 6074 | For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document, the | 
|  | 6075 | comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and test/CodeGen/AMDGPU/hsa.s. | 
|  | 6076 |  | 
| Scott Linder | ac20b74 | 2019-03-28 15:08:52 +0000 | [diff] [blame] | 6077 | .. _amdgpu-amdhsa-assembler-example-v2: | 
|  | 6078 |  | 
|  | 6079 | Code Object V2 Example Source Code (-mattr=-code-object-v3) | 
|  | 6080 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  | 6081 |  | 
|  | 6082 | .. warning:: Code Object V2 is not the default code object version emitted by | 
|  | 6083 | this version of LLVM. For example source code that uses | 
|  | 6084 | the default configuration (Code Object V3) see | 
|  | 6085 | :ref:`amdgpu-amdhsa-assembler-example-v3`. | 
|  | 6086 |  | 
|  | 6087 | Here is an example of a minimal assembly source file, defining one HSA kernel: | 
| Konstantin Zhuravlyov | dd6b05c | 2018-06-22 19:23:18 +0000 | [diff] [blame] | 6088 |  | 
|  | 6089 | .. code-block:: none | 
|  | 6090 |  | 
|  | 6091 | .hsa_code_object_version 1,0 | 
|  | 6092 | .hsa_code_object_isa | 
|  | 6093 |  | 
|  | 6094 | .hsatext | 
|  | 6095 | .globl  hello_world | 
|  | 6096 | .p2align 8 | 
|  | 6097 | .amdgpu_hsa_kernel hello_world | 
|  | 6098 |  | 
|  | 6099 | hello_world: | 
|  | 6100 |  | 
|  | 6101 | .amd_kernel_code_t | 
|  | 6102 | enable_sgpr_kernarg_segment_ptr = 1 | 
|  | 6103 | is_ptr64 = 1 | 
|  | 6104 | compute_pgm_rsrc1_vgprs = 0 | 
|  | 6105 | compute_pgm_rsrc1_sgprs = 0 | 
|  | 6106 | compute_pgm_rsrc2_user_sgpr = 2 | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 6107 | compute_pgm_rsrc1_wgp_mode = 0 | 
|  | 6108 | compute_pgm_rsrc1_mem_ordered = 0 | 
|  | 6109 | compute_pgm_rsrc1_fwd_progress = 1 | 
| Konstantin Zhuravlyov | dd6b05c | 2018-06-22 19:23:18 +0000 | [diff] [blame] | 6110 | .end_amd_kernel_code_t | 
|  | 6111 |  | 
|  | 6112 | s_load_dwordx2 s[0:1], s[0:1] 0x0 | 
|  | 6113 | v_mov_b32 v0, 3.14159 | 
|  | 6114 | s_waitcnt lgkmcnt(0) | 
|  | 6115 | v_mov_b32 v1, s0 | 
|  | 6116 | v_mov_b32 v2, s1 | 
|  | 6117 | flat_store_dword v[1:2], v0 | 
|  | 6118 | s_endpgm | 
|  | 6119 | .Lfunc_end0: | 
|  | 6120 | .size   hello_world, .Lfunc_end0-hello_world | 
|  | 6121 |  | 
| Scott Linder | ac20b74 | 2019-03-28 15:08:52 +0000 | [diff] [blame] | 6122 | .. _amdgpu-amdhsa-assembler-predefined-symbols-v3: | 
|  | 6123 |  | 
|  | 6124 | Code Object V3 Predefined Symbols (-mattr=+code-object-v3) | 
|  | 6125 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
| Nikolay Haustov | 96a56bd | 2016-09-20 09:04:51 +0000 | [diff] [blame] | 6126 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6127 | The AMDGPU assembler defines and updates some symbols automatically. These | 
|  | 6128 | symbols do not affect code generation. | 
| Tom Stellard | 347ac79 | 2015-06-26 21:15:07 +0000 | [diff] [blame] | 6129 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6130 | .amdgcn.gfx_generation_number | 
|  | 6131 | +++++++++++++++++++++++++++++ | 
| Tom Stellard | 347ac79 | 2015-06-26 21:15:07 +0000 | [diff] [blame] | 6132 |  | 
| Dmitry Preobrazhensky | 62a0318 | 2019-02-08 13:51:31 +0000 | [diff] [blame] | 6133 | Set to the GFX major generation number of the target being assembled for. For | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6134 | example, when assembling for a "GFX9" target this will be set to the integer | 
| Dmitry Preobrazhensky | 62a0318 | 2019-02-08 13:51:31 +0000 | [diff] [blame] | 6135 | value "9". The possible GFX major generation numbers are presented in | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6136 | :ref:`amdgpu-processors`. | 
| Tom Stellard | 347ac79 | 2015-06-26 21:15:07 +0000 | [diff] [blame] | 6137 |  | 
| Dmitry Preobrazhensky | 62a0318 | 2019-02-08 13:51:31 +0000 | [diff] [blame] | 6138 | .amdgcn.gfx_generation_minor | 
|  | 6139 | ++++++++++++++++++++++++++++ | 
|  | 6140 |  | 
|  | 6141 | Set to the GFX minor generation number of the target being assembled for. For | 
|  | 6142 | example, when assembling for a "GFX810" target this will be set to the integer | 
|  | 6143 | value "1". The possible GFX minor generation numbers are presented in | 
|  | 6144 | :ref:`amdgpu-processors`. | 
|  | 6145 |  | 
|  | 6146 | .amdgcn.gfx_generation_stepping | 
|  | 6147 | +++++++++++++++++++++++++++++++ | 
|  | 6148 |  | 
|  | 6149 | Set to the GFX stepping generation number of the target being assembled for. | 
|  | 6150 | For example, when assembling for a "GFX704" target this will be set to the | 
|  | 6151 | integer value "4". The possible GFX stepping generation numbers are presented | 
|  | 6152 | in :ref:`amdgpu-processors`. | 
|  | 6153 |  | 
| Scott Linder | 0bc9f15 | 2019-03-29 17:49:51 +0000 | [diff] [blame] | 6154 | .. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr: | 
|  | 6155 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6156 | .amdgcn.next_free_vgpr | 
|  | 6157 | ++++++++++++++++++++++ | 
| Tony Tye | f16a45e | 2017-06-06 20:31:59 +0000 | [diff] [blame] | 6158 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6159 | Set to zero before assembly begins. At each instruction, if the current value | 
|  | 6160 | of this symbol is less than or equal to the maximum VGPR number explicitly | 
|  | 6161 | referenced within that instruction then the symbol value is updated to equal | 
|  | 6162 | that VGPR number plus one. | 
| Tom Stellard | 347ac79 | 2015-06-26 21:15:07 +0000 | [diff] [blame] | 6163 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6164 | May be used to set the ``.amdhsa_next_free_vgpr`` directive in | 
|  | 6165 | :ref:`amdhsa-kernel-directives-table`. | 
| Tom Stellard | 347ac79 | 2015-06-26 21:15:07 +0000 | [diff] [blame] | 6166 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6167 | May be set at any time, e.g. manually set to zero at the start of each kernel. | 
| Tom Stellard | 347ac79 | 2015-06-26 21:15:07 +0000 | [diff] [blame] | 6168 |  | 
| Scott Linder | 0bc9f15 | 2019-03-29 17:49:51 +0000 | [diff] [blame] | 6169 | .. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr: | 
|  | 6170 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6171 | .amdgcn.next_free_sgpr | 
|  | 6172 | ++++++++++++++++++++++ | 
| Tom Stellard | 347ac79 | 2015-06-26 21:15:07 +0000 | [diff] [blame] | 6173 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6174 | Set to zero before assembly begins. At each instruction, if the current value | 
|  | 6175 | of this symbol is less than or equal to the maximum SGPR number explicitly | 
|  | 6176 | referenced within that instruction then the symbol value is updated to equal | 
|  | 6177 | that SGPR number plus one. | 
| Nikolay Haustov | 96a56bd | 2016-09-20 09:04:51 +0000 | [diff] [blame] | 6178 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6179 | May be used to set the ``.amdhsa_next_free_sgpr`` directive in | 
|  | 6180 | :ref:`amdhsa-kernel-directives-table`. | 
| Tom Stellard | ff7416b | 2015-06-26 21:58:31 +0000 | [diff] [blame] | 6181 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6182 | May be set at any time, e.g. manually set to zero at the start of each kernel. | 
| Tom Stellard | ff7416b | 2015-06-26 21:58:31 +0000 | [diff] [blame] | 6183 |  | 
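|  |  | The two ``next_free`` symbols are typically used together. The sketch below, with a | 
|  |  | hypothetical kernel named ``my_kernel``, resets both symbols before the kernel body and | 
|  |  | then passes their final values to the kernel descriptor directives. The ``.set`` form of | 
|  |  | the manual reset is an assumption of this example, and the comments are annotations | 
|  |  | derived from the rule described above, not assembler output: | 
|  |  |  | 
|  |  | .. code-block:: nasm | 
|  |  |  | 
|  |  |   .set .amdgcn.next_free_vgpr, 0   ; manual reset before the kernel body | 
|  |  |   .set .amdgcn.next_free_sgpr, 0 | 
|  |  |  | 
|  |  |   .text | 
|  |  |   my_kernel: | 
|  |  |     v_mov_b32 v2, 0                ; .amdgcn.next_free_vgpr becomes 3 | 
|  |  |     s_mov_b32 s5, 0                ; .amdgcn.next_free_sgpr becomes 6 | 
|  |  |     s_endpgm | 
|  |  |  | 
|  |  |   .rodata | 
|  |  |   .p2align 6 | 
|  |  |   .amdhsa_kernel my_kernel | 
|  |  |     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr | 
|  |  |     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr | 
|  |  |   .end_amdhsa_kernel | 
|  |  |  | 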
| Scott Linder | ac20b74 | 2019-03-28 15:08:52 +0000 | [diff] [blame] | 6184 | .. _amdgpu-amdhsa-assembler-directives-v3: | 
|  | 6185 |  | 
|  | 6186 | Code Object V3 Directives (-mattr=+code-object-v3) | 
|  | 6187 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
| Tom Stellard | ff7416b | 2015-06-26 21:58:31 +0000 | [diff] [blame] | 6188 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6189 | Directives which begin with ``.amdgcn`` are valid for all ``amdgcn`` | 
|  | 6190 | architecture processors, and are not OS-specific. Directives which begin with | 
|  | 6191 | ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the | 
|  | 6192 | ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and | 
|  | 6193 | :ref:`amdgpu-processors`. | 
| Tom Stellard | ff7416b | 2015-06-26 21:58:31 +0000 | [diff] [blame] | 6194 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6195 | .amdgcn_target <target> | 
|  | 6196 | +++++++++++++++++++++++ | 
| Tom Stellard | ff7416b | 2015-06-26 21:58:31 +0000 | [diff] [blame] | 6197 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6198 | Optional directive which declares the target supported by the containing | 
|  | 6199 | assembler source file. Valid values are described in | 
|  | 6200 | :ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler | 
|  | 6201 | to validate command-line options such as ``-triple``, ``-mcpu``, and those | 
|  | 6202 | which specify target features. | 
| Tom Stellard | ff7416b | 2015-06-26 21:58:31 +0000 | [diff] [blame] | 6203 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6204 | .amdhsa_kernel <name> | 
|  | 6205 | +++++++++++++++++++++ | 
| Tom Stellard | ff7416b | 2015-06-26 21:58:31 +0000 | [diff] [blame] | 6206 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6207 | Creates a correctly aligned AMDHSA kernel descriptor and a symbol, | 
|  | 6208 | ``<name>.kd``, in the current location of the current section. Only valid when | 
|  | 6209 | the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first | 
|  | 6210 | instruction to execute, and does not need to be previously defined. | 
| Tom Stellard | ff7416b | 2015-06-26 21:58:31 +0000 | [diff] [blame] | 6211 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6212 | Marks the beginning of a list of directives used to generate the bytes of a | 
|  | 6213 | kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`. | 
|  | 6214 | Directives which may appear in this list are described in | 
|  | 6215 | :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must | 
|  | 6216 | be valid for the target being assembled for, and cannot be repeated. Directives | 
|  | 6217 | support the range of values specified by the field they reference in | 
|  | 6218 | :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is | 
|  | 6219 | assumed to have its default value, unless it is marked as "Required", in which | 
|  | 6220 | case it is an error to omit the directive. This list of directives is | 
|  | 6221 | terminated by an ``.end_amdhsa_kernel`` directive. | 
| Tom Stellard | ff7416b | 2015-06-26 21:58:31 +0000 | [diff] [blame] | 6222 |  | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6223 | .. table:: AMDHSA Kernel Assembler Directives | 
|  | 6224 | :name: amdhsa-kernel-directives-table | 
| Tom Stellard | ff7416b | 2015-06-26 21:58:31 +0000 | [diff] [blame] | 6225 |  | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 6226 | ======================================================== =================== ============ =================== | 
|  | 6227 | Directive                                                Default             Supported On Description | 
|  | 6228 | ======================================================== =================== ============ =================== | 
|  | 6229 | ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX10   Controls GROUP_SEGMENT_FIXED_SIZE in | 
|  | 6230 | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. | 
|  | 6231 | ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX10   Controls PRIVATE_SEGMENT_FIXED_SIZE in | 
|  | 6232 | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. | 
|  | 6233 | ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in | 
|  | 6234 | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. | 
|  | 6235 | ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_PTR in | 
|  | 6236 | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. | 
|  | 6237 | ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX10   Controls ENABLE_SGPR_QUEUE_PTR in | 
|  | 6238 | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. | 
|  | 6239 | ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX10   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in | 
|  | 6240 | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. | 
|  | 6241 | ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_ID in | 
|  | 6242 | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. | 
|  | 6243 | ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in | 
|  | 6244 | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. | 
|  | 6245 | ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in | 
|  | 6246 | :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. | 
|  | 6247 | ``.amdhsa_wavefront_size32``                             Target              GFX10        Controls ENABLE_WAVEFRONT_SIZE32 in | 
|  | 6248 | Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. | 
|  | 6249 | Specific | 
|  | 6250 | (-wavefrontsize64) | 
|  | 6251 | ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in | 
|  | 6252 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. | 
|  | 6253 | ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_X in | 
|  | 6254 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. | 
|  | 6255 | ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Y in | 
|  | 6256 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. | 
|  | 6257 | ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Z in | 
|  | 6258 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. | 
|  | 6259 | ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_INFO in | 
|  | 6260 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. | 
|  | 6261 | ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX10   Controls ENABLE_VGPR_WORKITEM_ID in | 
|  | 6262 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. | 
|  | 6263 | Possible values are defined in | 
|  | 6264 | :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`. | 
|  | 6265 | ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX10   Maximum VGPR number explicitly referenced, plus one. | 
|  | 6266 | Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in | 
|  | 6267 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. | 
|  | 6268 | ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX10   Maximum SGPR number explicitly referenced, plus one. | 
|  | 6269 | Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in | 
|  | 6270 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. | 
|  | 6271 | ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX10   Whether the kernel may use the special VCC SGPR. | 
|  | 6272 | Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in | 
|  | 6273 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. | 
|  | 6274 | ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access | 
|  | 6275 | scratch memory. Used to calculate | 
|  | 6276 | GRANULATED_WAVEFRONT_SGPR_COUNT in | 
|  | 6277 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. | 
|  | 6278 | ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay. | 
|  | 6279 | Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in | 
|  | 6280 | Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. | 
| Scott Linder | 1e8c2c7 | 2018-06-21 19:38:56 +0000 | [diff] [blame] | 6281 | (+xnack) | 
| Stanislav Mekhanoshin | 4336a94 | 2019-06-13 22:18:47 +0000 | [diff] [blame] | 6282 | ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_32 in | 
|  | 6283 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. | 
|  | 6284 | Possible values are defined in | 
|  | 6285 | :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. | 
|  | 6286 | ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_16_64 in | 
|  | 6287 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. | 
|  | 6288 | Possible values are defined in | 
|  | 6289 | :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. | 
|  | 6290 | ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_32 in | 
|  | 6291 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. | 
|  | 6292 | Possible values are defined in | 
|  | 6293 | :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. | 
|  | 6294 | ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_16_64 in | 
|  | 6295 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. | 
|  | 6296 | Possible values are defined in | 
|  | 6297 | :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. | 
|  | 6298 | ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX10   Controls ENABLE_DX10_CLAMP in | 
|  | 6299 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. | 
|  | 6300 | ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX10   Controls ENABLE_IEEE_MODE in | 
|  | 6301 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. | 
|  | 6302 | ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX10   Controls FP16_OVFL in | 
|  | 6303 | :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. | 
     ``.amdhsa_workgroup_processor_mode``                     Target              GFX10        Controls ENABLE_WGP_MODE in
                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
                                                              Specific
                                                              (-cumode)
     ``.amdhsa_memory_ordered``                               1                   GFX10        Controls MEM_ORDERED in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_forward_progress``                             0                   GFX10        Controls FWD_PROGRESS in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ======================================================== =================== ============ ===================

.amdgpu_metadata
++++++++++++++++

Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
note record (see :ref:`amdgpu-elf-note-records-table-v3`).

The contents must be in the [YAML]_ markup format, with the same structure and
semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`.

This directive is terminated by an ``.end_amdgpu_metadata`` directive.

.. _amdgpu-amdhsa-assembler-example-v3:

Code Object V3 Example Source Code (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code-block:: none

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  .text
  .globl hello_world
  .p2align 8
  .type hello_world,@function
  hello_world:
    s_load_dwordx2 s[0:1], s[0:1], 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
    .size   hello_world, .Lfunc_end0-hello_world

  .rodata
  .p2align 6
  .amdhsa_kernel hello_world
    .amdhsa_user_sgpr_kernarg_segment_ptr 1
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

  .amdgpu_metadata
  ---
  amdhsa.version:
    - 1
    - 0
  amdhsa.kernels:
    - .name: hello_world
      .symbol: hello_world.kd
      .kernarg_segment_size: 48
      .group_segment_fixed_size: 0
      .private_segment_fixed_size: 0
      .kernarg_segment_align: 4
      .wavefront_size: 64
      .sgpr_count: 2
      .vgpr_count: 3
      .max_flat_workgroup_size: 256
  ...
  .end_amdgpu_metadata

If an assembly source file contains multiple kernels and/or functions, the
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
the ``.set <symbol>, <expression>`` directive. For example, in the case of two
kernels, ``kern0`` and ``kern1``, where ``func1`` is only called from ``kern1``,
it is sufficient to group the function with the kernel that calls it and reset
the symbols between the two connected components:

.. code-block:: none

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  // GPR tracking symbols are implicitly set to zero

  .text
  .globl kern0
  .p2align 8
  .type kern0,@function
  kern0:
    // ...
    s_endpgm
  .Lkern0_end:
    .size   kern0, .Lkern0_end-kern0

  .rodata
  .p2align 6
  .amdhsa_kernel kern0
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

  // reset symbols to begin tracking usage in func1 and kern1
  .set .amdgcn.next_free_vgpr, 0
  .set .amdgcn.next_free_sgpr, 0

  .text
  .hidden func1
  .global func1
  .p2align 2
  .type func1,@function
  func1:
    // ...
    s_setpc_b64 s[30:31]
  .Lfunc1_end:
    .size func1, .Lfunc1_end-func1

  .globl kern1
  .p2align 8
  .type kern1,@function
  kern1:
    // ...
    s_getpc_b64 s[4:5]
    s_add_u32 s4, s4, func1@rel32@lo+4
    s_addc_u32 s5, s5, func1@rel32@hi+12
    s_swappc_b64 s[30:31], s[4:5]
    // ...
    s_endpgm
  .Lkern1_end:
    .size   kern1, .Lkern1_end-kern1

  .rodata
  .p2align 6
  .amdhsa_kernel kern1
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

The assembler cannot identify connected components of the call graph, so these
symbols do not automatically track the usage of each kernel. However, careful
organization of the kernels and functions in the source file can often minimize
the additional effort required to calculate GPR usage accurately.

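The tracking behavior that makes the reset necessary can be pictured with a
small model. The following Python sketch is not part of the assembler. It
assumes, as a simplification, that each symbol holds one more than the highest
register index referenced since the symbol was last set, and it ignores special
registers such as VCC. The ``GprTracker`` class, its methods, and the
instruction lists are hypothetical and exist only for illustration:

.. code-block:: python

   import re

   # Matches v3, s0, v[1:2], s[30:31]; mnemonics such as v_mov_b32 do not match.
   _REG = re.compile(r"\b([vs])(?:\[(\d+):(\d+)\]|(\d+))")


   class GprTracker:
       """Toy model of .amdgcn.next_free_vgpr / .amdgcn.next_free_sgpr."""

       def __init__(self):
           self.reset()

       def reset(self):
           # Models `.set .amdgcn.next_free_vgpr, 0` and the SGPR equivalent.
           self.next_free = {"v": 0, "s": 0}

       def assemble(self, instruction):
           # Raise each symbol to (highest register index referenced) + 1.
           for kind, _lo, hi, single in _REG.findall(instruction):
               top = int(hi or single)
               self.next_free[kind] = max(self.next_free[kind], top + 1)

       def snapshot(self):
           # Models reading the symbols inside an .amdhsa_kernel block.
           return dict(self.next_free)


   tracker = GprTracker()

   # First connected component: kern0.
   for insn in ["v_mov_b32 v15, 0", "s_mov_b32 s9, 0", "s_endpgm"]:
       tracker.assemble(insn)
   kern0_usage = tracker.snapshot()

   # Second connected component: func1 and kern1, tracked from zero again.
   tracker.reset()
   for insn in ["v_add_u32 v3, v0, v1", "s_setpc_b64 s[30:31]"]:  # func1
       tracker.assemble(insn)
   for insn in ["s_getpc_b64 s[4:5]", "s_swappc_b64 s[30:31], s[4:5]", "s_endpgm"]:  # kern1
       tracker.assemble(insn)
   kern1_usage = tracker.snapshot()

   print(kern0_usage)
   print(kern1_usage)

In this model the snapshot for ``kern1`` covers both ``func1`` and ``kern1``,
because the function is placed between the reset and the kernel that calls it.
Without the call to ``reset``, the second snapshot would also inherit the higher
VGPR count of ``kern0``.
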
Additional Documentation
========================

.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-GCN-GFX10] AMD "Navi" Instruction Set Architecture *TBA*
.. TODO
   ttye Add link when made public.
.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__