blob: d0671b0a74e0c0998dcaea2eb0e4855fe7ee7180 [file] [log] [blame]
Tony Tyef16a45e2017-06-06 20:31:59 +00001=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
Tony Tye07d9f102017-11-10 01:00:54 +000026 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
Tony Tyef16a45e2017-06-06 20:31:59 +000028
Tony Tye07d9f102017-11-10 01:00:54 +000029 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000035
Tony Tye07d9f102017-11-10 01:00:54 +000036 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000038
Tony Tye07d9f102017-11-10 01:00:54 +000039 ============ ==============================================================
40 Vendor Description
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000045
Tony Tye07d9f102017-11-10 01:00:54 +000046 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
Tony Tyef16a45e2017-06-06 20:31:59 +000048
Tony Tye07d9f102017-11-10 01:00:54 +000049 ============== ============================================================
50 OS Description
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
56 runtime.
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
58 runtime.
59 ============== ============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000060
Tony Tye07d9f102017-11-10 01:00:54 +000061 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
Tony Tyef16a45e2017-06-06 20:31:59 +000063
Tony Tye07d9f102017-11-10 01:00:54 +000064 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
67 *<empty>* Defaults to ``opencl``.
68 ``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
Tony Tye07d9f102017-11-10 01:00:54 +000069 ``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
70 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000071
72.. _amdgpu-processors:
73
74Processors
75----------
76
77Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
78names from both the *Processor* and *Alternative Processor* can be used.
79
80 .. table:: AMDGPU Processors
Tony Tye07d9f102017-11-10 01:00:54 +000081 :name: amdgpu-processor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000082
Tony Tye31105cc2017-12-11 15:35:27 +000083 =========== =============== ============ ===== ========= ======= ==================
84 Processor Alternative Target dGPU/ Target ROCm Example
85 Processor Triple APU Features Support Products
86 Architecture Supported
87 [Default]
88 =========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000089 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
Tony Tye31105cc2017-12-11 15:35:27 +000090 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +000091 ``r600`` ``r600`` dGPU
92 ``r630`` ``r600`` dGPU
93 ``rs880`` ``r600`` dGPU
94 ``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000095 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
Tony Tye31105cc2017-12-11 15:35:27 +000096 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +000097 ``rv710`` ``r600`` dGPU
98 ``rv730`` ``r600`` dGPU
99 ``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000100 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000101 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000102 ``cedar`` ``r600`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000103 ``cypress`` ``r600`` dGPU
104 ``juniper`` ``r600`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000105 ``redwood`` ``r600`` dGPU
106 ``sumo`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000107 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000108 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000109 ``barts`` ``r600`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000110 ``caicos`` ``r600`` dGPU
111 ``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000112 ``turks`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000113 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
Tony Tye31105cc2017-12-11 15:35:27 +0000114 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000115 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000116 ``gfx601`` - ``hainan`` ``amdgcn`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000117 - ``oland``
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000118 - ``pitcairn``
119 - ``verde``
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000120 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
Tony Tye31105cc2017-12-11 15:35:27 +0000121 -----------------------------------------------------------------------------------
122 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
123 - A6 Pro-7050B
124 - A8-7100
125 - A8 Pro-7150B
126 - A10-7300
127 - A10 Pro-7350B
128 - FX-7500
129 - A8-7200P
130 - A10-7400P
131 - FX-7600P
132 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
133 - FirePro W9100
134 - FirePro S9150
135 - FirePro S9170
136 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
137 - Radeon R9 290x
138 - Radeon R390
139 - Radeon R390x
140 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
141 - ``mullins`` - E1-2200
142 - E1-2500
143 - E2-3000
144 - E2-3800
145 - A4-5000
146 - A4-5100
147 - A6-5200
148 - A4 Pro-3340B
149 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
150 - Radeon HD 8770
151 - R7 260
152 - R7 260X
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000153 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
Tony Tye31105cc2017-12-11 15:35:27 +0000154 -----------------------------------------------------------------------------------
Tony Tye31105cc2017-12-11 15:35:27 +0000155 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
156 [on] - Pro A6-8500B
157 - A8-8600P
158 - Pro A8-8600B
159 - FX-8800P
160 - Pro A12-8800B
161 \ ``amdgcn`` APU - xnack ROCm - A10-8700P
162 [on] - Pro A10-8700B
163 - A10-8780P
164 \ ``amdgcn`` APU - xnack - A10-9600P
165 [on] - A10-9630P
166 - A12-9700P
167 - A12-9730P
168 - FX-9800P
169 - FX-9830P
170 \ ``amdgcn`` APU - xnack - E2-9010
171 [on] - A6-9210
172 - A9-9410
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000173 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
174 - ``tonga`` [off] - FirePro S7100
Tony Tye31105cc2017-12-11 15:35:27 +0000175 - FirePro W7100
176 - Radeon R285
177 - Radeon R9 380
178 - Radeon R9 385
179 - Mobile FirePro
180 M7170
181 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
182 [off] - Radeon R9 Fury
183 - Radeon R9 FuryX
184 - Radeon Pro Duo
185 - FirePro S9300x2
186 - Radeon Instinct MI8
187 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
188 [off] - Radeon RX 480
189 - Radeon Instinct MI6
190 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
191 [off]
192 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
193 [on]
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000194 **GCN GFX9** [AMD-GCN-GFX9]_
Tony Tye31105cc2017-12-11 15:35:27 +0000195 -----------------------------------------------------------------------------------
196 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
197 [off] Frontier Edition
198 - Radeon RX Vega 56
199 - Radeon RX Vega 64
200 - Radeon RX Vega 64
201 Liquid
202 - Radeon Instinct MI25
203 ``gfx902`` ``amdgcn`` APU - xnack *TBA*
204 [on]
205 .. TODO
206 Add product
207 names.
208 =========== =============== ============ ===== ========= ======= ==================
Tony Tye07d9f102017-11-10 01:00:54 +0000209
210.. _amdgpu-target-features:
211
212Target Features
213---------------
214
215Target features control how code is generated to support certain
Tony Tye31105cc2017-12-11 15:35:27 +0000216processor specific features. Not all target features are supported by
217all processors. The runtime must ensure that the features supported by
218the device used to execute the code match the features enabled when
219generating the code. A mismatch of features may result in incorrect
220execution, or a reduction in performance.
221
222The target features supported by each processor, and the default value
223used if not specified explicitly, is listed in
224:ref:`amdgpu-processor-table`.
Tony Tye07d9f102017-11-10 01:00:54 +0000225
226Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
227target features.
228
229For example:
230
231``-mxnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000232 Enable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000233``-mno-xnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000234 Disable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000235
236 .. table:: AMDGPU Target Features
237 :name: amdgpu-target-feature-table
238
Tony Tye31105cc2017-12-11 15:35:27 +0000239 ============== ==================================================
240 Target Feature Description
241 ============== ==================================================
242 -m[no-]xnack Enable/disable generating code that has
243 memory clauses that are compatible with
244 having XNACK replay enabled.
Tony Tye07d9f102017-11-10 01:00:54 +0000245
Tony Tye31105cc2017-12-11 15:35:27 +0000246 This is used for demand paging and page
247 migration. If XNACK replay is enabled in
248 the device, then if a page fault occurs
249 the code may execute incorrectly if the
250 ``xnack`` feature is not enabled. Executing
251 code that has the feature enabled on a
252 device that does not have XNACK replay
253 enabled will execute correctly, but may
254 be less performant than code with the
255 feature disabled.
256 ============== ==================================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000257
258.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000259
260Address Spaces
261--------------
262
Tony Tyef16a45e2017-06-06 20:31:59 +0000263The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000264
Tony Tyef16a45e2017-06-06 20:31:59 +0000265The memory space names used in the table, aside from the region memory space, is
266from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000267
Tony Tyef16a45e2017-06-06 20:31:59 +0000268LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000269
Tony Tyef16a45e2017-06-06 20:31:59 +0000270 .. table:: Address Space Mapping
271 :name: amdgpu-address-space-mapping-table
272
Yaxun Liu0124b542018-02-13 18:00:25 +0000273 ================== =================
Tony Tyef16a45e2017-06-06 20:31:59 +0000274 LLVM Address Space Memory Space
Yaxun Liu0124b542018-02-13 18:00:25 +0000275 ================== =================
276 0 Generic (Flat)
277 1 Global
278 2 Region (GDS)
279 3 Local (group/LDS)
280 4 Constant
281 5 Private (Scratch)
282 6 Constant 32-bit
283 ================== =================
Tony Tyef16a45e2017-06-06 20:31:59 +0000284
285.. _amdgpu-memory-scopes:
286
287Memory Scopes
288-------------
289
290This section provides LLVM memory synchronization scopes supported by the AMDGPU
291backend memory model when the target triple OS is ``amdhsa`` (see
292:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
293
294The memory model supported is based on the HSA memory model [HSA]_ which is
295based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
296relation is transitive over the synchonizes-with relation independent of scope,
297and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye07d9f102017-11-10 01:00:54 +0000298table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000299
300This is different to the OpenCL [OpenCL]_ memory model which does not have scope
301inclusion and requires the memory scopes to exactly match. However, this
302is conservatively correct for OpenCL.
303
Tony Tye07d9f102017-11-10 01:00:54 +0000304 .. table:: AMDHSA LLVM Sync Scopes
305 :name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tyef16a45e2017-06-06 20:31:59 +0000306
307 ================ ==========================================================
308 LLVM Sync Scope Description
309 ================ ==========================================================
310 *none* The default: ``system``.
311
312 Synchronizes with, and participates in modification and
313 seq_cst total orderings with, other operations (except
314 image operations) for all address spaces (except private,
315 or generic that accesses private) provided the other
316 operation's sync scope is:
317
318 - ``system``.
319 - ``agent`` and executed by a thread on the same agent.
320 - ``workgroup`` and executed by a thread in the same
321 workgroup.
322 - ``wavefront`` and executed by a thread in the same
323 wavefront.
324
325 ``agent`` Synchronizes with, and participates in modification and
326 seq_cst total orderings with, other operations (except
327 image operations) for all address spaces (except private,
328 or generic that accesses private) provided the other
329 operation's sync scope is:
330
331 - ``system`` or ``agent`` and executed by a thread on the
332 same agent.
333 - ``workgroup`` and executed by a thread in the same
334 workgroup.
335 - ``wavefront`` and executed by a thread in the same
336 wavefront.
337
338 ``workgroup`` Synchronizes with, and participates in modification and
339 seq_cst total orderings with, other operations (except
340 image operations) for all address spaces (except private,
341 or generic that accesses private) provided the other
342 operation's sync scope is:
343
344 - ``system``, ``agent`` or ``workgroup`` and executed by a
345 thread in the same workgroup.
346 - ``wavefront`` and executed by a thread in the same
347 wavefront.
348
349 ``wavefront`` Synchronizes with, and participates in modification and
350 seq_cst total orderings with, other operations (except
351 image operations) for all address spaces (except private,
352 or generic that accesses private) provided the other
353 operation's sync scope is:
354
355 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
356 and executed by a thread in the same wavefront.
357
358 ``singlethread`` Only synchronizes with, and participates in modification
359 and seq_cst total orderings with, other operations (except
360 image operations) running in the same thread for all
361 address spaces (for example, in signal handlers).
362 ================ ==========================================================
363
364AMDGPU Intrinsics
365-----------------
366
367The AMDGPU backend implements the following intrinsics.
368
369*This section is WIP.*
370
371.. TODO
372 List AMDGPU intrinsics
373
374Code Object
375===========
376
377The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
378can be linked by ``lld`` to produce a standard ELF shared code object which can
379be loaded and executed on an AMDGPU target.
380
381Header
382------
383
384The AMDGPU backend uses the following ELF header:
385
386 .. table:: AMDGPU ELF Header
387 :name: amdgpu-elf-header-table
388
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000389 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000390 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000391 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000392 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
393 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye07d9f102017-11-10 01:00:54 +0000394 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
395 - ``ELFOSABI_AMDGPU_HSA``
396 - ``ELFOSABI_AMDGPU_PAL``
397 - ``ELFOSABI_AMDGPU_MESA3D``
398 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
399 - ``ELFABIVERSION_AMDGPU_PAL``
400 - ``ELFABIVERSION_AMDGPU_MESA3D``
401 ``e_type`` - ``ET_REL``
402 - ``ET_DYN``
Tony Tyef16a45e2017-06-06 20:31:59 +0000403 ``e_machine`` ``EM_AMDGPU``
404 ``e_entry`` 0
Tony Tye07d9f102017-11-10 01:00:54 +0000405 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000406 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000407
408..
409
410 .. table:: AMDGPU ELF Header Enumeration Values
411 :name: amdgpu-elf-header-enumeration-values-table
412
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000413 =============================== =====
414 Name Value
415 =============================== =====
416 ``EM_AMDGPU`` 224
Tony Tye07d9f102017-11-10 01:00:54 +0000417 ``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000418 ``ELFOSABI_AMDGPU_HSA`` 64
419 ``ELFOSABI_AMDGPU_PAL`` 65
420 ``ELFOSABI_AMDGPU_MESA3D`` 66
421 ``ELFABIVERSION_AMDGPU_HSA`` 1
422 ``ELFABIVERSION_AMDGPU_PAL`` 0
423 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
424 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000425
426``e_ident[EI_CLASS]``
Tony Tye07d9f102017-11-10 01:00:54 +0000427 The ELF class is:
428
429 * ``ELFCLASS32`` for ``r600`` architecture.
430
431 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
432 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000433
434``e_ident[EI_DATA]``
Tony Tye07d9f102017-11-10 01:00:54 +0000435 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tyef16a45e2017-06-06 20:31:59 +0000436
437``e_ident[EI_OSABI]``
Tony Tye07d9f102017-11-10 01:00:54 +0000438 One of the following AMD GPU architecture specific OS ABIs
439 (see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000440
Tony Tye07d9f102017-11-10 01:00:54 +0000441 * ``ELFOSABI_NONE`` for *unknown* OS.
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000442
Tony Tye07d9f102017-11-10 01:00:54 +0000443 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tyef16a45e2017-06-06 20:31:59 +0000444
Tony Tye07d9f102017-11-10 01:00:54 +0000445 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
446
447 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000448
Tony Tyef16a45e2017-06-06 20:31:59 +0000449``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000450 The ABI version of the AMD GPU architecture specific OS ABI to which the code
451 object conforms:
452
453 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
454 runtime ABI.
455
456 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
457 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000458
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000459 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye07d9f102017-11-10 01:00:54 +0000460 3D runtime ABI.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000461
Tony Tyef16a45e2017-06-06 20:31:59 +0000462``e_type``
463 Can be one of the following values:
464
465
466 ``ET_REL``
467 The type produced by the AMD GPU backend compiler as it is relocatable code
468 object.
469
470 ``ET_DYN``
471 The type produced by the linker as it is a shared code object.
472
473 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
474
475``e_machine``
Tony Tye07d9f102017-11-10 01:00:54 +0000476 The value ``EM_AMDGPU`` is used for the machine for all processors supported
477 by the ``r600`` and ``amdgcn`` architectures (see
478 :ref:`amdgpu-processor-table`). The specific processor is specified in the
479 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
480 :ref:`amdgpu-elf-header-e_flags-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000481
482``e_entry``
483 The entry point is 0 as the entry points for individual kernels must be
484 selected in order to invoke them through AQL packets.
485
486``e_flags``
Tony Tye07d9f102017-11-10 01:00:54 +0000487 The AMDGPU backend uses the following ELF header flags:
488
489 .. table:: AMDGPU ELF Header ``e_flags``
490 :name: amdgpu-elf-header-e_flags-table
491
492 ================================= ========== =============================
493 Name Value Description
494 ================================= ========== =============================
495 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
496 -------------------------------------------- -----------------------------
497 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
498 mask for
499 ``EF_AMDGPU_MACH_xxx`` values
500 defined in
501 :ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye31105cc2017-12-11 15:35:27 +0000502 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
503 target feature is
504 enabled for all code
505 contained in the code object.
Tony Tye5bbcca62018-03-08 05:46:01 +0000506 If the processor
507 does not support the
508 ``xnack`` target
509 feature then must
510 be 0.
Tony Tye31105cc2017-12-11 15:35:27 +0000511 See
512 :ref:`amdgpu-target-features`.
Tony Tye07d9f102017-11-10 01:00:54 +0000513 ================================= ========== =============================
514
515 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
516 :name: amdgpu-ef-amdgpu-mach-table
517
518 ================================= ========== =============================
519 Name Value Description (see
520 :ref:`amdgpu-processor-table`)
521 ================================= ========== =============================
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000522 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
523 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
524 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
525 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
526 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
527 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
528 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
529 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
530 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
531 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
532 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
533 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
534 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
535 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
536 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
537 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
538 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
539 *reserved* 0x011 - Reserved for ``r600``
540 0x01f architecture processors.
541 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
542 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
543 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
544 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
545 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
546 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
547 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
548 *reserved* 0x027 Reserved.
549 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
550 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
551 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
552 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
553 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
554 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
555 *reserved* 0x02e Reserved.
556 *reserved* 0x02f Reserved.
557 *reserved* 0x030 Reserved.
Tony Tye07d9f102017-11-10 01:00:54 +0000558 ================================= ========== =============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000559
560Sections
561--------
562
563An AMDGPU target ELF code object has the standard ELF sections which include:
564
565 .. table:: AMDGPU ELF Sections
566 :name: amdgpu-elf-sections-table
567
568 ================== ================ =================================
569 Name Type Attributes
570 ================== ================ =================================
571 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
572 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
573 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
574 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
575 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
576 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
577 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
578 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
579 ``.note`` ``SHT_NOTE`` *none*
580 ``.rela``\ *name* ``SHT_RELA`` *none*
581 ``.rela.dyn`` ``SHT_RELA`` *none*
582 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
583 ``.shstrtab`` ``SHT_STRTAB`` *none*
584 ``.strtab`` ``SHT_STRTAB`` *none*
585 ``.symtab`` ``SHT_SYMTAB`` *none*
586 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
587 ================== ================ =================================
588
589These sections have their standard meanings (see [ELF]_) and are only generated
590if needed.
591
592``.debug``\ *\**
593 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
594 DWARF produced by the AMDGPU backend.
595
Tony Tye46d35762017-08-15 20:47:41 +0000596``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000597 The standard sections used by a dynamic loader.
598
599``.note``
600 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
601 backend.
602
603``.rela``\ *name*, ``.rela.dyn``
604 For relocatable code objects, *name* is the name of the section that the
605 relocation records apply. For example, ``.rela.text`` is the section name for
606 relocation records associated with the ``.text`` section.
607
608 For linked shared code objects, ``.rela.dyn`` contains all the relocation
609 records from each of the relocatable code object's ``.rela``\ *name* sections.
610
611 See :ref:`amdgpu-relocation-records` for the relocation records supported by
612 the AMDGPU backend.
613
614``.text``
615 The executable machine code for the kernels and functions they call. Generated
616 as position independent code. See :ref:`amdgpu-code-conventions` for
617 information on conventions used in the isa generation.
618
619.. _amdgpu-note-records:
620
621Note Records
622------------
623
Tony Tye07d9f102017-11-10 01:00:54 +0000624As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
625be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
626aligned. In addition, minimal zero byte padding must be generated to ensure the
627``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
628``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tyef16a45e2017-06-06 20:31:59 +0000629
630The AMDGPU backend code object uses the following ELF note records in the
631``.note`` section. The *Description* column specifies the layout of the note
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000632record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tyef16a45e2017-06-06 20:31:59 +0000633variable size strings have a corresponding ``*_size`` field that specifies the
634number of bytes, including the terminating null character, in the string. The
635string(s) come immediately after the preceding fields.
636
637Additional note records can be present.
638
639 .. table:: AMDGPU ELF Note Records
640 :name: amdgpu-elf-note-records-table
641
Tony Tye46d35762017-08-15 20:47:41 +0000642 ===== ============================== ======================================
643 Name Type Description
644 ===== ============================== ======================================
645 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye46d35762017-08-15 20:47:41 +0000646 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000647
648..
649
650 .. table:: AMDGPU ELF Note Record Enumeration Values
651 :name: amdgpu-elf-note-record-enumeration-values-table
652
Tony Tye46d35762017-08-15 20:47:41 +0000653 ============================== =====
654 Name Value
655 ============================== =====
656 *reserved* 0-9
657 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye07d9f102017-11-10 01:00:54 +0000658 *reserved* 11
Tony Tye46d35762017-08-15 20:47:41 +0000659 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000660
Tony Tye46d35762017-08-15 20:47:41 +0000661``NT_AMD_AMDGPU_HSA_METADATA``
662 Specifies extensible metadata associated with the code objects executed on HSA
663 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
664 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
665 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
666 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000667
Tony Tye46d35762017-08-15 20:47:41 +0000668.. _amdgpu-symbols:
669
670Symbols
671-------
672
673Symbols include the following:
674
675 .. table:: AMDGPU ELF Symbols
676 :name: amdgpu-elf-symbols-table
677
678 ===================== ============== ============= ==================
679 Name Type Section Description
680 ===================== ============== ============= ==================
681 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
682 - ``.rodata``
683 - ``.bss``
684 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
685 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
686 ===================== ============== ============= ==================
687
688Global variable
689 Global variables both used and defined by the compilation unit.
690
691 If the symbol is defined in the compilation unit then it is allocated in the
692 appropriate section according to if it has initialized data or is readonly.
693
694 If the symbol is external then its section is ``STN_UNDEF`` and the loader
695 will resolve relocations using the definition provided by another code object
696 or explicitly defined by the runtime.
697
698 All global symbols, whether defined in the compilation unit or external, are
699 accessed by the machine code indirectly through a GOT table entry. This
700 allows them to be preemptable. The GOT table is only supported when the target
701 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000702
703 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000704 Add description of linked shared object symbols. Seems undefined symbols
705 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000706
Tony Tye46d35762017-08-15 20:47:41 +0000707Kernel descriptor
708 Every HSA kernel has an associated kernel descriptor. It is the address of the
709 kernel descriptor that is used in the AQL dispatch packet used to invoke the
710 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
711 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
712
713Kernel entry point
714 Every HSA kernel also has a symbol for its machine code entry point.
715
716.. _amdgpu-relocation-records:
717
718Relocation Records
719------------------
720
721AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
722relocatable fields are:
723
724``word32``
725 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
726 alignment. These values use the same byte order as other word values in the
727 AMD GPU architecture.
728
729``word64``
730 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
731 alignment. These values use the same byte order as other word values in the
732 AMD GPU architecture.
733
734Following notations are used for specifying relocation calculations:
735
736**A**
737 Represents the addend used to compute the value of the relocatable field.
738
739**G**
740 Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000741 entry's symbol will reside during execution.
Tony Tye46d35762017-08-15 20:47:41 +0000742
743**GOT**
744 Represents the address of the global offset table.
745
746**P**
747 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
748 of the storage unit being relocated (computed using ``r_offset``).
749
750**S**
751 Represents the value of the symbol whose index resides in the relocation
Tony Tyed2884302017-10-16 20:44:29 +0000752 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
753
754**B**
755 Represents the base address of a loaded executable or shared object which is
756 the difference between the ELF address and the actual load address. Relocations
757 using this are only valid in executable or shared objects.
Tony Tye46d35762017-08-15 20:47:41 +0000758
759The following relocation types are supported:
760
761 .. table:: AMDGPU ELF Relocation Records
762 :name: amdgpu-elf-relocation-records-table
763
Tony Tyedb6c9932018-01-30 23:59:43 +0000764 ========================== ======= ===== ========== ==============================
765 Relocation Type Kind Value Field Calculation
766 ========================== ======= ===== ========== ==============================
767 ``R_AMDGPU_NONE`` 0 *none* *none*
768 ``R_AMDGPU_ABS32_LO`` Dynamic 1 ``word32`` (S + A) & 0xFFFFFFFF
769 ``R_AMDGPU_ABS32_HI`` Dynamic 2 ``word32`` (S + A) >> 32
770 ``R_AMDGPU_ABS64`` Dynamic 3 ``word64`` S + A
771 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
772 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
773 ``R_AMDGPU_ABS32`` Static 6 ``word32`` S + A
774 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
775 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
776 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
777 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
778 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
779 *reserved* 12
780 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
781 ========================== ======= ===== ========== ==============================
Tony Tye46d35762017-08-15 20:47:41 +0000782
783.. _amdgpu-dwarf:
784
785DWARF
786-----
787
Scott Linder16c7bda2018-02-23 23:01:06 +0000788Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
Tony Tye46d35762017-08-15 20:47:41 +0000789information that maps the code object executable code and data to the source
790language constructs. It can be used by tools such as debuggers and profilers.
791
792Address Space Mapping
793~~~~~~~~~~~~~~~~~~~~~
794
795The following address space mapping is used:
796
797 .. table:: AMDGPU DWARF Address Space Mapping
798 :name: amdgpu-dwarf-address-space-mapping-table
799
800 =================== =================
801 DWARF Address Space Memory Space
802 =================== =================
803 1 Private (Scratch)
804 2 Local (group/LDS)
805 *omitted* Global
806 *omitted* Constant
807 *omitted* Generic (Flat)
808 *not supported* Region (GDS)
809 =================== =================
810
811See :ref:`amdgpu-address-spaces` for information on the memory space terminology
812used in the table.
813
814An ``address_class`` attribute is generated on pointer type DIEs to specify the
815DWARF address space of the value of the pointer when it is in the *private* or
816*local* address space. Otherwise the attribute is omitted.
817
818An ``XDEREF`` operation is generated in location list expressions for variables
819that are allocated in the *private* and *local* address space. Otherwise no
820``XDREF`` is omitted.
821
822Register Mapping
823~~~~~~~~~~~~~~~~
824
825*This section is WIP.*
826
827.. TODO
828 Define DWARF register enumeration.
829
830 If want to present a wavefront state then should expose vector registers as
831 64 wide (rather than per work-item view that LLVM uses). Either as separate
832 registers, or a 64x4 byte single register. In either case use a new LANE op
833 (akin to XDREF) to select the current lane usage in a location
834 expression. This would also allow scalar register spilling to vector register
835 lanes to be expressed (currently no debug information is being generated for
836 spilling). If choose a wide single register approach then use LANE in
837 conjunction with PIECE operation to select the dword part of the register for
838 the current lane. If the separate register approach then use LANE to select
839 the register.
840
841Source Text
842~~~~~~~~~~~
843
Scott Linder16c7bda2018-02-23 23:01:06 +0000844Source text for online-compiled programs (e.g. those compiled by the OpenCL
845runtime) may be embedded into the DWARF v5 line table using the ``clang
846-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye46d35762017-08-15 20:47:41 +0000847
Scott Linder16c7bda2018-02-23 23:01:06 +0000848For example:
849
850``-gembed-source``
851 Enable the embedded source DWARF v5 extension.
852``-gno-embed-source``
853 Disable the embedded source DWARF v5 extension.
854
855 .. table:: AMDGPU Debug Options
856 :name: amdgpu-debug-options
857
858 ==================== ==================================================
859 Debug Flag Description
860 ==================== ==================================================
861 -g[no-]embed-source Enable/disable embedding source text in DWARF
862 debug sections. Useful for environments where
863 source cannot be written to disk, such as
864 when performing online compilation.
865 ==================== ==================================================
866
867This option enables one extended content types in the DWARF v5 Line Number
868Program Header, which is used to encode embedded source.
869
870 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
871 :name: amdgpu-dwarf-extended-content-types
872
873 ============================ ======================
874 Content Type Form
875 ============================ ======================
876 ``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
877 ============================ ======================
878
879The source field will contain the UTF-8 encoded, null-terminated source text
880with ``'\n'`` line endings. When the source field is present, consumers can use
881the embedded source instead of attempting to discover the source on disk. When
882the source field is absent, consumers can access the file to get the source
883text.
884
885The above content type appears in the ``file_name_entry_format`` field of the
886line table prologue, and its corresponding value appear in the ``file_names``
887field. The current encoding of the content type is documented in table
888:ref:`amdgpu-dwarf-extended-content-types-encoding`
889
890 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
891 :name: amdgpu-dwarf-extended-content-types-encoding
892
893 ============================ ====================
894 Content Type Value
895 ============================ ====================
896 ``DW_LNCT_LLVM_source`` 0x2001
897 ============================ ====================
Tony Tye46d35762017-08-15 20:47:41 +0000898
899.. _amdgpu-code-conventions:
900
901Code Conventions
902================
903
904This section provides code conventions used for each supported target triple OS
905(see :ref:`amdgpu-target-triples`).
906
907AMDHSA
908------
909
910This section provides code conventions used when the target triple OS is
911``amdhsa`` (see :ref:`amdgpu-target-triples`).
912
913.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000914
915Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000916~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000917
Tony Tye46d35762017-08-15 20:47:41 +0000918The code object metadata specifies extensible metadata associated with the code
919objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
920[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
921(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
922``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
923information necessary to support the ROCM kernel queries. For example, the
924segment sizes needed in a dispatch packet. In addition, a high level language
925runtime may require other information to be included. For example, the AMD
926OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000927
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000928The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000929:doc:`YamlIO`).
930
Tony Tye46d35762017-08-15 20:47:41 +0000931.. TODO
932 Is the string null terminated? It probably should not if YAML allows it to
933 contain null characters, otherwise it should be.
934
Tony Tyef16a45e2017-06-06 20:31:59 +0000935The metadata is represented as a single YAML document comprised of the mapping
936defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
937referenced tables.
938
939For boolean values, the string values of ``false`` and ``true`` are used for
940false and true respectively.
941
942Additional information can be added to the mappings. To avoid conflicts, any
943non-AMD key names should be prefixed by "*vendor-name*.".
944
945 .. table:: AMDHSA Code Object Metadata Mapping
946 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
947
948 ========== ============== ========= =======================================
949 String Key Value Type Required? Description
950 ========== ============== ========= =======================================
951 "Version" sequence of Required - The first integer is the major
952 2 integers version. Currently 1.
953 - The second integer is the minor
954 version. Currently 0.
955 "Printf" sequence of Each string is encoded information
956 strings about a printf function call. The
957 encoded information is organized as
958 fields separated by colon (':'):
959
960 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
961
962 where:
963
964 ``ID``
965 A 32 bit integer as a unique id for
966 each printf function call
967
968 ``N``
969 A 32 bit integer equal to the number
970 of arguments of printf function call
971 minus 1
972
973 ``S[i]`` (where i = 0, 1, ... , N-1)
974 32 bit integers for the size in bytes
975 of the i-th FormatString argument of
976 the printf function call
977
978 FormatString
979 The format string passed to the
980 printf function call.
981 "Kernels" sequence of Required Sequence of the mappings for each
982 mapping kernel in the code object. See
983 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
984 for the definition of the mapping.
985 ========== ============== ========= =======================================
986
987..
988
989 .. table:: AMDHSA Code Object Kernel Metadata Mapping
990 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
991
992 ================= ============== ========= ================================
993 String Key Value Type Required? Description
994 ================= ============== ========= ================================
995 "Name" string Required Source name of the kernel.
996 "SymbolName" string Required Name of the kernel
997 descriptor ELF symbol.
998 "Language" string Source language of the kernel.
999 Values include:
1000
1001 - "OpenCL C"
1002 - "OpenCL C++"
1003 - "HCC"
1004 - "OpenMP"
1005
1006 "LanguageVersion" sequence of - The first integer is the major
1007 2 integers version.
1008 - The second integer is the
1009 minor version.
1010 "Attrs" mapping Mapping of kernel attributes.
1011 See
1012 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
1013 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001014 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001015 mapping kernel arguments. See
1016 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
1017 for the definition of the mapping.
1018 "CodeProps" mapping Mapping of properties related to
1019 the kernel code. See
1020 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
1021 for the mapping definition.
Tony Tyef16a45e2017-06-06 20:31:59 +00001022 ================= ============== ========= ================================
1023
1024..
1025
1026 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
1027 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
1028
1029 =================== ============== ========= ==============================
1030 String Key Value Type Required? Description
1031 =================== ============== ========= ==============================
Tony Tyee039d0e2018-01-30 23:07:10 +00001032 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
1033 3 integers must be >=1 and the dispatch
1034 work-group size X, Y, Z must
1035 correspond to the specified
1036 values. Defaults to 0, 0, 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001037
1038 Corresponds to the OpenCL
1039 ``reqd_work_group_size``
1040 attribute.
1041 "WorkGroupSizeHint" sequence of The dispatch work-group size
1042 3 integers X, Y, Z is likely to be the
1043 specified values.
1044
1045 Corresponds to the OpenCL
1046 ``work_group_size_hint``
1047 attribute.
1048 "VecTypeHint" string The name of a scalar or vector
1049 type.
1050
1051 Corresponds to the OpenCL
1052 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +00001053
1054 "RuntimeHandle" string The external symbol name
1055 associated with a kernel.
1056 OpenCL runtime allocates a
1057 global buffer for the symbol
1058 and saves the kernel's address
1059 to it, which is used for
1060 device side enqueueing. Only
1061 available for device side
1062 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +00001063 =================== ============== ========= ==============================
1064
1065..
1066
1067 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1068 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1069
1070 ================= ============== ========= ================================
1071 String Key Value Type Required? Description
1072 ================= ============== ========= ================================
1073 "Name" string Kernel argument name.
1074 "TypeName" string Kernel argument type name.
1075 "Size" integer Required Kernel argument size in bytes.
1076 "Align" integer Required Kernel argument alignment in
1077 bytes. Must be a power of two.
1078 "ValueKind" string Required Kernel argument kind that
1079 specifies how to set up the
1080 corresponding argument.
1081 Values include:
1082
1083 "ByValue"
1084 The argument is copied
1085 directly into the kernarg.
1086
1087 "GlobalBuffer"
1088 A global address space pointer
1089 to the buffer data is passed
1090 in the kernarg.
1091
1092 "DynamicSharedPointer"
1093 A group address space pointer
1094 to dynamically allocated LDS
1095 is passed in the kernarg.
1096
1097 "Sampler"
1098 A global address space
1099 pointer to a S# is passed in
1100 the kernarg.
1101
1102 "Image"
1103 A global address space
1104 pointer to a T# is passed in
1105 the kernarg.
1106
1107 "Pipe"
1108 A global address space pointer
1109 to an OpenCL pipe is passed in
1110 the kernarg.
1111
1112 "Queue"
1113 A global address space pointer
1114 to an OpenCL device enqueue
1115 queue is passed in the
1116 kernarg.
1117
1118 "HiddenGlobalOffsetX"
1119 The OpenCL grid dispatch
1120 global offset for the X
1121 dimension is passed in the
1122 kernarg.
1123
1124 "HiddenGlobalOffsetY"
1125 The OpenCL grid dispatch
1126 global offset for the Y
1127 dimension is passed in the
1128 kernarg.
1129
1130 "HiddenGlobalOffsetZ"
1131 The OpenCL grid dispatch
1132 global offset for the Z
1133 dimension is passed in the
1134 kernarg.
1135
1136 "HiddenNone"
1137 An argument that is not used
1138 by the kernel. Space needs to
1139 be left for it, but it does
1140 not need to be set up.
1141
1142 "HiddenPrintfBuffer"
1143 A global address space pointer
1144 to the runtime printf buffer
1145 is passed in kernarg.
1146
1147 "HiddenDefaultQueue"
1148 A global address space pointer
1149 to the OpenCL device enqueue
1150 queue that should be used by
1151 the kernel by default is
1152 passed in the kernarg.
1153
1154 "HiddenCompletionAction"
Yaxun Liuc928f2a2017-10-30 14:30:28 +00001155 A global address space pointer
1156 to help link enqueued kernels into
1157 the ancestor tree for determining
1158 when the parent kernel has finished.
Tony Tyef16a45e2017-06-06 20:31:59 +00001159
1160 "ValueType" string Required Kernel argument value type. Only
1161 present if "ValueKind" is
1162 "ByValue". For vector data
1163 types, the value is for the
1164 element type. Values include:
1165
1166 - "Struct"
1167 - "I8"
1168 - "U8"
1169 - "I16"
1170 - "U16"
1171 - "F16"
1172 - "I32"
1173 - "U32"
1174 - "F32"
1175 - "I64"
1176 - "U64"
1177 - "F64"
1178
1179 .. TODO
1180 How can it be determined if a
1181 vector type, and what size
1182 vector?
1183 "PointeeAlign" integer Alignment in bytes of pointee
1184 type for pointer type kernel
1185 argument. Must be a power
1186 of 2. Only present if
1187 "ValueKind" is
1188 "DynamicSharedPointer".
1189 "AddrSpaceQual" string Kernel argument address space
1190 qualifier. Only present if
1191 "ValueKind" is "GlobalBuffer" or
1192 "DynamicSharedPointer". Values
1193 are:
1194
1195 - "Private"
1196 - "Global"
1197 - "Constant"
1198 - "Local"
1199 - "Generic"
1200 - "Region"
1201
1202 .. TODO
1203 Is GlobalBuffer only Global
1204 or Constant? Is
1205 DynamicSharedPointer always
1206 Local? Can HCC allow Generic?
1207 How can Private or Region
1208 ever happen?
1209 "AccQual" string Kernel argument access
1210 qualifier. Only present if
1211 "ValueKind" is "Image" or
1212 "Pipe". Values
1213 are:
1214
1215 - "ReadOnly"
1216 - "WriteOnly"
1217 - "ReadWrite"
1218
1219 .. TODO
1220 Does this apply to
1221 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001222 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001223 performed by the kernel on the
1224 kernel argument. Only present if
1225 "ValueKind" is "GlobalBuffer",
1226 "Image", or "Pipe". This may be
1227 more restrictive than indicated
1228 by "AccQual" to reflect what the
1229 kernel actual does. If not
1230 present then the runtime must
1231 assume what is implied by
1232 "AccQual" and "IsConst". Values
1233 are:
1234
1235 - "ReadOnly"
1236 - "WriteOnly"
1237 - "ReadWrite"
1238
1239 "IsConst" boolean Indicates if the kernel argument
1240 is const qualified. Only present
1241 if "ValueKind" is
1242 "GlobalBuffer".
1243
1244 "IsRestrict" boolean Indicates if the kernel argument
1245 is restrict qualified. Only
1246 present if "ValueKind" is
1247 "GlobalBuffer".
1248
1249 "IsVolatile" boolean Indicates if the kernel argument
1250 is volatile qualified. Only
1251 present if "ValueKind" is
1252 "GlobalBuffer".
1253
1254 "IsPipe" boolean Indicates if the kernel argument
1255 is pipe qualified. Only present
1256 if "ValueKind" is "Pipe".
1257
1258 .. TODO
1259 Can GlobalBuffer be pipe
1260 qualified?
1261 ================= ============== ========= ================================
1262
1263..
1264
1265 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1266 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1267
1268 ============================ ============== ========= =====================
1269 String Key Value Type Required? Description
1270 ============================ ============== ========= =====================
1271 "KernargSegmentSize" integer Required The size in bytes of
1272 the kernarg segment
1273 that holds the values
1274 of the arguments to
1275 the kernel.
1276 "GroupSegmentFixedSize" integer Required The amount of group
1277 segment memory
1278 required by a
1279 work-group in
1280 bytes. This does not
1281 include any
1282 dynamically allocated
1283 group segment memory
1284 that may be added
1285 when the kernel is
1286 dispatched.
1287 "PrivateSegmentFixedSize" integer Required The amount of fixed
1288 private address space
1289 memory required for a
1290 work-item in
Tony Tye07d9f102017-11-10 01:00:54 +00001291 bytes. If the kernel
1292 uses a dynamic call
1293 stack then additional
Tony Tyef16a45e2017-06-06 20:31:59 +00001294 space must be added
1295 to this value for the
1296 call stack.
1297 "KernargSegmentAlign" integer Required The maximum byte
1298 alignment of
1299 arguments in the
1300 kernarg segment. Must
1301 be a power of 2.
1302 "WavefrontSize" integer Required Wavefront size. Must
1303 be a power of 2.
Tony Tye07d9f102017-11-10 01:00:54 +00001304 "NumSGPRs" integer Required Number of scalar
Tony Tyef16a45e2017-06-06 20:31:59 +00001305 registers used by a
1306 wavefront for
1307 GFX6-GFX9. This
1308 includes the special
1309 SGPRs for VCC, Flat
1310 Scratch (GFX7-GFX9)
1311 and XNACK (for
1312 GFX8-GFX9). It does
1313 not include the 16
1314 SGPR added if a trap
1315 handler is
1316 enabled. It is not
1317 rounded up to the
1318 allocation
1319 granularity.
Tony Tye07d9f102017-11-10 01:00:54 +00001320 "NumVGPRs" integer Required Number of vector
Tony Tyef16a45e2017-06-06 20:31:59 +00001321 registers used by
1322 each work-item for
1323 GFX6-GFX9
Tony Tye07d9f102017-11-10 01:00:54 +00001324 "MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tyef16a45e2017-06-06 20:31:59 +00001325 work-group size
1326 supported by the
1327 kernel in work-items.
Tony Tye07d9f102017-11-10 01:00:54 +00001328 Must be >=1 and
Tony Tyee039d0e2018-01-30 23:07:10 +00001329 consistent with
1330 ReqdWorkGroupSize if
1331 not 0, 0, 0.
Konstantin Zhuravlyov06ae4ec2017-11-28 17:51:08 +00001332 "NumSpilledSGPRs" integer Number of stores from
1333 a scalar register to
1334 a register allocator
1335 created spill
1336 location.
1337 "NumSpilledVGPRs" integer Number of stores from
1338 a vector register to
1339 a register allocator
1340 created spill
1341 location.
Tony Tyef16a45e2017-06-06 20:31:59 +00001342 ============================ ============== ========= =====================
1343
1344..
1345
Tony Tyef16a45e2017-06-06 20:31:59 +00001346Kernel Dispatch
1347~~~~~~~~~~~~~~~
1348
1349The HSA architected queuing language (AQL) defines a user space memory interface
1350that can be used to control the dispatch of kernels, in an agent independent
1351way. An agent can have zero or more AQL queues created for it using the ROCm
1352runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1353*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1354mechanics and packet layouts.
1355
1356The packet processor of a kernel agent is responsible for detecting and
1357dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1358packet processor is implemented by the hardware command processor (CP),
1359asynchronous dispatch controller (ADC) and shader processor input controller
1360(SPI).
1361
1362The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1363mode driver to initialize and register the AQL queue with CP.
1364
1365To dispatch a kernel the following actions are performed. This can occur in the
1366CPU host program, or from an HSA kernel executing on a GPU.
1367
13681. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1369 executed is obtained.
13702. A pointer to the kernel descriptor (see
1371 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1372 obtained. It must be for a kernel that is contained in a code object that that
1373 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1374 associated.
13753. Space is allocated for the kernel arguments using the ROCm runtime allocator
1376 for a memory region with the kernarg property for the kernel agent that will
1377 execute the kernel. It must be at least 16 byte aligned.
13784. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00001379 allocation. The layout is defined in the *HSA Programmer's Language Reference*
Tony Tyef16a45e2017-06-06 20:31:59 +00001380 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1381 memory in the same way constant memory is accessed. (Note that the HSA
1382 specification allows an implementation to copy the kernel argument contents to
1383 another location that is accessed by the kernel.)
13845. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1385 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1386 packet. The packet must be set up, and the final write must use an atomic
1387 store release to set the packet kind to ensure the packet contents are
1388 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1389 notify the kernel agent that the AQL queue has been updated. These rules, and
1390 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1391 System Architecture Specification* [HSA]_.
13926. A kernel dispatch packet includes information about the actual dispatch,
1393 such as grid and work-group size, together with information from the code
1394 object about the kernel, such as segment sizes. The ROCm runtime queries on
1395 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001396 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000013977. CP executes micro-code and is responsible for detecting and setting up the
1398 GPU to execute the wavefronts of a kernel dispatch.
13998. CP ensures that when the a wavefront starts executing the kernel machine
1400 code, the scalar general purpose registers (SGPR) and vector general purpose
1401 registers (VGPR) are set up as required by the machine code. The required
1402 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1403 register state is defined in
1404 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
14059. The prolog of the kernel machine code (see
1406 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1407 before continuing executing the machine code that corresponds to the kernel.
140810. When the kernel dispatch has completed execution, CP signals the completion
1409 signal specified in the kernel dispatch packet if not 0.
1410
1411.. _amdgpu-amdhsa-memory-spaces:
1412
1413Memory Spaces
1414~~~~~~~~~~~~~
1415
1416The memory space properties are:
1417
1418 .. table:: AMDHSA Memory Spaces
1419 :name: amdgpu-amdhsa-memory-spaces-table
1420
1421 ================= =========== ======== ======= ==================
1422 Memory Space Name HSA Segment Hardware Address NULL Value
1423 Name Name Size
1424 ================= =========== ======== ======= ==================
1425 Private private scratch 32 0x00000000
1426 Local group LDS 32 0xFFFFFFFF
1427 Global global global 64 0x0000000000000000
1428 Constant constant *same as 64 0x0000000000000000
1429 global*
1430 Generic flat flat 64 0x0000000000000000
1431 Region N/A GDS 32 *not implemented
1432 for AMDHSA*
1433 ================= =========== ======== ======= ==================
1434
1435The global and constant memory spaces both use global virtual addresses, which
1436are the same virtual address space used by the CPU. However, some virtual
1437addresses may only be accessible to the CPU, some only accessible by the GPU,
1438and some by both.
1439
1440Using the constant memory space indicates that the data will not change during
1441the execution of the kernel. This allows scalar read instructions to be
1442used. The vector and scalar L1 caches are invalidated of volatile data before
1443each kernel dispatch execution to allow constant memory to change values between
1444kernel dispatches.
1445
1446The local memory space uses the hardware Local Data Store (LDS) which is
1447automatically allocated when the hardware creates work-groups of wavefronts, and
1448freed when all the wavefronts of a work-group have terminated. The data store
1449(DS) instructions can be used to access it.
1450
1451The private memory space uses the hardware scratch memory support. If the kernel
1452uses scratch, then the hardware allocates memory that is accessed using
1453wavefront lane dword (4 byte) interleaving. The mapping used from private
1454address to physical address is:
1455
1456 ``wavefront-scratch-base +
1457 (private-address * wavefront-size * 4) +
1458 (wavefront-lane-id * 4)``
1459
1460There are different ways that the wavefront scratch base address is determined
1461by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1462memory can be accessed in an interleaved manner using buffer instruction with
Tony Tye5bbcca62018-03-08 05:46:01 +00001463the scratch buffer descriptor and per wavefront scratch offset, by the scratch
Tony Tyef16a45e2017-06-06 20:31:59 +00001464instructions, or by flat instructions. If each lane of a wavefront accesses the
1465same private address, the interleaving results in adjacent dwords being accessed
1466and hence requires fewer cache lines to be fetched. Multi-dword access is not
1467supported except by flat and scratch instructions in GFX9.
1468
1469The generic address space uses the hardware flat address support available in
1470GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1471local appertures), that are outside the range of addressible global memory, to
1472map from a flat address to a private or local address.
1473
1474FLAT instructions can take a flat address and access global, private (scratch)
1475and group (LDS) memory depending in if the address is within one of the
1476apperture ranges. Flat access to scratch requires hardware aperture setup and
1477setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1478access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1479(see :ref:`amdgpu-amdhsa-m0`).
1480
1481To convert between a segment address and a flat address the base address of the
1482appertures address can be used. For GFX7-GFX8 these are available in the
1483:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1484Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1485GFX9 the appature base addresses are directly available as inline constant
1486registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1487address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1488which makes it easier to convert from flat to segment or segment to flat.
1489
Tony Tye46d35762017-08-15 20:47:41 +00001490Image and Samplers
1491~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001492
1493Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1494hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1495HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1496enumeration values for the queries that are not trivially deducible from the S#
1497representation.
1498
1499HSA Signals
1500~~~~~~~~~~~
1501
Tony Tye46d35762017-08-15 20:47:41 +00001502HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1503structure allocated in memory accessible from both the CPU and GPU. The
1504structure is defined by the ROCm runtime and subject to change between releases
1505(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001506
1507.. _amdgpu-amdhsa-hsa-aql-queue:
1508
1509HSA AQL Queue
1510~~~~~~~~~~~~~
1511
Tony Tye46d35762017-08-15 20:47:41 +00001512The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001513between releases (see [AMD-ROCm-github]_). For some processors it contains
1514fields needed to implement certain language features such as the flat address
1515aperture bases. It also contains fields used by CP such as managing the
1516allocation of scratch memory.
1517
1518.. _amdgpu-amdhsa-kernel-descriptor:
1519
1520Kernel Descriptor
1521~~~~~~~~~~~~~~~~~
1522
1523A kernel descriptor consists of the information needed by CP to initiate the
1524execution of a kernel, including the entry point address of the machine code
1525that implements the kernel.
1526
1527Kernel Descriptor for GFX6-GFX9
1528+++++++++++++++++++++++++++++++
1529
1530CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1531
1532 .. table:: Kernel Descriptor for GFX6-GFX9
1533 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1534
Tony Tye6baa6d22017-10-18 22:16:55 +00001535 ======= ======= =============================== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001536 Bits Size Field Name Description
Tony Tye6baa6d22017-10-18 22:16:55 +00001537 ======= ======= =============================== ============================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001538 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tyef16a45e2017-06-06 20:31:59 +00001539 address space memory
1540 required for a work-group
1541 in bytes. This does not
1542 include any dynamically
1543 allocated local address
1544 space memory that may be
1545 added when the kernel is
1546 dispatched.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001547 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tyef16a45e2017-06-06 20:31:59 +00001548 private address space
1549 memory required for a
1550 work-item in bytes. If
1551 is_dynamic_callstack is 1
1552 then additional space must
1553 be added to this value for
1554 the call stack.
Tony Tye07d9f102017-11-10 01:00:54 +00001555 127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001556 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tyef16a45e2017-06-06 20:31:59 +00001557 negative) from base
1558 address of kernel
1559 descriptor to kernel's
1560 entry point instruction
1561 which must be 256 byte
1562 aligned.
Tony Tyee039d0e2018-01-30 23:07:10 +00001563 383:192 24 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001564 bytes
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001565 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001566 program settings used by
1567 CP to set up
1568 ``COMPUTE_PGM_RSRC1``
1569 configuration
1570 register. See
Tony Tye6baa6d22017-10-18 22:16:55 +00001571 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001572 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001573 program settings used by
1574 CP to set up
1575 ``COMPUTE_PGM_RSRC2``
1576 configuration
1577 register. See
1578 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001579 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1580 SGPR user data registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001581 (see
1582 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1583
1584 The total number of SGPR
1585 user data registers
1586 requested must not exceed
1587 16 and match value in
1588 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1589 Any requests beyond 16
1590 will be ignored.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001591 449 1 bit EnableSGPRDispatchPtr *see above*
1592 450 1 bit EnableSGPRQueuePtr *see above*
1593 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1594 452 1 bit EnableSGPRDispatchID *see above*
1595 453 1 bit EnableSGPRFlatScratchInit *see above*
1596 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1597 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1598 should always be 0.
1599 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1600 should always be 0.
1601 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1602 should always be 0.
Tony Tye31105cc2017-12-11 15:35:27 +00001603 463:458 6 bits Reserved, must be 0.
Tony Tye6baa6d22017-10-18 22:16:55 +00001604 511:464 6 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001605 bytes
1606 512 **Total size 64 bytes.**
Tony Tye6baa6d22017-10-18 22:16:55 +00001607 ======= ====================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001608
1609..
1610
1611 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001612 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tyef16a45e2017-06-06 20:31:59 +00001613
Tony Tye3b340612017-06-07 00:46:08 +00001614 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001615 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001616 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001617 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001618 used by each work-item,
1619 granularity is device
1620 specific:
1621
Tony Tye07d9f102017-11-10 01:00:54 +00001622 GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001623 - max_vgpr 1..256
1624 - roundup((max_vgpg + 1)
1625 / 4) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001626
1627 Used by CP to set up
1628 ``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001629 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001630 used by a wavefront,
1631 granularity is device
1632 specific:
1633
Tony Tye07d9f102017-11-10 01:00:54 +00001634 GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001635 - max_sgpr 1..112
1636 - roundup((max_sgpg + 1)
1637 / 8) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001638 GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001639 - max_sgpr 1..112
1640 - roundup((max_sgpg + 1)
1641 / 16) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001642
1643 Includes the special SGPRs
1644 for VCC, Flat Scratch (for
1645 GFX7 onwards) and XNACK
1646 (for GFX8 onwards). It does
1647 not include the 16 SGPR
1648 added if a trap handler is
1649 enabled.
1650
1651 Used by CP to set up
1652 ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001653 11:10 2 bits PRIORITY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001654
1655 Start executing wavefront
1656 at the specified priority.
1657
1658 CP is responsible for
1659 filling in
1660 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001661 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001662 with specified rounding
1663 mode for single (32
1664 bit) floating point
1665 precision floating point
1666 operations.
1667
1668 Floating point rounding
1669 mode values are defined in
1670 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1671
1672 Used by CP to set up
1673 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001674 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001675 with specified rounding
1676 denorm mode for half/double (16
1677 and 64 bit) floating point
1678 precision floating point
1679 operations.
1680
1681 Floating point rounding
1682 mode values are defined in
1683 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1684
1685 Used by CP to set up
1686 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001687 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001688 with specified denorm mode
1689 for single (32
1690 bit) floating point
1691 precision floating point
1692 operations.
1693
1694 Floating point denorm mode
1695 values are defined in
1696 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1697
1698 Used by CP to set up
1699 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001700 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001701 with specified denorm mode
1702 for half/double (16
1703 and 64 bit) floating point
1704 precision floating point
1705 operations.
1706
1707 Floating point denorm mode
1708 values are defined in
1709 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1710
1711 Used by CP to set up
1712 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001713 20 1 bit PRIV Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001714
1715 Start executing wavefront
1716 in privilege trap handler
1717 mode.
1718
1719 CP is responsible for
1720 filling in
1721 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001722 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001723 with DX10 clamp mode
1724 enabled. Used by the vector
Tony Tye6baa6d22017-10-18 22:16:55 +00001725 ALU to force DX10 style
Tony Tyef16a45e2017-06-06 20:31:59 +00001726 treatment of NaN's (when
1727 set, clamp NaN to zero,
1728 otherwise pass NaN
1729 through).
1730
1731 Used by CP to set up
1732 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001733 22 1 bit DEBUG_MODE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001734
1735 Start executing wavefront
1736 in single step mode.
1737
1738 CP is responsible for
1739 filling in
1740 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001741 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001742 with IEEE mode
1743 enabled. Floating point
1744 opcodes that support
1745 exception flag gathering
1746 will quiet and propagate
1747 signaling-NaN inputs per
1748 IEEE 754-2008. Min_dx10 and
1749 max_dx10 become IEEE
1750 754-2008 compliant due to
1751 signaling-NaN propagation
1752 and quieting.
1753
1754 Used by CP to set up
1755 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001756 24 1 bit BULKY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001757
1758 Only one work-group allowed
1759 to execute on a compute
1760 unit.
1761
1762 CP is responsible for
1763 filling in
1764 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001765 25 1 bit CDBG_USER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001766
1767 Flag that can be used to
1768 control debugging code.
1769
1770 CP is responsible for
1771 filling in
1772 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye07d9f102017-11-10 01:00:54 +00001773 26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001774 Reserved, must be 0.
1775 GFX9
1776 Wavefront starts execution
1777 with specified fp16 overflow
1778 mode.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001779
Tony Tye6baa6d22017-10-18 22:16:55 +00001780 - If 0, fp16 overflow generates
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001781 +/-INF values.
Tony Tye6baa6d22017-10-18 22:16:55 +00001782 - If 1, fp16 overflow that is the
1783 result of an +/-INF input value
1784 or divide by 0 produces a +/-INF,
1785 otherwise clamps computed
1786 overflow to +/-MAX_FP16 as
1787 appropriate.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001788
1789 Used by CP to set up
1790 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye6baa6d22017-10-18 22:16:55 +00001791 31:27 5 bits Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001792 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001793 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001794
1795..
1796
1797 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1798 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1799
Tony Tye3b340612017-06-07 00:46:08 +00001800 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001801 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001802 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001803 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye5bbcca62018-03-08 05:46:01 +00001804 _WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tyef16a45e2017-06-06 20:31:59 +00001805 system register (see
1806 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1807
1808 Used by CP to set up
1809 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001810 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tyef16a45e2017-06-06 20:31:59 +00001811 user data registers
1812 requested. This number must
1813 match the number of user
1814 data registers enabled.
1815
1816 Used by CP to set up
1817 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001818 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tyef16a45e2017-06-06 20:31:59 +00001819 TRAP instruction which
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00001820 requires a trap handler to
Tony Tyef16a45e2017-06-06 20:31:59 +00001821 be enabled.
1822
1823 CP sets
1824 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1825 if the runtime has
1826 installed a trap handler
1827 regardless of the setting
1828 of this field.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001829 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001830 system SGPR register for
1831 the work-group id in the X
1832 dimension (see
1833 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1834
1835 Used by CP to set up
1836 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001837 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001838 system SGPR register for
1839 the work-group id in the Y
1840 dimension (see
1841 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1842
1843 Used by CP to set up
1844 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001845 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001846 system SGPR register for
1847 the work-group id in the Z
1848 dimension (see
1849 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1850
1851 Used by CP to set up
1852 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001853 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001854 system SGPR register for
1855 work-group information (see
1856 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1857
1858 Used by CP to set up
1859 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001860 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001861 VGPR system registers used
1862 for the work-item ID.
1863 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1864 defines the values.
1865
1866 Used by CP to set up
1867 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001868 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001869
1870 Wavefront starts execution
1871 with address watch
1872 exceptions enabled which
1873 are generated when L1 has
1874 witnessed a thread access
1875 an *address of
1876 interest*.
1877
1878 CP is responsible for
1879 filling in the address
1880 watch bit in
1881 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1882 according to what the
1883 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001884 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001885
1886 Wavefront starts execution
1887 with memory violation
1888 exceptions exceptions
1889 enabled which are generated
1890 when a memory violation has
Tony Tye5bbcca62018-03-08 05:46:01 +00001891 occurred for this wavefront from
Tony Tyef16a45e2017-06-06 20:31:59 +00001892 L1 or LDS
1893 (write-to-read-only-memory,
1894 mis-aligned atomic, LDS
1895 address out of range,
1896 illegal address, etc.).
1897
1898 CP sets the memory
1899 violation bit in
1900 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1901 according to what the
1902 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001903 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001904
1905 CP uses the rounded value
1906 from the dispatch packet,
1907 not this value, as the
1908 dispatch may contain
1909 dynamically allocated group
1910 segment memory. CP writes
1911 directly to
1912 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1913
1914 Amount of group segment
1915 (LDS) to allocate for each
1916 work-group. Granularity is
1917 device specific:
1918
1919 GFX6:
1920 roundup(lds-size / (64 * 4))
1921 GFX7-GFX9:
1922 roundup(lds-size / (128 * 4))
1923
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001924 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1925 _INVALID_OPERATION with specified exceptions
Tony Tyef16a45e2017-06-06 20:31:59 +00001926 enabled.
1927
1928 Used by CP to set up
1929 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1930 (set from bits 0..6).
1931
1932 IEEE 754 FP Invalid
1933 Operation
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001934 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1935 _SOURCE input operands is a
Tony Tyef16a45e2017-06-06 20:31:59 +00001936 denormal number
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001937 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1938 _DIVISION_BY_ZERO Zero
1939 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1940 _OVERFLOW
1941 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1942 _UNDERFLOW
1943 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1944 _INEXACT
1945 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1946 _ZERO (rcp_iflag_f32 instruction
Tony Tyef16a45e2017-06-06 20:31:59 +00001947 only)
Tony Tye6baa6d22017-10-18 22:16:55 +00001948 31 1 bit Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001949 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001950 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001951
1952..
1953
1954 .. table:: Floating Point Rounding Mode Enumeration Values
1955 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1956
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001957 ====================================== ===== ==============================
1958 Enumeration Name Value Description
1959 ====================================== ===== ==============================
1960 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1961 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1962 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1963 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1964 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001965
1966..
1967
1968 .. table:: Floating Point Denorm Mode Enumeration Values
1969 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1970
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001971 ====================================== ===== ==============================
1972 Enumeration Name Value Description
1973 ====================================== ===== ==============================
1974 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1975 Denorms
1976 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1977 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1978 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1979 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001980
1981..
1982
1983 .. table:: System VGPR Work-Item ID Enumeration Values
1984 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1985
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001986 ======================================== ===== ============================
1987 Enumeration Name Value Description
1988 ======================================== ===== ============================
1989 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1990 ID.
1991 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1992 dimensions ID.
1993 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1994 dimensions ID.
1995 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1996 ======================================== ===== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001997
1998.. _amdgpu-amdhsa-initial-kernel-execution-state:
1999
2000Initial Kernel Execution State
2001~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2002
2003This section defines the register state that will be set up by the packet
2004processor prior to the start of execution of every wavefront. This is limited by
2005the constraints of the hardware controllers of CP/ADC/SPI.
2006
2007The order of the SGPR registers is defined, but the compiler can specify which
2008ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2009fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2010for enabled registers are dense starting at SGPR0: the first enabled register is
2011SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2012an SGPR number.
2013
2014The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
Tony Tye5bbcca62018-03-08 05:46:01 +00002015all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
Tony Tyef16a45e2017-06-06 20:31:59 +00002016the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2017initialized. These are then immediately followed by the System SGPRs that are
Tony Tye5bbcca62018-03-08 05:46:01 +00002018set up by ADC/SPI and can have different values for each wavefront of the grid
Tony Tyef16a45e2017-06-06 20:31:59 +00002019dispatch.
2020
2021SGPR register initial state is defined in
2022:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2023
2024 .. table:: SGPR Register Set Up Order
2025 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2026
2027 ========== ========================== ====== ==============================
2028 SGPR Order Name Number Description
2029 (kernel descriptor enable of
2030 field) SGPRs
2031 ========== ========================== ====== ==============================
2032 First Private Segment Buffer 4 V# that can be used, together
Tony Tye5bbcca62018-03-08 05:46:01 +00002033 (enable_sgpr_private with Scratch Wavefront Offset
2034 _segment_buffer) as an offset, to access the
2035 private memory space using a
2036 segment address.
Tony Tyef16a45e2017-06-06 20:31:59 +00002037
2038 CP uses the value provided by
2039 the runtime.
2040 then Dispatch Ptr 2 64 bit address of AQL dispatch
2041 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2042 actually executing.
2043 then Queue Ptr 2 64 bit address of amd_queue_t
2044 (enable_sgpr_queue_ptr) object for AQL queue on which
2045 the dispatch packet was
2046 queued.
2047 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2048 (enable_sgpr_kernarg segment. This is directly
2049 _segment_ptr) copied from the
2050 kernarg_address in the kernel
2051 dispatch packet.
2052
2053 Having CP load it once avoids
2054 loading it at the beginning of
2055 every wavefront.
2056 then Dispatch Id 2 64 bit Dispatch ID of the
2057 (enable_sgpr_dispatch_id) dispatch packet being
2058 executed.
2059 then Flat Scratch Init 2 This is 2 SGPRs:
2060 (enable_sgpr_flat_scratch
2061 _init) GFX6
2062 Not supported.
2063 GFX7-GFX8
2064 The first SGPR is a 32 bit
2065 byte offset from
2066 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2067 to per SPI base of memory
2068 for scratch for the queue
2069 executing the kernel
2070 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00002071 from the runtime. (The
2072 Scratch Segment Buffer base
2073 address is
2074 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2075 plus this offset.) The value
Tony Tye5bbcca62018-03-08 05:46:01 +00002076 of Scratch Wavefront Offset must
Tony Tye46d35762017-08-15 20:47:41 +00002077 be added to this offset by
2078 the kernel machine code,
2079 right shifted by 8, and
2080 moved to the FLAT_SCRATCH_HI
2081 SGPR register.
2082 FLAT_SCRATCH_HI corresponds
2083 to SGPRn-4 on GFX7, and
2084 SGPRn-6 on GFX8 (where SGPRn
2085 is the highest numbered SGPR
Tony Tye5bbcca62018-03-08 05:46:01 +00002086 allocated to the wavefront).
Tony Tye46d35762017-08-15 20:47:41 +00002087 FLAT_SCRATCH_HI is
2088 multiplied by 256 (as it is
2089 in units of 256 bytes) and
2090 added to
2091 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye5bbcca62018-03-08 05:46:01 +00002092 to calculate the per wavefront
Tony Tye46d35762017-08-15 20:47:41 +00002093 FLAT SCRATCH BASE in flat
2094 memory instructions that
2095 access the scratch
2096 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002097
2098 The second SGPR is 32 bit
2099 byte size of a single
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002100 work-item's scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002101 usage. CP obtains this from
2102 the runtime, and it is
2103 always a multiple of DWORD.
2104 CP checks that the value in
2105 the kernel dispatch packet
2106 Private Segment Byte Size is
2107 not larger, and requests the
2108 runtime to increase the
2109 queue's scratch size if
2110 necessary. The kernel code
2111 must move it to
2112 FLAT_SCRATCH_LO which is
2113 SGPRn-3 on GFX7 and SGPRn-5
2114 on GFX8. FLAT_SCRATCH_LO is
2115 used as the FLAT SCRATCH
2116 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002117 instructions. Having CP load
2118 it once avoids loading it at
2119 the beginning of every
Tony Tyef59d0712017-11-10 20:51:43 +00002120 wavefront.
2121 GFX9
2122 This is the
Tony Tye46d35762017-08-15 20:47:41 +00002123 64 bit base address of the
2124 per SPI scratch backing
2125 memory managed by SPI for
2126 the queue executing the
2127 kernel dispatch. CP obtains
2128 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002129 divides it if there are
2130 multiple Shader Arrays each
2131 with its own SPI). The value
Tony Tye5bbcca62018-03-08 05:46:01 +00002132 of Scratch Wavefront Offset must
Tony Tyef16a45e2017-06-06 20:31:59 +00002133 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002134 machine code and the result
2135 moved to the FLAT_SCRATCH
2136 SGPR which is SGPRn-6 and
2137 SGPRn-5. It is used as the
2138 FLAT SCRATCH BASE in flat
Tony Tyef59d0712017-11-10 20:51:43 +00002139 memory instructions.
2140 then Private Segment Size 1 The 32 bit byte size of a
2141 (enable_sgpr_private single
2142 work-item's
2143 scratch_segment_size) memory
2144 allocation. This is the
2145 value from the kernel
2146 dispatch packet Private
2147 Segment Byte Size rounded up
2148 by CP to a multiple of
2149 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002150
2151 Having CP load it once avoids
2152 loading it at the beginning of
2153 every wavefront.
2154
2155 This is not used for
2156 GFX7-GFX8 since it is the same
2157 value as the second SGPR of
2158 Flat Scratch Init. However, it
2159 may be needed for GFX9 which
2160 changes the meaning of the
2161 Flat Scratch Init value.
2162 then Grid Work-Group Count X 1 32 bit count of the number of
2163 (enable_sgpr_grid work-groups in the X dimension
2164 _workgroup_count_X) for the grid being
2165 executed. Computed from the
2166 fields in the kernel dispatch
2167 packet as ((grid_size.x +
2168 workgroup_size.x - 1) /
2169 workgroup_size.x).
2170 then Grid Work-Group Count Y 1 32 bit count of the number of
2171 (enable_sgpr_grid work-groups in the Y dimension
2172 _workgroup_count_Y && for the grid being
2173 less than 16 previous executed. Computed from the
2174 SGPRs) fields in the kernel dispatch
2175 packet as ((grid_size.y +
2176 workgroup_size.y - 1) /
2177 workgroupSize.y).
2178
2179 Only initialized if <16
2180 previous SGPRs initialized.
2181 then Grid Work-Group Count Z 1 32 bit count of the number of
2182 (enable_sgpr_grid work-groups in the Z dimension
2183 _workgroup_count_Z && for the grid being
2184 less than 16 previous executed. Computed from the
2185 SGPRs) fields in the kernel dispatch
2186 packet as ((grid_size.z +
2187 workgroup_size.z - 1) /
2188 workgroupSize.z).
2189
2190 Only initialized if <16
2191 previous SGPRs initialized.
2192 then Work-Group Id X 1 32 bit work-group id in X
2193 (enable_sgpr_workgroup_id dimension of grid for
2194 _X) wavefront.
2195 then Work-Group Id Y 1 32 bit work-group id in Y
2196 (enable_sgpr_workgroup_id dimension of grid for
2197 _Y) wavefront.
2198 then Work-Group Id Z 1 32 bit work-group id in Z
2199 (enable_sgpr_workgroup_id dimension of grid for
2200 _Z) wavefront.
Tony Tye5bbcca62018-03-08 05:46:01 +00002201 then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tyef16a45e2017-06-06 20:31:59 +00002202 (enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye5bbcca62018-03-08 05:46:01 +00002203 _info) threadgroup_size_in_wavefronts[5:0]}
2204 then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tyef16a45e2017-06-06 20:31:59 +00002205 (enable_sgpr_private of scratch base of queue
Tony Tye5bbcca62018-03-08 05:46:01 +00002206 _segment_wavefront_offset) executing the kernel
Tony Tyef16a45e2017-06-06 20:31:59 +00002207 dispatch. Must be used as an
2208 offset with Private
2209 segment address when using
2210 Scratch Segment Buffer. It
2211 must be used to set up FLAT
2212 SCRATCH for flat addressing
2213 (see
2214 :ref:`amdgpu-amdhsa-flat-scratch`).
2215 ========== ========================== ====== ==============================
2216
2217The order of the VGPR registers is defined, but the compiler can specify which
2218ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2219fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2220for enabled registers are dense starting at VGPR0: the first enabled register is
2221VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2222VGPR number.
2223
2224VGPR register initial state is defined in
2225:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2226
2227 .. table:: VGPR Register Set Up Order
2228 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2229
2230 ========== ========================== ====== ==============================
2231 VGPR Order Name Number Description
2232 (kernel descriptor enable of
2233 field) VGPRs
2234 ========== ========================== ====== ==============================
2235 First Work-Item Id X 1 32 bit work item id in X
2236 (Always initialized) dimension of work-group for
2237 wavefront lane.
2238 then Work-Item Id Y 1 32 bit work item id in Y
2239 (enable_vgpr_workitem_id dimension of work-group for
2240 > 0) wavefront lane.
2241 then Work-Item Id Z 1 32 bit work item id in Z
2242 (enable_vgpr_workitem_id dimension of work-group for
2243 > 1) wavefront lane.
2244 ========== ========================== ====== ==============================
2245
2246The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2247
22481. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2249 registers.
22502. Work-group Id registers X, Y, Z are set by ADC which supports any
2251 combination including none.
Tony Tye5bbcca62018-03-08 05:46:01 +000022523. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
2253 its value cannot included with the flat scratch init value which is per queue.
Tony Tyef16a45e2017-06-06 20:31:59 +000022544. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2255 or (X, Y, Z).
2256
2257Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2258value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2259
2260The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye07d9f102017-11-10 01:00:54 +00002261has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tyef16a45e2017-06-06 20:31:59 +00002262instructions (GFX9).
2263
2264If buffer operations are used then the compiler can generate a V# with the
2265following properties:
2266
2267* base address of 0
2268* no swizzle
2269* ATC: 1 if IOMMU present (such as APU)
2270* ptr64: 1
2271* MTYPE set to support memory coherence that matches the runtime (such as CC for
2272 APU and NC for dGPU).
2273
2274.. _amdgpu-amdhsa-kernel-prolog:
2275
2276Kernel Prolog
2277~~~~~~~~~~~~~
2278
2279.. _amdgpu-amdhsa-m0:
2280
2281M0
2282++
2283
2284GFX6-GFX8
2285 The M0 register must be initialized with a value at least the total LDS size
2286 if the kernel may access LDS via DS or flat operations. Total LDS size is
2287 available in dispatch packet. For M0, it is also possible to use maximum
2288 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2289 GFX7-GFX8).
2290GFX9
2291 The M0 register is not used for range checking LDS accesses and so does not
2292 need to be initialized in the prolog.
2293
2294.. _amdgpu-amdhsa-flat-scratch:
2295
2296Flat Scratch
2297++++++++++++
2298
2299If the kernel may use flat operations to access scratch memory, the prolog code
2300must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
Tony Tye5bbcca62018-03-08 05:46:01 +00002301are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Tony Tyef16a45e2017-06-06 20:31:59 +00002302Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2303
2304GFX6
2305 Flat scratch is not supported.
2306
Tony Tye07d9f102017-11-10 01:00:54 +00002307GFX7-GFX8
Tony Tyef16a45e2017-06-06 20:31:59 +00002308 1. The low word of Flat Scratch Init is 32 bit byte offset from
2309 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2310 being managed by SPI for the queue executing the kernel dispatch. This is
2311 the same value used in the Scratch Segment Buffer V# base address. The
Tony Tye5bbcca62018-03-08 05:46:01 +00002312 prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
Tony Tyef16a45e2017-06-06 20:31:59 +00002313 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2314 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2315 by 8 before moving into FLAT_SCRATCH_LO.
2316 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2317 work-items scratch memory usage. This is directly loaded from the kernel
2318 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2319 DWORD. Having CP load it once avoids loading it at the beginning of every
2320 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2321 SIZE.
Tony Tyef59d0712017-11-10 20:51:43 +00002322
Tony Tyef16a45e2017-06-06 20:31:59 +00002323GFX9
2324 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2325 memory being managed by SPI for the queue executing the kernel dispatch. The
Tony Tye5bbcca62018-03-08 05:46:01 +00002326 prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
Tony Tyef16a45e2017-06-06 20:31:59 +00002327 pair for use as the flat scratch base in flat memory instructions.
2328
2329.. _amdgpu-amdhsa-memory-model:
2330
2331Memory Model
2332~~~~~~~~~~~~
2333
2334This section describes the mapping of LLVM memory model onto AMDGPU machine code
2335(see :ref:`memmodel`). *The implementation is WIP.*
2336
2337.. TODO
2338 Update when implementation complete.
2339
Tony Tyef16a45e2017-06-06 20:31:59 +00002340The AMDGPU backend supports the memory synchronization scopes specified in
2341:ref:`amdgpu-memory-scopes`.
2342
2343The code sequences used to implement the memory model are defined in table
2344:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2345
2346The sequences specify the order of instructions that a single thread must
2347execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2348to other memory instructions executed by the same thread. This allows them to be
2349moved earlier or later which can allow them to be combined with other instances
2350of the same instruction, or hoisted/sunk out of loops to improve
2351performance. Only the instructions related to the memory model are given;
2352additional ``s_waitcnt`` instructions are required to ensure registers are
2353defined before being used. These may be able to be combined with the memory
2354model ``s_waitcnt`` instructions as described above.
2355
Tony Tye6baa6d22017-10-18 22:16:55 +00002356The AMDGPU backend supports the following memory models:
2357
2358 HSA Memory Model [HSA]_
2359 The HSA memory model uses a single happens-before relation for all address
2360 spaces (see :ref:`amdgpu-address-spaces`).
2361 OpenCL Memory Model [OpenCL]_
2362 The OpenCL memory model which has separate happens-before relations for the
2363 global and local address spaces. Only a fence specifying both global and
2364 local address space, and seq_cst instructions join the relationships. Since
2365 the LLVM ``memfence`` instruction does not allow an address space to be
2366 specified the OpenCL fence has to convervatively assume both local and
2367 global address space was specified. However, optimizations can often be
2368 done to eliminate the additional ``s_waitcnt`` instructions when there are
2369 no intervening memory instructions which access the corresponding address
2370 space. The code sequences in the table indicate what can be omitted for the
2371 OpenCL memory. The target triple environment is used to determine if the
2372 source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tyef16a45e2017-06-06 20:31:59 +00002373
2374``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2375operations.
2376
2377``buffer/global/flat_load/store/atomic`` instructions to global memory are
2378termed vector memory operations.
2379
2380For GFX6-GFX9:
2381
2382* Each agent has multiple compute units (CU).
2383* Each CU has multiple SIMDs that execute wavefronts.
2384* The wavefronts for a single work-group are executed in the same CU but may be
2385 executed by different SIMDs.
2386* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2387 executing on it.
2388* All LDS operations of a CU are performed as wavefront wide operations in a
2389 global order and involve no caching. Completion is reported to a wavefront in
2390 execution order.
2391* The LDS memory has multiple request queues shared by the SIMDs of a
Tony Tye5bbcca62018-03-08 05:46:01 +00002392 CU. Therefore, the LDS operations performed by different wavefronts of a work-group
Tony Tyef16a45e2017-06-06 20:31:59 +00002393 can be reordered relative to each other, which can result in reordering the
2394 visibility of vector memory operations with respect to LDS operations of other
2395 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002396 ensure synchronization between LDS operations and vector memory operations
Tony Tye5bbcca62018-03-08 05:46:01 +00002397 between wavefronts of a work-group, but not between operations performed by the
Tony Tyef16a45e2017-06-06 20:31:59 +00002398 same wavefront.
2399* The vector memory operations are performed as wavefront wide operations and
2400 completion is reported to a wavefront in execution order. The exception is
Tony Tye07d9f102017-11-10 01:00:54 +00002401 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tyef16a45e2017-06-06 20:31:59 +00002402 vector memory order if they access LDS memory, and out of LDS operation order
2403 if they access global memory.
Tony Tye6baa6d22017-10-18 22:16:55 +00002404* The vector memory operations access a single vector L1 cache shared by all
2405 SIMDs a CU. Therefore, no special action is required for coherence between the
2406 lanes of a single wavefront, or for coherence between wavefronts in the same
Tony Tye5bbcca62018-03-08 05:46:01 +00002407 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
Tony Tye6baa6d22017-10-18 22:16:55 +00002408 executing in different work-groups as they may be executing on different CUs.
Tony Tyef16a45e2017-06-06 20:31:59 +00002409* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2410 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2411 scalar operations are used in a restricted way so do not impact the memory
2412 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2413* The vector and scalar memory operations use an L2 cache shared by all CUs on
2414 the same agent.
2415* The L2 cache has independent channels to service disjoint ranges of virtual
2416 addresses.
2417* Each CU has a separate request queue per channel. Therefore, the vector and
Tony Tye5bbcca62018-03-08 05:46:01 +00002418 scalar memory operations performed by wavefronts executing in different work-groups
Tony Tyef16a45e2017-06-06 20:31:59 +00002419 (which may be executing on different CUs) of an agent can be reordered
2420 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002421 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002422 previous vector memory operation has completed before executing a subsequent
2423 vector memory or LDS operation and so can be used to meet the requirements of
2424 acquire and release.
2425* The L2 cache can be kept coherent with other agents on some targets, or ranges
2426 of virtual addresses can be set up to bypass it to ensure system coherence.
2427
Tony Tye07d9f102017-11-10 01:00:54 +00002428Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tyef16a45e2017-06-06 20:31:59 +00002429or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2430memory, atomic memory orderings are not meaningful and all accesses are treated
2431as non-atomic.
2432
2433Constant address space uses ``buffer/global_load`` instructions (or equivalent
2434scalar memory instructions). Since the constant address space contents do not
2435change during the execution of a kernel dispatch it is not legal to perform
2436stores, and atomic memory orderings are not meaningful and all access are
2437treated as non-atomic.
2438
2439A memory synchronization scope wider than work-group is not meaningful for the
2440group (LDS) address space and is treated as work-group.
2441
2442The memory model does not support the region address space which is treated as
2443non-atomic.
2444
2445Acquire memory ordering is not meaningful on store atomic instructions and is
2446treated as non-atomic.
2447
2448Release memory ordering is not meaningful on load atomic instructions and is
2449treated a non-atomic.
2450
2451Acquire-release memory ordering is not meaningful on load or store atomic
2452instructions and is treated as acquire and release respectively.
2453
2454AMDGPU backend only uses scalar memory operations to access memory that is
2455proven to not change during the execution of the kernel dispatch. This includes
2456constant address space and global address space for program scope const
2457variables. Therefore the kernel machine code does not have to maintain the
2458scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2459and vector L1 caches are invalidated between kernel dispatches by CP since
2460constant address space data may change between kernel dispatch executions. See
2461:ref:`amdgpu-amdhsa-memory-spaces`.
2462
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002463The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002464case the AMDGPU backend ensures the memory location used to spill is never
2465accessed by vector memory operations at the same time. If scalar writes are used
2466then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2467return since the locations may be used for vector memory instructions by a
Tony Tye5bbcca62018-03-08 05:46:01 +00002468future wavefront that uses the same scratch area, or a function call that creates a
Tony Tyef16a45e2017-06-06 20:31:59 +00002469frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2470as all scalar writes are write-before-read in the same thread.
2471
Tony Tye6baa6d22017-10-18 22:16:55 +00002472Scratch backing memory (which is used for the private address space)
2473is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2474address space is only accessed by a single thread, and is always
2475write-before-read, there is never a need to invalidate these entries from the L1
2476cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2477volatile cache lines.
Tony Tyef16a45e2017-06-06 20:31:59 +00002478
2479On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye6baa6d22017-10-18 22:16:55 +00002480to invalidate the L2 cache. This also causes it to be treated as
2481non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2482(cache coherent) and so the L2 cache will coherent with the CPU and other
2483agents.
Tony Tyef16a45e2017-06-06 20:31:59 +00002484
2485 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2486 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2487
Tony Tye6baa6d22017-10-18 22:16:55 +00002488 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002489 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2490 Ordering Sync Scope Address
2491 Space
Tony Tye6baa6d22017-10-18 22:16:55 +00002492 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002493 **Non-Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002494 -----------------------------------------------------------------------------------
2495 load *none* *none* - global - !volatile & !nontemporal
2496 - generic
2497 - private 1. buffer/global/flat_load
2498 - constant
2499 - volatile & !nontemporal
2500
Tony Tyef16a45e2017-06-06 20:31:59 +00002501 1. buffer/global/flat_load
2502 glc=1
Tony Tye6baa6d22017-10-18 22:16:55 +00002503
2504 - nontemporal
2505
2506 1. buffer/global/flat_load
2507 glc=1 slc=1
2508
Tony Tyef16a45e2017-06-06 20:31:59 +00002509 load *none* *none* - local 1. ds_load
Tony Tye6baa6d22017-10-18 22:16:55 +00002510 store *none* *none* - global - !nontemporal
Tony Tyef16a45e2017-06-06 20:31:59 +00002511 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002512 - private 1. buffer/global/flat_store
2513 - constant
2514 - nontemporal
2515
2516 1. buffer/global/flat_stote
2517 glc=1 slc=1
2518
Tony Tyef16a45e2017-06-06 20:31:59 +00002519 store *none* *none* - local 1. ds_store
2520 **Unordered Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002521 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002522 load atomic unordered *any* *any* *Same as non-atomic*.
2523 store atomic unordered *any* *any* *Same as non-atomic*.
2524 atomicrmw unordered *any* *any* *Same as monotonic
2525 atomic*.
2526 **Monotonic Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002527 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002528 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2529 - wavefront - generic
2530 - workgroup
2531 load atomic monotonic - singlethread - local 1. ds_load
2532 - wavefront
2533 - workgroup
2534 load atomic monotonic - agent - global 1. buffer/global/flat_load
2535 - system - generic glc=1
2536 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2537 - wavefront - generic
2538 - workgroup
2539 - agent
2540 - system
2541 store atomic monotonic - singlethread - local 1. ds_store
2542 - wavefront
2543 - workgroup
2544 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2545 - wavefront - generic
2546 - workgroup
2547 - agent
2548 - system
2549 atomicrmw monotonic - singlethread - local 1. ds_atomic
2550 - wavefront
2551 - workgroup
2552 **Acquire Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002553 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002554 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2555 - wavefront - local
2556 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002557 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2558 load atomic acquire - workgroup - local 1. ds_load
2559 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002560
Tony Tye6baa6d22017-10-18 22:16:55 +00002561 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002562 - Must happen before
2563 any following
2564 global/generic
2565 load/load
2566 atomic/store/store
2567 atomic/atomicrmw.
2568 - Ensures any
2569 following global
2570 data read is no
2571 older than the load
2572 atomic value being
2573 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00002574 load atomic acquire - workgroup - generic 1. flat_load
2575 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002576
Tony Tye6baa6d22017-10-18 22:16:55 +00002577 - If OpenCL, omit.
2578 - Must happen before
2579 any following
2580 global/generic
2581 load/load
2582 atomic/store/store
2583 atomic/atomicrmw.
2584 - Ensures any
2585 following global
2586 data read is no
2587 older than the load
2588 atomic value being
2589 acquired.
2590 load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tyef16a45e2017-06-06 20:31:59 +00002591 - system glc=1
2592 2. s_waitcnt vmcnt(0)
2593
2594 - Must happen before
2595 following
2596 buffer_wbinvl1_vol.
2597 - Ensures the load
2598 has completed
2599 before invalidating
2600 the cache.
2601
2602 3. buffer_wbinvl1_vol
2603
2604 - Must happen before
2605 any following
2606 global/generic
2607 load/load
2608 atomic/atomicrmw.
2609 - Ensures that
2610 following
2611 loads will not see
2612 stale global data.
2613
2614 load atomic acquire - agent - generic 1. flat_load glc=1
2615 - system 2. s_waitcnt vmcnt(0) &
2616 lgkmcnt(0)
2617
2618 - If OpenCL omit
2619 lgkmcnt(0).
2620 - Must happen before
2621 following
2622 buffer_wbinvl1_vol.
2623 - Ensures the flat_load
2624 has completed
2625 before invalidating
2626 the cache.
2627
2628 3. buffer_wbinvl1_vol
2629
2630 - Must happen before
2631 any following
2632 global/generic
2633 load/load
2634 atomic/atomicrmw.
2635 - Ensures that
2636 following loads
2637 will not see stale
2638 global data.
2639
2640 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2641 - wavefront - local
2642 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002643 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2644 atomicrmw acquire - workgroup - local 1. ds_atomic
2645 2. waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002646
Tony Tye6baa6d22017-10-18 22:16:55 +00002647 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002648 - Must happen before
2649 any following
2650 global/generic
2651 load/load
2652 atomic/store/store
2653 atomic/atomicrmw.
2654 - Ensures any
2655 following global
2656 data read is no
2657 older than the
2658 atomicrmw value
2659 being acquired.
2660
Tony Tye6baa6d22017-10-18 22:16:55 +00002661 atomicrmw acquire - workgroup - generic 1. flat_atomic
2662 2. waitcnt lgkmcnt(0)
2663
2664 - If OpenCL, omit.
2665 - Must happen before
2666 any following
2667 global/generic
2668 load/load
2669 atomic/store/store
2670 atomic/atomicrmw.
2671 - Ensures any
2672 following global
2673 data read is no
2674 older than the
2675 atomicrmw value
2676 being acquired.
2677
2678 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002679 - system 2. s_waitcnt vmcnt(0)
2680
2681 - Must happen before
2682 following
2683 buffer_wbinvl1_vol.
2684 - Ensures the
2685 atomicrmw has
2686 completed before
2687 invalidating the
2688 cache.
2689
2690 3. buffer_wbinvl1_vol
2691
2692 - Must happen before
2693 any following
2694 global/generic
2695 load/load
2696 atomic/atomicrmw.
2697 - Ensures that
2698 following loads
2699 will not see stale
2700 global data.
2701
2702 atomicrmw acquire - agent - generic 1. flat_atomic
2703 - system 2. s_waitcnt vmcnt(0) &
2704 lgkmcnt(0)
2705
2706 - If OpenCL, omit
2707 lgkmcnt(0).
2708 - Must happen before
2709 following
2710 buffer_wbinvl1_vol.
2711 - Ensures the
2712 atomicrmw has
2713 completed before
2714 invalidating the
2715 cache.
2716
2717 3. buffer_wbinvl1_vol
2718
2719 - Must happen before
2720 any following
2721 global/generic
2722 load/load
2723 atomic/atomicrmw.
2724 - Ensures that
2725 following loads
2726 will not see stale
2727 global data.
2728
2729 fence acquire - singlethread *none* *none*
2730 - wavefront
2731 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2732
2733 - If OpenCL and
2734 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002735 not generic, omit.
2736 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002737 currently has no
2738 address space on
2739 the fence need to
2740 conservatively
2741 always generate. If
2742 fence had an
2743 address space then
2744 set to address
2745 space of OpenCL
2746 fence flag, or to
2747 generic if both
2748 local and global
2749 flags are
2750 specified.
2751 - Must happen after
2752 any preceding
2753 local/generic load
2754 atomic/atomicrmw
2755 with an equal or
2756 wider sync scope
2757 and memory ordering
2758 stronger than
2759 unordered (this is
2760 termed the
2761 fence-paired-atomic).
2762 - Must happen before
2763 any following
2764 global/generic
2765 load/load
2766 atomic/store/store
2767 atomic/atomicrmw.
2768 - Ensures any
2769 following global
2770 data read is no
2771 older than the
2772 value read by the
2773 fence-paired-atomic.
2774
Tony Tye6baa6d22017-10-18 22:16:55 +00002775 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2776 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002777
2778 - If OpenCL and
2779 address space is
2780 not generic, omit
2781 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00002782 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002783 currently has no
2784 address space on
2785 the fence need to
2786 conservatively
2787 always generate
2788 (see comment for
2789 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002790 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002791 separate s_waitcnt
2792 vmcnt(0) and
2793 s_waitcnt
2794 lgkmcnt(0) to allow
2795 them to be
2796 independently moved
2797 according to the
2798 following rules.
2799 - s_waitcnt vmcnt(0)
2800 must happen after
2801 any preceding
2802 global/generic load
2803 atomic/atomicrmw
2804 with an equal or
2805 wider sync scope
2806 and memory ordering
2807 stronger than
2808 unordered (this is
2809 termed the
2810 fence-paired-atomic).
2811 - s_waitcnt lgkmcnt(0)
2812 must happen after
2813 any preceding
Tony Tye6baa6d22017-10-18 22:16:55 +00002814 local/generic load
Tony Tyef16a45e2017-06-06 20:31:59 +00002815 atomic/atomicrmw
2816 with an equal or
2817 wider sync scope
2818 and memory ordering
2819 stronger than
2820 unordered (this is
2821 termed the
2822 fence-paired-atomic).
2823 - Must happen before
2824 the following
2825 buffer_wbinvl1_vol.
2826 - Ensures that the
2827 fence-paired atomic
2828 has completed
2829 before invalidating
2830 the
2831 cache. Therefore
2832 any following
2833 locations read must
2834 be no older than
2835 the value read by
2836 the
2837 fence-paired-atomic.
2838
2839 2. buffer_wbinvl1_vol
2840
Tony Tye6baa6d22017-10-18 22:16:55 +00002841 - Must happen before any
2842 following global/generic
Tony Tyef16a45e2017-06-06 20:31:59 +00002843 load/load
2844 atomic/store/store
2845 atomic/atomicrmw.
2846 - Ensures that
2847 following loads
2848 will not see stale
2849 global data.
2850
2851 **Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002852 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002853 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2854 - wavefront - local
2855 - generic
2856 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002857
2858 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002859 - Must happen after
2860 any preceding
2861 local/generic
2862 load/store/load
2863 atomic/store
2864 atomic/atomicrmw.
2865 - Must happen before
2866 the following
2867 store.
2868 - Ensures that all
2869 memory operations
2870 to local have
2871 completed before
2872 performing the
2873 store that is being
2874 released.
2875
2876 2. buffer/global/flat_store
2877 store atomic release - workgroup - local 1. ds_store
Tony Tye6baa6d22017-10-18 22:16:55 +00002878 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2879
2880 - If OpenCL, omit.
2881 - Must happen after
2882 any preceding
2883 local/generic
2884 load/store/load
2885 atomic/store
2886 atomic/atomicrmw.
2887 - Must happen before
2888 the following
2889 store.
2890 - Ensures that all
2891 memory operations
2892 to local have
2893 completed before
2894 performing the
2895 store that is being
2896 released.
2897
2898 2. flat_store
2899 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2900 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002901
2902 - If OpenCL, omit
2903 lgkmcnt(0).
2904 - Could be split into
2905 separate s_waitcnt
2906 vmcnt(0) and
2907 s_waitcnt
2908 lgkmcnt(0) to allow
2909 them to be
2910 independently moved
2911 according to the
2912 following rules.
2913 - s_waitcnt vmcnt(0)
2914 must happen after
2915 any preceding
2916 global/generic
2917 load/store/load
2918 atomic/store
2919 atomic/atomicrmw.
2920 - s_waitcnt lgkmcnt(0)
2921 must happen after
2922 any preceding
2923 local/generic
2924 load/store/load
2925 atomic/store
2926 atomic/atomicrmw.
2927 - Must happen before
2928 the following
2929 store.
2930 - Ensures that all
2931 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00002932 to memory have
Tony Tyef16a45e2017-06-06 20:31:59 +00002933 completed before
2934 performing the
2935 store that is being
2936 released.
2937
2938 2. buffer/global/ds/flat_store
2939 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2940 - wavefront - local
2941 - generic
2942 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002943
2944 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002945 - Must happen after
2946 any preceding
2947 local/generic
2948 load/store/load
2949 atomic/store
2950 atomic/atomicrmw.
2951 - Must happen before
2952 the following
2953 atomicrmw.
2954 - Ensures that all
2955 memory operations
2956 to local have
2957 completed before
2958 performing the
2959 atomicrmw that is
2960 being released.
2961
2962 2. buffer/global/flat_atomic
2963 atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye6baa6d22017-10-18 22:16:55 +00002964 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2965
2966 - If OpenCL, omit.
2967 - Must happen after
2968 any preceding
2969 local/generic
2970 load/store/load
2971 atomic/store
2972 atomic/atomicrmw.
2973 - Must happen before
2974 the following
2975 atomicrmw.
2976 - Ensures that all
2977 memory operations
2978 to local have
2979 completed before
2980 performing the
2981 atomicrmw that is
2982 being released.
2983
2984 2. flat_atomic
2985 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
2986 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002987
2988 - If OpenCL, omit
2989 lgkmcnt(0).
2990 - Could be split into
2991 separate s_waitcnt
2992 vmcnt(0) and
2993 s_waitcnt
2994 lgkmcnt(0) to allow
2995 them to be
2996 independently moved
2997 according to the
2998 following rules.
2999 - s_waitcnt vmcnt(0)
3000 must happen after
3001 any preceding
3002 global/generic
3003 load/store/load
3004 atomic/store
3005 atomic/atomicrmw.
3006 - s_waitcnt lgkmcnt(0)
3007 must happen after
3008 any preceding
3009 local/generic
3010 load/store/load
3011 atomic/store
3012 atomic/atomicrmw.
3013 - Must happen before
3014 the following
3015 atomicrmw.
3016 - Ensures that all
3017 memory operations
3018 to global and local
3019 have completed
3020 before performing
3021 the atomicrmw that
3022 is being released.
3023
Tony Tye6baa6d22017-10-18 22:16:55 +00003024 2. buffer/global/ds/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003025 fence release - singlethread *none* *none*
3026 - wavefront
3027 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3028
3029 - If OpenCL and
3030 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003031 not generic, omit.
3032 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003033 currently has no
3034 address space on
3035 the fence need to
3036 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003037 always generate. If
3038 fence had an
3039 address space then
3040 set to address
3041 space of OpenCL
3042 fence flag, or to
3043 generic if both
3044 local and global
3045 flags are
3046 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003047 - Must happen after
3048 any preceding
3049 local/generic
3050 load/load
3051 atomic/store/store
3052 atomic/atomicrmw.
3053 - Must happen before
3054 any following store
3055 atomic/atomicrmw
3056 with an equal or
3057 wider sync scope
3058 and memory ordering
3059 stronger than
3060 unordered (this is
3061 termed the
3062 fence-paired-atomic).
3063 - Ensures that all
3064 memory operations
3065 to local have
3066 completed before
3067 performing the
3068 following
3069 fence-paired-atomic.
3070
Tony Tye6baa6d22017-10-18 22:16:55 +00003071 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3072 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003073
3074 - If OpenCL and
3075 address space is
3076 not generic, omit
3077 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003078 - If OpenCL and
3079 address space is
3080 local, omit
3081 vmcnt(0).
3082 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003083 currently has no
3084 address space on
3085 the fence need to
3086 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003087 always generate. If
3088 fence had an
3089 address space then
3090 set to address
3091 space of OpenCL
3092 fence flag, or to
3093 generic if both
3094 local and global
3095 flags are
3096 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003097 - Could be split into
3098 separate s_waitcnt
3099 vmcnt(0) and
3100 s_waitcnt
3101 lgkmcnt(0) to allow
3102 them to be
3103 independently moved
3104 according to the
3105 following rules.
3106 - s_waitcnt vmcnt(0)
3107 must happen after
3108 any preceding
3109 global/generic
3110 load/store/load
3111 atomic/store
3112 atomic/atomicrmw.
3113 - s_waitcnt lgkmcnt(0)
3114 must happen after
3115 any preceding
3116 local/generic
3117 load/store/load
3118 atomic/store
3119 atomic/atomicrmw.
3120 - Must happen before
3121 any following store
3122 atomic/atomicrmw
3123 with an equal or
3124 wider sync scope
3125 and memory ordering
3126 stronger than
3127 unordered (this is
3128 termed the
3129 fence-paired-atomic).
3130 - Ensures that all
3131 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00003132 have
Tony Tyef16a45e2017-06-06 20:31:59 +00003133 completed before
3134 performing the
3135 following
3136 fence-paired-atomic.
3137
3138 **Acquire-Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003139 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003140 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3141 - wavefront - local
3142 - generic
3143 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3144
Tony Tye6baa6d22017-10-18 22:16:55 +00003145 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003146 - Must happen after
3147 any preceding
3148 local/generic
3149 load/store/load
3150 atomic/store
3151 atomic/atomicrmw.
3152 - Must happen before
3153 the following
3154 atomicrmw.
3155 - Ensures that all
3156 memory operations
3157 to local have
3158 completed before
3159 performing the
3160 atomicrmw that is
3161 being released.
3162
Tony Tye6baa6d22017-10-18 22:16:55 +00003163 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003164 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3165 2. s_waitcnt lgkmcnt(0)
3166
Tony Tye6baa6d22017-10-18 22:16:55 +00003167 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003168 - Must happen before
3169 any following
3170 global/generic
3171 load/load
3172 atomic/store/store
3173 atomic/atomicrmw.
3174 - Ensures any
3175 following global
3176 data read is no
3177 older than the load
3178 atomic value being
3179 acquired.
3180
3181 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3182
Tony Tye6baa6d22017-10-18 22:16:55 +00003183 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003184 - Must happen after
3185 any preceding
3186 local/generic
3187 load/store/load
3188 atomic/store
3189 atomic/atomicrmw.
3190 - Must happen before
3191 the following
3192 atomicrmw.
3193 - Ensures that all
3194 memory operations
3195 to local have
3196 completed before
3197 performing the
3198 atomicrmw that is
3199 being released.
3200
3201 2. flat_atomic
3202 3. s_waitcnt lgkmcnt(0)
3203
Tony Tye6baa6d22017-10-18 22:16:55 +00003204 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003205 - Must happen before
3206 any following
3207 global/generic
3208 load/load
3209 atomic/store/store
3210 atomic/atomicrmw.
3211 - Ensures any
3212 following global
3213 data read is no
3214 older than the load
3215 atomic value being
3216 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00003217
3218 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3219 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003220
3221 - If OpenCL, omit
3222 lgkmcnt(0).
3223 - Could be split into
3224 separate s_waitcnt
3225 vmcnt(0) and
3226 s_waitcnt
3227 lgkmcnt(0) to allow
3228 them to be
3229 independently moved
3230 according to the
3231 following rules.
3232 - s_waitcnt vmcnt(0)
3233 must happen after
3234 any preceding
3235 global/generic
3236 load/store/load
3237 atomic/store
3238 atomic/atomicrmw.
3239 - s_waitcnt lgkmcnt(0)
3240 must happen after
3241 any preceding
3242 local/generic
3243 load/store/load
3244 atomic/store
3245 atomic/atomicrmw.
3246 - Must happen before
3247 the following
3248 atomicrmw.
3249 - Ensures that all
3250 memory operations
3251 to global have
3252 completed before
3253 performing the
3254 atomicrmw that is
3255 being released.
3256
Tony Tye6baa6d22017-10-18 22:16:55 +00003257 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003258 3. s_waitcnt vmcnt(0)
3259
3260 - Must happen before
3261 following
3262 buffer_wbinvl1_vol.
3263 - Ensures the
3264 atomicrmw has
3265 completed before
3266 invalidating the
3267 cache.
3268
3269 4. buffer_wbinvl1_vol
3270
3271 - Must happen before
3272 any following
3273 global/generic
3274 load/load
3275 atomic/atomicrmw.
3276 - Ensures that
3277 following loads
3278 will not see stale
3279 global data.
3280
Tony Tye6baa6d22017-10-18 22:16:55 +00003281 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3282 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003283
3284 - If OpenCL, omit
3285 lgkmcnt(0).
3286 - Could be split into
3287 separate s_waitcnt
3288 vmcnt(0) and
3289 s_waitcnt
3290 lgkmcnt(0) to allow
3291 them to be
3292 independently moved
3293 according to the
3294 following rules.
3295 - s_waitcnt vmcnt(0)
3296 must happen after
3297 any preceding
3298 global/generic
3299 load/store/load
3300 atomic/store
3301 atomic/atomicrmw.
3302 - s_waitcnt lgkmcnt(0)
3303 must happen after
3304 any preceding
3305 local/generic
3306 load/store/load
3307 atomic/store
3308 atomic/atomicrmw.
3309 - Must happen before
3310 the following
3311 atomicrmw.
3312 - Ensures that all
3313 memory operations
3314 to global have
3315 completed before
3316 performing the
3317 atomicrmw that is
3318 being released.
3319
3320 2. flat_atomic
3321 3. s_waitcnt vmcnt(0) &
3322 lgkmcnt(0)
3323
3324 - If OpenCL, omit
3325 lgkmcnt(0).
3326 - Must happen before
3327 following
3328 buffer_wbinvl1_vol.
3329 - Ensures the
3330 atomicrmw has
3331 completed before
3332 invalidating the
3333 cache.
3334
3335 4. buffer_wbinvl1_vol
3336
3337 - Must happen before
3338 any following
3339 global/generic
3340 load/load
3341 atomic/atomicrmw.
3342 - Ensures that
3343 following loads
3344 will not see stale
3345 global data.
3346
3347 fence acq_rel - singlethread *none* *none*
3348 - wavefront
3349 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3350
3351 - If OpenCL and
3352 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003353 not generic, omit.
3354 - However,
Tony Tyef16a45e2017-06-06 20:31:59 +00003355 since LLVM
3356 currently has no
3357 address space on
3358 the fence need to
3359 conservatively
3360 always generate
3361 (see comment for
3362 previous fence).
3363 - Must happen after
3364 any preceding
3365 local/generic
3366 load/load
3367 atomic/store/store
3368 atomic/atomicrmw.
3369 - Must happen before
3370 any following
3371 global/generic
3372 load/load
3373 atomic/store/store
3374 atomic/atomicrmw.
3375 - Ensures that all
3376 memory operations
3377 to local have
3378 completed before
3379 performing any
3380 following global
3381 memory operations.
3382 - Ensures that the
3383 preceding
3384 local/generic load
3385 atomic/atomicrmw
3386 with an equal or
3387 wider sync scope
3388 and memory ordering
3389 stronger than
3390 unordered (this is
3391 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003392 acquire-fence-paired-atomic
3393 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003394 before following
3395 global memory
3396 operations. This
3397 satisfies the
3398 requirements of
3399 acquire.
3400 - Ensures that all
3401 previous memory
3402 operations have
3403 completed before a
3404 following
3405 local/generic store
3406 atomic/atomicrmw
3407 with an equal or
3408 wider sync scope
3409 and memory ordering
3410 stronger than
3411 unordered (this is
3412 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003413 release-fence-paired-atomic
3414 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003415 requirements of
3416 release.
3417
Tony Tye6baa6d22017-10-18 22:16:55 +00003418 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3419 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003420
3421 - If OpenCL and
3422 address space is
3423 not generic, omit
3424 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003425 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003426 currently has no
3427 address space on
3428 the fence need to
3429 conservatively
3430 always generate
3431 (see comment for
3432 previous fence).
3433 - Could be split into
3434 separate s_waitcnt
3435 vmcnt(0) and
3436 s_waitcnt
3437 lgkmcnt(0) to allow
3438 them to be
3439 independently moved
3440 according to the
3441 following rules.
3442 - s_waitcnt vmcnt(0)
3443 must happen after
3444 any preceding
3445 global/generic
3446 load/store/load
3447 atomic/store
3448 atomic/atomicrmw.
3449 - s_waitcnt lgkmcnt(0)
3450 must happen after
3451 any preceding
3452 local/generic
3453 load/store/load
3454 atomic/store
3455 atomic/atomicrmw.
3456 - Must happen before
3457 the following
3458 buffer_wbinvl1_vol.
3459 - Ensures that the
3460 preceding
3461 global/local/generic
3462 load
3463 atomic/atomicrmw
3464 with an equal or
3465 wider sync scope
3466 and memory ordering
3467 stronger than
3468 unordered (this is
3469 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003470 acquire-fence-paired-atomic
3471 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003472 before invalidating
3473 the cache. This
3474 satisfies the
3475 requirements of
3476 acquire.
3477 - Ensures that all
3478 previous memory
3479 operations have
3480 completed before a
3481 following
3482 global/local/generic
3483 store
3484 atomic/atomicrmw
3485 with an equal or
3486 wider sync scope
3487 and memory ordering
3488 stronger than
3489 unordered (this is
3490 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003491 release-fence-paired-atomic
3492 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003493 requirements of
3494 release.
3495
3496 2. buffer_wbinvl1_vol
3497
3498 - Must happen before
3499 any following
3500 global/generic
3501 load/load
3502 atomic/store/store
3503 atomic/atomicrmw.
3504 - Ensures that
3505 following loads
3506 will not see stale
3507 global data. This
3508 satisfies the
3509 requirements of
3510 acquire.
3511
3512 **Sequential Consistent Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003513 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003514 load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003515 - wavefront - local load atomic acquire,
3516 - generic except must generated
3517 all instructions even
3518 for OpenCL.*
3519 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3520 - generic
3521 - Must
3522 happen after
3523 preceding
3524 global/generic load
3525 atomic/store
3526 atomic/atomicrmw
3527 with memory
3528 ordering of seq_cst
3529 and with equal or
3530 wider sync scope.
3531 (Note that seq_cst
3532 fences have their
3533 own s_waitcnt
3534 lgkmcnt(0) and so do
3535 not need to be
3536 considered.)
3537 - Ensures any
3538 preceding
3539 sequential
3540 consistent local
3541 memory instructions
3542 have completed
3543 before executing
3544 this sequentially
3545 consistent
3546 instruction. This
3547 prevents reordering
3548 a seq_cst store
3549 followed by a
3550 seq_cst load. (Note
3551 that seq_cst is
3552 stronger than
3553 acquire/release as
3554 the reordering of
3555 load acquire
3556 followed by a store
3557 release is
3558 prevented by the
3559 waitcnt of
3560 the release, but
3561 there is nothing
3562 preventing a store
3563 release followed by
3564 load acquire from
3565 competing out of
3566 order.)
3567
3568 2. *Following
3569 instructions same as
3570 corresponding load
3571 atomic acquire,
3572 except must generated
3573 all instructions even
3574 for OpenCL.*
3575 load atomic seq_cst - workgroup - local *Same as corresponding
3576 load atomic acquire,
3577 except must generated
3578 all instructions even
3579 for OpenCL.*
3580 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3581 - system - generic vmcnt(0)
3582
3583 - Could be split into
3584 separate s_waitcnt
3585 vmcnt(0)
3586 and s_waitcnt
3587 lgkmcnt(0) to allow
3588 them to be
3589 independently moved
3590 according to the
3591 following rules.
3592 - waitcnt lgkmcnt(0)
3593 must happen after
3594 preceding
3595 global/generic load
3596 atomic/store
3597 atomic/atomicrmw
3598 with memory
3599 ordering of seq_cst
3600 and with equal or
3601 wider sync scope.
3602 (Note that seq_cst
3603 fences have their
3604 own s_waitcnt
3605 lgkmcnt(0) and so do
3606 not need to be
3607 considered.)
3608 - waitcnt vmcnt(0)
3609 must happen after
Tony Tyef16a45e2017-06-06 20:31:59 +00003610 preceding
3611 global/generic load
3612 atomic/store
3613 atomic/atomicrmw
3614 with memory
3615 ordering of seq_cst
3616 and with equal or
3617 wider sync scope.
3618 (Note that seq_cst
3619 fences have their
3620 own s_waitcnt
3621 vmcnt(0) and so do
3622 not need to be
3623 considered.)
3624 - Ensures any
3625 preceding
3626 sequential
3627 consistent global
3628 memory instructions
3629 have completed
3630 before executing
3631 this sequentially
3632 consistent
3633 instruction. This
3634 prevents reordering
3635 a seq_cst store
3636 followed by a
Tony Tye6baa6d22017-10-18 22:16:55 +00003637 seq_cst load. (Note
Tony Tyef16a45e2017-06-06 20:31:59 +00003638 that seq_cst is
3639 stronger than
3640 acquire/release as
3641 the reordering of
3642 load acquire
3643 followed by a store
3644 release is
3645 prevented by the
Tony Tye6baa6d22017-10-18 22:16:55 +00003646 waitcnt of
Tony Tyef16a45e2017-06-06 20:31:59 +00003647 the release, but
3648 there is nothing
3649 preventing a store
3650 release followed by
3651 load acquire from
3652 competing out of
3653 order.)
3654
3655 2. *Following
3656 instructions same as
3657 corresponding load
Tony Tye6baa6d22017-10-18 22:16:55 +00003658 atomic acquire,
3659 except must generated
3660 all instructions even
3661 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003662 store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003663 - wavefront - local store atomic release,
3664 - workgroup - generic except must generated
3665 all instructions even
3666 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003667 store atomic seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003668 - system - generic store atomic release,
3669 except must generated
3670 all instructions even
3671 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003672 atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003673 - wavefront - local atomicrmw acq_rel,
3674 - workgroup - generic except must generated
3675 all instructions even
3676 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003677 atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003678 - system - generic atomicrmw acq_rel,
3679 except must generated
3680 all instructions even
3681 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003682 fence seq_cst - singlethread *none* *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003683 - wavefront fence acq_rel,
3684 - workgroup except must generated
3685 - agent all instructions even
3686 - system for OpenCL.*
3687 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00003688
3689The memory order also adds the single thread optimization constrains defined in
3690table
3691:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3692
3693 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3694 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3695
3696 ============ ==============================================================
3697 LLVM Memory Optimization Constraints
3698 Ordering
3699 ============ ==============================================================
3700 unordered *none*
3701 monotonic *none*
3702 acquire - If a load atomic/atomicrmw then no following load/load
3703 atomic/store/ store atomic/atomicrmw/fence instruction can
3704 be moved before the acquire.
3705 - If a fence then same as load atomic, plus no preceding
3706 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003707 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003708 atomic/store/ store atomic/atomicrmw/fence instruction can
3709 be moved after the release.
3710 - If a fence then same as store atomic, plus no following
3711 associated fence-paired-atomic can be moved before the
3712 fence.
3713 acq_rel Same constraints as both acquire and release.
3714 seq_cst - If a load atomic then same constraints as acquire, plus no
3715 preceding sequentially consistent load atomic/store
3716 atomic/atomicrmw/fence instruction can be moved after the
3717 seq_cst.
3718 - If a store atomic then the same constraints as release, plus
3719 no following sequentially consistent load atomic/store
3720 atomic/atomicrmw/fence instruction can be moved before the
3721 seq_cst.
3722 - If an atomicrmw/fence then same constraints as acq_rel.
3723 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003724
Wei Ding16289cf2017-02-21 18:48:01 +00003725Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003726~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003727
Tony Tyef16a45e2017-06-06 20:31:59 +00003728For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3729(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3730the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003731
Tony Tyef16a45e2017-06-06 20:31:59 +00003732 .. table:: AMDGPU Trap Handler for AMDHSA OS
3733 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003734
Tony Tyef16a45e2017-06-06 20:31:59 +00003735 =================== =============== =============== =======================
3736 Usage Code Sequence Trap Handler Description
3737 Inputs
3738 =================== =============== =============== =======================
3739 reserved ``s_trap 0x00`` Reserved by hardware.
3740 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3741 ``queue_ptr`` ``debugtrap``
3742 ``VGPR0``: intrinsic (not
3743 ``arg`` implemented).
3744 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3745 ``queue_ptr`` terminated and its
3746 associated queue put
3747 into the error state.
3748 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3749 ``queue_ptr`` installed handled
3750 same as ``llvm.trap``.
3751 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3752 breakpoints.
3753 debugger ``s_trap 0x08`` Reserved for debugger.
3754 debugger ``s_trap 0xfe`` Reserved for debugger.
3755 debugger ``s_trap 0xff`` Reserved for debugger.
3756 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003757
Tony Tye46d35762017-08-15 20:47:41 +00003758Unspecified OS
3759--------------
3760
3761This section provides code conventions used when the target triple OS is
3762empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003763
3764Trap Handler ABI
3765~~~~~~~~~~~~~~~~
3766
3767For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3768not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3769instructions are handled as follows:
3770
3771 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3772 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3773
3774 =============== =============== ===========================================
3775 Usage Code Sequence Description
3776 =============== =============== ===========================================
3777 llvm.trap s_endpgm Causes wavefront to be terminated.
3778 llvm.debugtrap *none* Compiler warning given that there is no
3779 trap handler installed.
3780 =============== =============== ===========================================
3781
3782Source Languages
3783================
3784
3785.. _amdgpu-opencl:
3786
3787OpenCL
3788------
3789
3790When generating code for the OpenCL language the target triple environment
3791should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3792
3793When the language is OpenCL the following differences occur:
3794
37951. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
37962. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye46d35762017-08-15 20:47:41 +000037973. Additional metadata is generated
3798 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003799
3800.. TODO
3801 Specify what affect this has. Hidden arguments added. Additional metadata
3802 generated.
3803
3804.. _amdgpu-hcc:
3805
3806HCC
3807---
3808
3809When generating code for the OpenCL language the target triple environment
3810should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3811
3812When the language is OpenCL the following differences occur:
3813
38141. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3815
3816.. TODO
3817 Specify what affect this has.
Tom Stellard3ec09e62016-04-06 01:29:19 +00003818
Tom Stellard45bb48e2015-06-13 03:28:10 +00003819Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00003820---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00003821
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003822AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef59d0712017-11-10 20:51:43 +00003823It supports AMDGCN GFX6-GFX9.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003824
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00003825This section describes general syntax for instructions and operands.
3826
3827Instructions
3828~~~~~~~~~~~~
3829
3830.. toctree::
3831 :hidden:
3832
3833 AMDGPUAsmGFX7
3834 AMDGPUAsmGFX8
3835 AMDGPUAsmGFX9
3836 AMDGPUOperandSyntax
3837
3838An instruction has the following syntax:
3839
3840 *<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...*
3841
3842Note that operands are normally comma-separated while modifiers are space-separated.
3843
3844The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
3845
3846See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
3847:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
3848
3849Note that features under development are not included in this description.
3850
3851For more information about instructions, their semantics and supported combinations of
Tony Tyef16a45e2017-06-06 20:31:59 +00003852operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00003853[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003854
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003855Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00003856~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00003857
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003858The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003859
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003860* SGPR registers: s0, ... or s[0], ...
3861* VGPR registers: v0, ... or v[0], ...
3862* TTMP registers: ttmp0, ... or ttmp[0], ...
3863* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3864* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3865* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3866* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3867* Register index expressions: v[2*2], s[1-1:2-1]
3868* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00003869
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00003870Modifiers
3871~~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00003872
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00003873Detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003874
Tony Tyef16a45e2017-06-06 20:31:59 +00003875Instruction Examples
3876~~~~~~~~~~~~~~~~~~~~
3877
3878DS
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00003879++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003880
3881.. code-block:: nasm
3882
3883 ds_add_u32 v2, v4 offset:16
3884 ds_write_src2_b64 v2 offset0:4 offset1:8
3885 ds_cmpst_f32 v2, v4, v6
3886 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3887
3888
3889For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3890
Tony Tyef16a45e2017-06-06 20:31:59 +00003891FLAT
3892++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003893
3894.. code-block:: nasm
3895
3896 flat_load_dword v1, v[3:4]
3897 flat_store_dwordx3 v[3:4], v[5:7]
3898 flat_atomic_swap v1, v[3:4], v5 glc
3899 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3900 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3901
3902For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3903
Tony Tyef16a45e2017-06-06 20:31:59 +00003904MUBUF
3905+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003906
3907.. code-block:: nasm
3908
3909 buffer_load_dword v1, off, s[4:7], s1
3910 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3911 buffer_store_format_xy v[1:2], off, s[4:7], s1
3912 buffer_wbinvl1
3913 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3914
3915For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3916
Tony Tyef16a45e2017-06-06 20:31:59 +00003917SMRD/SMEM
3918+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003919
3920.. code-block:: nasm
3921
3922 s_load_dword s1, s[2:3], 0xfc
3923 s_load_dwordx8 s[8:15], s[2:3], s4
3924 s_load_dwordx16 s[88:103], s[2:3], s4
3925 s_dcache_inv_vol
3926 s_memtime s[4:5]
3927
3928For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3929
Tony Tyef16a45e2017-06-06 20:31:59 +00003930SOP1
3931++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003932
3933.. code-block:: nasm
3934
3935 s_mov_b32 s1, s2
3936 s_mov_b64 s[0:1], 0x80000000
3937 s_cmov_b32 s1, 200
3938 s_wqm_b64 s[2:3], s[4:5]
3939 s_bcnt0_i32_b64 s1, s[2:3]
3940 s_swappc_b64 s[2:3], s[4:5]
3941 s_cbranch_join s[4:5]
3942
3943For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3944
Tony Tyef16a45e2017-06-06 20:31:59 +00003945SOP2
3946++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003947
3948.. code-block:: nasm
3949
3950 s_add_u32 s1, s2, s3
3951 s_and_b64 s[2:3], s[4:5], s[6:7]
3952 s_cselect_b32 s1, s2, s3
3953 s_andn2_b32 s2, s4, s6
3954 s_lshr_b64 s[2:3], s[4:5], s6
3955 s_ashr_i32 s2, s4, s6
3956 s_bfm_b64 s[2:3], s4, s6
3957 s_bfe_i64 s[2:3], s[4:5], s6
3958 s_cbranch_g_fork s[4:5], s[6:7]
3959
3960For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3961
Tony Tyef16a45e2017-06-06 20:31:59 +00003962SOPC
3963++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003964
3965.. code-block:: nasm
3966
3967 s_cmp_eq_i32 s1, s2
3968 s_bitcmp1_b32 s1, s2
3969 s_bitcmp0_b64 s[2:3], s4
3970 s_setvskip s3, s5
3971
3972For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3973
Tony Tyef16a45e2017-06-06 20:31:59 +00003974SOPP
3975++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003976
3977.. code-block:: nasm
3978
3979 s_barrier
3980 s_nop 2
3981 s_endpgm
3982 s_waitcnt 0 ; Wait for all counters to be 0
3983 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3984 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3985 s_sethalt 9
3986 s_sleep 10
3987 s_sendmsg 0x1
3988 s_sendmsg sendmsg(MSG_INTERRUPT)
3989 s_trap 1
3990
3991For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3992
3993Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00003994of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00003995range or acceptable values.
3996
Tony Tyef16a45e2017-06-06 20:31:59 +00003997VALU
3998++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00003999
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004000For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
4001the assembler will automatically use optimal encoding based on its operands.
4002To force specific encoding, one can add a suffix to the opcode of the instruction:
4003
4004* _e32 for 32-bit VOP1/VOP2/VOPC
4005* _e64 for 64-bit VOP3
4006* _dpp for VOP_DPP
4007* _sdwa for VOP_SDWA
4008
4009VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004010
4011.. code-block:: nasm
4012
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004013 v_mov_b32 v1, v2
4014 v_mov_b32_e32 v1, v2
4015 v_nop
4016 v_cvt_f64_i32_e32 v[1:2], v2
4017 v_floor_f32_e32 v1, v2
4018 v_bfrev_b32_e32 v1, v2
4019 v_add_f32_e32 v1, v2, v3
4020 v_mul_i32_i24_e64 v1, v2, 3
4021 v_mul_i32_i24_e32 v1, -3, v3
4022 v_mul_i32_i24_e32 v1, -100, v3
4023 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4024 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00004025
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004026VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004027
4028.. code-block:: nasm
4029
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004030 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4031 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4032 v_mov_b32 v0, v0 wave_shl:1
4033 v_mov_b32 v0, v0 row_mirror
4034 v_mov_b32 v0, v0 row_bcast:31
4035 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4036 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4037 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00004038
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004039VOP_SDWA examples:
4040
4041.. code-block:: nasm
4042
4043 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4044 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4045 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4046 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4047 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4048
4049For full list of supported instructions, refer to "Vector ALU instructions".
4050
4051HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00004052~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004053
4054AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4055one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00004056
4057.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00004058+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00004059
4060*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004061object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004062
4063.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00004064+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4065
Tom Stellard347ac792015-06-26 21:15:07 +00004066
4067*major*, *minor*, and *stepping* are all integers that describe the instruction
4068set architecture (ISA) version of the assembly program.
4069
4070*vendor* and *arch* are quoted strings. *vendor* should always be equal to
4071"AMD" and *arch* should always be equal to "AMDGPU".
4072
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004073By default, the assembler will derive the ISA version, *vendor*, and *arch*
4074from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004075
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004076.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00004077+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004078
4079This directives specifies that the symbol with given name is a kernel entry point
4080(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00004081
4082.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00004083++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00004084
4085This directive marks the beginning of a list of key / value pairs that are used
4086to specify the amd_kernel_code_t object that will be emitted by the assembler.
4087The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4088any amd_kernel_code_t values that are unspecified a default value will be
4089used. The default value for all keys is 0, with the following exceptions:
4090
4091- *kernel_code_version_major* defaults to 1.
4092- *machine_kind* defaults to 1.
4093- *machine_version_major*, *machine_version_minor*, and
4094 *machine_version_stepping* are derived from the value of the -mcpu option
4095 that is passed to the assembler.
4096- *kernel_code_entry_byte_offset* defaults to 256.
4097- *wavefront_size* defaults to 6.
4098- *kernarg_segment_alignment*, *group_segment_alignment*, and
Tony Tye6baa6d22017-10-18 22:16:55 +00004099 *private_segment_alignment* default to 4. Note that alignments are specified
Tom Stellardff7416b2015-06-26 21:58:31 +00004100 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4101
4102The *.amd_kernel_code_t* directive must be placed immediately after the
4103function label and before any instructions.
4104
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004105For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4106comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00004107
4108Here is an example of a minimal amd_kernel_code_t specification:
4109
Aaron Ballman887ad0e2016-07-19 17:46:55 +00004110.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00004111
4112 .hsa_code_object_version 1,0
4113 .hsa_code_object_isa
4114
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004115 .hsatext
4116 .globl hello_world
4117 .p2align 8
4118 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00004119
4120 hello_world:
4121
4122 .amd_kernel_code_t
4123 enable_sgpr_kernarg_segment_ptr = 1
4124 is_ptr64 = 1
4125 compute_pgm_rsrc1_vgprs = 0
4126 compute_pgm_rsrc1_sgprs = 0
4127 compute_pgm_rsrc2_user_sgpr = 2
4128 kernarg_segment_byte_size = 8
4129 wavefront_sgpr_count = 2
4130 workitem_vgpr_count = 3
4131 .end_amd_kernel_code_t
4132
4133 s_load_dwordx2 s[0:1], s[0:1] 0x0
4134 v_mov_b32 v0, 3.14159
4135 s_waitcnt lgkmcnt(0)
4136 v_mov_b32 v1, s0
4137 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004138 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00004139 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00004140 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004141 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00004142
4143Additional Documentation
4144========================
4145
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004146.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4147.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4148.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4149.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4150.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4151.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4152.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4153.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004154.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4155.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4156.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4157.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4158.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00004159.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004160.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4161.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__