blob: 268abc2ff866b7fe550948876f2b4a59a33943dc [file] [log] [blame]
Tony Tyef16a45e2017-06-06 20:31:59 +00001=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
26 .. table:: AMDGPU Target Triples
27 :name: amdgpu-target-triples-table
28
29 ============ ======== ========= ===========
30 Architecture Vendor OS Environment
31 ============ ======== ========= ===========
32 r600 amd <empty> <empty>
33 amdgcn amd <empty> <empty>
34 amdgcn amd amdhsa <empty>
35 amdgcn amd amdhsa opencl
36 amdgcn amd amdhsa amdgizcl
37 amdgcn amd amdhsa amdgiz
38 amdgcn amd amdhsa hcc
39 ============ ======== ========= ===========
40
41``r600-amd--``
42 Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
43 the MESA runtime.
44
45``amdgcn-amd--``
Tony Tye46d35762017-08-15 20:47:41 +000046 Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
Tony Tyef16a45e2017-06-06 20:31:59 +000047 the MESA runtime.
48
49``amdgcn-amd-amdhsa-``
50 Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
51 compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
52
53``amdgcn-amd-amdhsa-opencl``
54 Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
55 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
56 :ref:`amdgpu-opencl`.
57
58``amdgcn-amd-amdhsa-amdgizcl``
59 Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
60 is used (see :ref:`amdgpu-address-spaces`).
61
62``amdgcn-amd-amdhsa-amdgiz``
63 Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
64 used (see :ref:`amdgpu-address-spaces`).
65
66``amdgcn-amd-amdhsa-hcc``
67 Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
68 executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
69 :ref:`amdgpu-hcc`.
70
71.. _amdgpu-processors:
72
73Processors
74----------
75
76Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
77names from both the *Processor* and *Alternative Processor* can be used.
78
79 .. table:: AMDGPU Processors
80 :name: amdgpu-processors-table
81
82 ========== =========== ============ ===== ======= ==================
83 Processor Alternative Target dGPU/ Runtime Example
84 Processor Triple APU Support Products
85 Architecture
86 ========== =========== ============ ===== ======= ==================
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000087 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
Tony Tyef16a45e2017-06-06 20:31:59 +000088 --------------------------------------------------------------------
89 r600 r600 dGPU
90 r630 r600 dGPU
91 rs880 r600 dGPU
92 rv670 r600 dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000093 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
Tony Tyef16a45e2017-06-06 20:31:59 +000094 --------------------------------------------------------------------
95 rv710 r600 dGPU
96 rv730 r600 dGPU
97 rv770 r600 dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000098 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
Tony Tyef16a45e2017-06-06 20:31:59 +000099 --------------------------------------------------------------------
100 cedar r600 dGPU
101 redwood r600 dGPU
102 sumo r600 dGPU
103 juniper r600 dGPU
104 cypress r600 dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000105 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
Tony Tyef16a45e2017-06-06 20:31:59 +0000106 --------------------------------------------------------------------
107 barts r600 dGPU
108 turks r600 dGPU
109 caicos r600 dGPU
110 cayman r600 dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000111 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
Tony Tyef16a45e2017-06-06 20:31:59 +0000112 --------------------------------------------------------------------
Konstantin Zhuravlyov6cbcb272017-08-08 04:28:31 +0000113 gfx600 - tahiti amdgcn dGPU
Tony Tyef16a45e2017-06-06 20:31:59 +0000114 gfx601 - pitcairn amdgcn dGPU
115 - verde
116 - oland
117 - hainan
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000118 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
Tony Tyef16a45e2017-06-06 20:31:59 +0000119 --------------------------------------------------------------------
120 gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
121 - Radeon HD 8770
122 - R7 260
123 - R7 260X
124 \ - kaveri amdgcn APU - A6-7000
125 - A6 Pro-7050B
126 - A8-7100
127 - A8 Pro-7150B
128 - A10-7300
129 - A10 Pro-7350B
130 - FX-7500
131 - A8-7200P
132 - A10-7400P
133 - FX-7600P
134 gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
135 - FirePro W9100
136 - FirePro S9150
137 - FirePro S9170
Tony Tye1fd77b22017-06-08 01:47:25 +0000138 gfx702 dGPU ROCm - Radeon R9 290
Tony Tyef16a45e2017-06-06 20:31:59 +0000139 - Radeon R9 290x
140 - Radeon R390
Tony Tye1fd77b22017-06-08 01:47:25 +0000141 - Radeon R390x
142 gfx703 - kabini amdgcn APU - E1-2100
Tony Tyef16a45e2017-06-06 20:31:59 +0000143 - mullins - E1-2200
144 - E1-2500
145 - E2-3000
146 - E2-3800
147 - A4-5000
148 - A4-5100
149 - A6-5200
150 - A4 Pro-3340B
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
Tony Tyef16a45e2017-06-06 20:31:59 +0000152 --------------------------------------------------------------------
153 gfx800 - iceland amdgcn dGPU - FirePro S7150
154 - FirePro S7100
155 - FirePro W7100
156 - Radeon R285
157 - Radeon R9 380
158 - Radeon R9 385
159 - Mobile FirePro
160 M7170
161 gfx801 - carrizo amdgcn APU - A6-8500P
162 - Pro A6-8500B
163 - A8-8600P
164 - Pro A8-8600B
165 - FX-8800P
166 - Pro A12-8800B
167 \ amdgcn APU ROCm - A10-8700P
168 - Pro A10-8700B
169 - A10-8780P
170 \ amdgcn APU - A10-9600P
171 - A10-9630P
172 - A12-9700P
173 - A12-9730P
174 - FX-9800P
175 - FX-9830P
176 \ amdgcn APU - E2-9010
177 - A6-9210
178 - A9-9410
179 gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
180 gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
181 - Radeon R9 Fury
182 - Radeon R9 FuryX
183 - Radeon Pro Duo
184 - FirePro S9300x2
Tony Tye46d35762017-08-15 20:47:41 +0000185 - Radeon Instinct MI8
Tony Tyef16a45e2017-06-06 20:31:59 +0000186 \ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
187 - Radeon RX 480
Tony Tye46d35762017-08-15 20:47:41 +0000188 - Radeon Instinct MI6
Tony Tyef16a45e2017-06-06 20:31:59 +0000189 \ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
190 gfx804 amdgcn dGPU Same as gfx803
191 gfx810 - stoney amdgcn APU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000192 **GCN GFX9** [AMD-GCN-GFX9]_
Tony Tyef16a45e2017-06-06 20:31:59 +0000193 --------------------------------------------------------------------
Tony Tye46d35762017-08-15 20:47:41 +0000194 gfx900 amdgcn dGPU - Radeon Vega
195 Frontier Edition
196 - Radeon RX Vega 56
197 - Radeon RX Vega 64
198 - Radeon RX Vega 64
199 Liquid
200 - Radeon Instinct MI25
Tony Tyef16a45e2017-06-06 20:31:59 +0000201 gfx901 amdgcn dGPU ROCm Same as gfx900
202 except XNACK is
203 enabled
204 gfx902 amdgcn APU *TBA*
205
206 .. TODO
207 Add product
208 names.
209 gfx903 amdgcn APU Same as gfx902
210 except XNACK is
211 enabled
212 ========== =========== ============ ===== ======= ==================
213
214.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000215
216Address Spaces
217--------------
218
Tony Tyef16a45e2017-06-06 20:31:59 +0000219The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000220
Tony Tyef16a45e2017-06-06 20:31:59 +0000221The memory space names used in the table, aside from the region memory space, is
222from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000223
Tony Tyef16a45e2017-06-06 20:31:59 +0000224LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000225
Tony Tyef16a45e2017-06-06 20:31:59 +0000226 .. table:: Address Space Mapping
227 :name: amdgpu-address-space-mapping-table
228
229 ================== ================= ================= ================= =================
230 LLVM Address Space Memory Space
231 ------------------ -----------------------------------------------------------------------
232 \ Current Default amdgiz/amdgizcl hcc Future Default
233 ================== ================= ================= ================= =================
234 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
235 1 Global Global Global Global
236 2 Constant Constant Constant Region (GDS)
237 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
238 4 Generic (Flat) Region (GDS) Region (GDS) Constant
239 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
240 ================== ================= ================= ================= =================
241
242Current Default
243 This is the current default address space mapping used for all languages
244 except hcc. This will shortly be deprecated.
245
246amdgiz/amdgizcl
247 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
248 is specified as the target triple environment value.
249
250hcc
251 This is the current address space mapping used when ``hcc`` is specified as
252 the target triple environment value.This will shortly be deprecated.
253
254Future Default
255 This will shortly be the only address space mapping for all languages using
256 AMDGPU backend.
257
258.. _amdgpu-memory-scopes:
259
260Memory Scopes
261-------------
262
263This section provides LLVM memory synchronization scopes supported by the AMDGPU
264backend memory model when the target triple OS is ``amdhsa`` (see
265:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
266
267The memory model supported is based on the HSA memory model [HSA]_ which is
268based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
269relation is transitive over the synchonizes-with relation independent of scope,
270and synchonizes-with allows the memory scope instances to be inclusive (see
271table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
272
273This is different to the OpenCL [OpenCL]_ memory model which does not have scope
274inclusion and requires the memory scopes to exactly match. However, this
275is conservatively correct for OpenCL.
276
277 .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
278 :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
279
280 ================ ==========================================================
281 LLVM Sync Scope Description
282 ================ ==========================================================
283 *none* The default: ``system``.
284
285 Synchronizes with, and participates in modification and
286 seq_cst total orderings with, other operations (except
287 image operations) for all address spaces (except private,
288 or generic that accesses private) provided the other
289 operation's sync scope is:
290
291 - ``system``.
292 - ``agent`` and executed by a thread on the same agent.
293 - ``workgroup`` and executed by a thread in the same
294 workgroup.
295 - ``wavefront`` and executed by a thread in the same
296 wavefront.
297
298 ``agent`` Synchronizes with, and participates in modification and
299 seq_cst total orderings with, other operations (except
300 image operations) for all address spaces (except private,
301 or generic that accesses private) provided the other
302 operation's sync scope is:
303
304 - ``system`` or ``agent`` and executed by a thread on the
305 same agent.
306 - ``workgroup`` and executed by a thread in the same
307 workgroup.
308 - ``wavefront`` and executed by a thread in the same
309 wavefront.
310
311 ``workgroup`` Synchronizes with, and participates in modification and
312 seq_cst total orderings with, other operations (except
313 image operations) for all address spaces (except private,
314 or generic that accesses private) provided the other
315 operation's sync scope is:
316
317 - ``system``, ``agent`` or ``workgroup`` and executed by a
318 thread in the same workgroup.
319 - ``wavefront`` and executed by a thread in the same
320 wavefront.
321
322 ``wavefront`` Synchronizes with, and participates in modification and
323 seq_cst total orderings with, other operations (except
324 image operations) for all address spaces (except private,
325 or generic that accesses private) provided the other
326 operation's sync scope is:
327
328 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
329 and executed by a thread in the same wavefront.
330
331 ``singlethread`` Only synchronizes with, and participates in modification
332 and seq_cst total orderings with, other operations (except
333 image operations) running in the same thread for all
334 address spaces (for example, in signal handlers).
335 ================ ==========================================================
336
337AMDGPU Intrinsics
338-----------------
339
340The AMDGPU backend implements the following intrinsics.
341
342*This section is WIP.*
343
344.. TODO
345 List AMDGPU intrinsics
346
347Code Object
348===========
349
350The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
351can be linked by ``lld`` to produce a standard ELF shared code object which can
352be loaded and executed on an AMDGPU target.
353
354Header
355------
356
357The AMDGPU backend uses the following ELF header:
358
359 .. table:: AMDGPU ELF Header
360 :name: amdgpu-elf-header-table
361
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000362 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000363 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000364 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000365 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
366 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Konstantin Zhuravlyov36963522017-10-03 21:18:03 +0000367 ``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``,
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000368 ``ELFOSABI_AMDGPU_PAL`` or
369 ``ELFOSABI_AMDGPU_MESA3D``
Konstantin Zhuravlyov36963522017-10-03 21:18:03 +0000370 ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``,
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000371 ``ELFABIVERSION_AMDGPU_PAL`` or
372 ``ELFABIVERSION_AMDGPU_MESA3D``
Tony Tyef16a45e2017-06-06 20:31:59 +0000373 ``e_type`` ``ET_REL`` or ``ET_DYN``
374 ``e_machine`` ``EM_AMDGPU``
375 ``e_entry`` 0
376 ``e_flags`` 0
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000377 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000378
379..
380
381 .. table:: AMDGPU ELF Header Enumeration Values
382 :name: amdgpu-elf-header-enumeration-values-table
383
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000384 =============================== =====
385 Name Value
386 =============================== =====
387 ``EM_AMDGPU`` 224
388 ``ELFOSABI_AMDGPU_HSA`` 64
389 ``ELFOSABI_AMDGPU_PAL`` 65
390 ``ELFOSABI_AMDGPU_MESA3D`` 66
391 ``ELFABIVERSION_AMDGPU_HSA`` 1
392 ``ELFABIVERSION_AMDGPU_PAL`` 0
393 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
394 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000395
396``e_ident[EI_CLASS]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000397 The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64
398 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000399
400``e_ident[EI_DATA]``
401 All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
402
403``e_ident[EI_OSABI]``
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000404 One of the following AMD GPU architecture specific OS ABIs:
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000405
406 * ``ELFOSABI_AMDGPU_HSA`` is used to specify that the code object conforms to
407 the AMD HSA runtime ABI [HSA]_.
408
409 * ``ELFOSABI_AMDGPU_PAL`` is used to specify that the code object conforms to
410 the AMD PAL runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000411
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000412 * ``ELFOSABI_AMDGPU_MESA3D`` is used to specify that the code object conforms
413 to the AMD MESA runtime ABI.
414
Tony Tyef16a45e2017-06-06 20:31:59 +0000415``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000416 The ABI version of the AMD GPU architecture specific OS ABI to which the code
417 object conforms:
418
419 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
420 runtime ABI.
421
422 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
423 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000424
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000425 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
426 runtime ABI.
427
Tony Tyef16a45e2017-06-06 20:31:59 +0000428``e_type``
429 Can be one of the following values:
430
431
432 ``ET_REL``
433 The type produced by the AMD GPU backend compiler as it is relocatable code
434 object.
435
436 ``ET_DYN``
437 The type produced by the linker as it is a shared code object.
438
439 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
440
441``e_machine``
442 The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
443 architecture family. The specific member is specified in the
444 ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
445 :ref:`amdgpu-note-records`).
446
447``e_entry``
448 The entry point is 0 as the entry points for individual kernels must be
449 selected in order to invoke them through AQL packets.
450
451``e_flags``
452 The value is 0 as no flags are used.
453
454Sections
455--------
456
457An AMDGPU target ELF code object has the standard ELF sections which include:
458
459 .. table:: AMDGPU ELF Sections
460 :name: amdgpu-elf-sections-table
461
462 ================== ================ =================================
463 Name Type Attributes
464 ================== ================ =================================
465 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
466 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
467 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
468 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
469 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
470 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
471 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
472 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
473 ``.note`` ``SHT_NOTE`` *none*
474 ``.rela``\ *name* ``SHT_RELA`` *none*
475 ``.rela.dyn`` ``SHT_RELA`` *none*
476 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
477 ``.shstrtab`` ``SHT_STRTAB`` *none*
478 ``.strtab`` ``SHT_STRTAB`` *none*
479 ``.symtab`` ``SHT_SYMTAB`` *none*
480 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
481 ================== ================ =================================
482
483These sections have their standard meanings (see [ELF]_) and are only generated
484if needed.
485
486``.debug``\ *\**
487 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
488 DWARF produced by the AMDGPU backend.
489
Tony Tye46d35762017-08-15 20:47:41 +0000490``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000491 The standard sections used by a dynamic loader.
492
493``.note``
494 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
495 backend.
496
497``.rela``\ *name*, ``.rela.dyn``
498 For relocatable code objects, *name* is the name of the section that the
499 relocation records apply. For example, ``.rela.text`` is the section name for
500 relocation records associated with the ``.text`` section.
501
502 For linked shared code objects, ``.rela.dyn`` contains all the relocation
503 records from each of the relocatable code object's ``.rela``\ *name* sections.
504
505 See :ref:`amdgpu-relocation-records` for the relocation records supported by
506 the AMDGPU backend.
507
508``.text``
509 The executable machine code for the kernels and functions they call. Generated
510 as position independent code. See :ref:`amdgpu-code-conventions` for
511 information on conventions used in the isa generation.
512
513.. _amdgpu-note-records:
514
515Note Records
516------------
517
518As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
519the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
520minimal zero byte padding must be generated to ensure the ``desc`` field size is
521a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
522be at least 4 to indicate at least 8 byte alignment.
523
524The AMDGPU backend code object uses the following ELF note records in the
525``.note`` section. The *Description* column specifies the layout of the note
526record’s ``desc`` field. All fields are consecutive bytes. Note records with
527variable size strings have a corresponding ``*_size`` field that specifies the
528number of bytes, including the terminating null character, in the string. The
529string(s) come immediately after the preceding fields.
530
531Additional note records can be present.
532
533 .. table:: AMDGPU ELF Note Records
534 :name: amdgpu-elf-note-records-table
535
Tony Tye46d35762017-08-15 20:47:41 +0000536 ===== ============================== ======================================
537 Name Type Description
538 ===== ============================== ======================================
539 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
540 "AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
541 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000542
543..
544
545 .. table:: AMDGPU ELF Note Record Enumeration Values
546 :name: amdgpu-elf-note-record-enumeration-values-table
547
Tony Tye46d35762017-08-15 20:47:41 +0000548 ============================== =====
549 Name Value
550 ============================== =====
551 *reserved* 0-9
552 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
553 ``NT_AMD_AMDGPU_ISA`` 11
554 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000555
556``NT_AMD_AMDGPU_ISA``
557 Specifies the instruction set architecture used by the machine code contained
558 in the code object.
559
560 This note record is required for code objects containing machine code for
561 processors matching the ``amdgcn`` architecture in table
562 :ref:`amdgpu-processors`.
563
564 The null terminated string has the following syntax:
565
566 *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
567
568 where:
569
570 *architecture*
571 The architecture from table :ref:`amdgpu-target-triples-table`.
572
573 This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
574 :ref:`amdgpu-target-triples`).
575
576 *vendor*
577 The vendor from table :ref:`amdgpu-target-triples-table`.
578
579 For the AMDGPU backend this is always ``amd``.
580
581 *os*
582 The OS from table :ref:`amdgpu-target-triples-table`.
583
584 *environment*
585 An environment from table :ref:`amdgpu-target-triples-table`, or blank if
586 the environment has no affect on the execution of the code object.
587
588 For the AMDGPU backend this is currently always blank.
589 *processor*
590 The processor from table :ref:`amdgpu-processors-table`.
591
592 For example:
593
594 ``amdgcn-amd-amdhsa--gfx901``
595
Tony Tye46d35762017-08-15 20:47:41 +0000596``NT_AMD_AMDGPU_HSA_METADATA``
597 Specifies extensible metadata associated with the code objects executed on HSA
598 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
599 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
600 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
601 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000602
Tony Tye46d35762017-08-15 20:47:41 +0000603.. _amdgpu-symbols:
604
605Symbols
606-------
607
608Symbols include the following:
609
610 .. table:: AMDGPU ELF Symbols
611 :name: amdgpu-elf-symbols-table
612
613 ===================== ============== ============= ==================
614 Name Type Section Description
615 ===================== ============== ============= ==================
616 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
617 - ``.rodata``
618 - ``.bss``
619 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
620 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
621 ===================== ============== ============= ==================
622
623Global variable
624 Global variables both used and defined by the compilation unit.
625
626 If the symbol is defined in the compilation unit then it is allocated in the
627 appropriate section according to if it has initialized data or is readonly.
628
629 If the symbol is external then its section is ``STN_UNDEF`` and the loader
630 will resolve relocations using the definition provided by another code object
631 or explicitly defined by the runtime.
632
633 All global symbols, whether defined in the compilation unit or external, are
634 accessed by the machine code indirectly through a GOT table entry. This
635 allows them to be preemptable. The GOT table is only supported when the target
636 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000637
638 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000639 Add description of linked shared object symbols. Seems undefined symbols
640 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000641
Tony Tye46d35762017-08-15 20:47:41 +0000642Kernel descriptor
643 Every HSA kernel has an associated kernel descriptor. It is the address of the
644 kernel descriptor that is used in the AQL dispatch packet used to invoke the
645 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
646 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
647
648Kernel entry point
649 Every HSA kernel also has a symbol for its machine code entry point.
650
651.. _amdgpu-relocation-records:
652
653Relocation Records
654------------------
655
656AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
657relocatable fields are:
658
659``word32``
660 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
661 alignment. These values use the same byte order as other word values in the
662 AMD GPU architecture.
663
664``word64``
665 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
666 alignment. These values use the same byte order as other word values in the
667 AMD GPU architecture.
668
669Following notations are used for specifying relocation calculations:
670
671**A**
672 Represents the addend used to compute the value of the relocatable field.
673
674**G**
675 Represents the offset into the global offset table at which the relocation
676 entry’s symbol will reside during execution.
677
678**GOT**
679 Represents the address of the global offset table.
680
681**P**
682 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
683 of the storage unit being relocated (computed using ``r_offset``).
684
685**S**
686 Represents the value of the symbol whose index resides in the relocation
Tony Tyed2884302017-10-16 20:44:29 +0000687 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
688
689**B**
690 Represents the base address of a loaded executable or shared object which is
691 the difference between the ELF address and the actual load address. Relocations
692 using this are only valid in executable or shared objects.
Tony Tye46d35762017-08-15 20:47:41 +0000693
694The following relocation types are supported:
695
696 .. table:: AMDGPU ELF Relocation Records
697 :name: amdgpu-elf-relocation-records-table
698
699 ========================== ===== ========== ==============================
700 Relocation Type Value Field Calculation
701 ========================== ===== ========== ==============================
702 ``R_AMDGPU_NONE`` 0 *none* *none*
703 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
704 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
705 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
706 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
707 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
708 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
709 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
710 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
711 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
712 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
713 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
Tony Tyed2884302017-10-16 20:44:29 +0000714 *reserved* 12
715 ``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
Tony Tye46d35762017-08-15 20:47:41 +0000716 ========================== ===== ========== ==============================
717
718.. _amdgpu-dwarf:
719
720DWARF
721-----
722
723Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
724information that maps the code object executable code and data to the source
725language constructs. It can be used by tools such as debuggers and profilers.
726
727Address Space Mapping
728~~~~~~~~~~~~~~~~~~~~~
729
730The following address space mapping is used:
731
732 .. table:: AMDGPU DWARF Address Space Mapping
733 :name: amdgpu-dwarf-address-space-mapping-table
734
735 =================== =================
736 DWARF Address Space Memory Space
737 =================== =================
738 1 Private (Scratch)
739 2 Local (group/LDS)
740 *omitted* Global
741 *omitted* Constant
742 *omitted* Generic (Flat)
743 *not supported* Region (GDS)
744 =================== =================
745
746See :ref:`amdgpu-address-spaces` for information on the memory space terminology
747used in the table.
748
749An ``address_class`` attribute is generated on pointer type DIEs to specify the
750DWARF address space of the value of the pointer when it is in the *private* or
751*local* address space. Otherwise the attribute is omitted.
752
753An ``XDEREF`` operation is generated in location list expressions for variables
754that are allocated in the *private* and *local* address space. Otherwise no
755``XDREF`` is omitted.
756
757Register Mapping
758~~~~~~~~~~~~~~~~
759
760*This section is WIP.*
761
762.. TODO
763 Define DWARF register enumeration.
764
765 If want to present a wavefront state then should expose vector registers as
766 64 wide (rather than per work-item view that LLVM uses). Either as separate
767 registers, or a 64x4 byte single register. In either case use a new LANE op
768 (akin to XDREF) to select the current lane usage in a location
769 expression. This would also allow scalar register spilling to vector register
770 lanes to be expressed (currently no debug information is being generated for
771 spilling). If choose a wide single register approach then use LANE in
772 conjunction with PIECE operation to select the dword part of the register for
773 the current lane. If the separate register approach then use LANE to select
774 the register.
775
776Source Text
777~~~~~~~~~~~
778
779*This section is WIP.*
780
781.. TODO
782 DWARF extension to include runtime generated source text.
783
784.. _amdgpu-code-conventions:
785
786Code Conventions
787================
788
789This section provides code conventions used for each supported target triple OS
790(see :ref:`amdgpu-target-triples`).
791
792AMDHSA
793------
794
795This section provides code conventions used when the target triple OS is
796``amdhsa`` (see :ref:`amdgpu-target-triples`).
797
798.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000799
800Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000801~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000802
Tony Tye46d35762017-08-15 20:47:41 +0000803The code object metadata specifies extensible metadata associated with the code
804objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
805[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
806(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
807``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
808information necessary to support the ROCM kernel queries. For example, the
809segment sizes needed in a dispatch packet. In addition, a high level language
810runtime may require other information to be included. For example, the AMD
811OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000812
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000813The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000814:doc:`YamlIO`).
815
Tony Tye46d35762017-08-15 20:47:41 +0000816.. TODO
817 Is the string null terminated? It probably should not if YAML allows it to
818 contain null characters, otherwise it should be.
819
Tony Tyef16a45e2017-06-06 20:31:59 +0000820The metadata is represented as a single YAML document comprised of the mapping
821defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
822referenced tables.
823
824For boolean values, the string values of ``false`` and ``true`` are used for
825false and true respectively.
826
827Additional information can be added to the mappings. To avoid conflicts, any
828non-AMD key names should be prefixed by "*vendor-name*.".
829
830 .. table:: AMDHSA Code Object Metadata Mapping
831 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
832
833 ========== ============== ========= =======================================
834 String Key Value Type Required? Description
835 ========== ============== ========= =======================================
836 "Version" sequence of Required - The first integer is the major
837 2 integers version. Currently 1.
838 - The second integer is the minor
839 version. Currently 0.
840 "Printf" sequence of Each string is encoded information
841 strings about a printf function call. The
842 encoded information is organized as
843 fields separated by colon (':'):
844
845 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
846
847 where:
848
849 ``ID``
850 A 32 bit integer as a unique id for
851 each printf function call
852
853 ``N``
854 A 32 bit integer equal to the number
855 of arguments of printf function call
856 minus 1
857
858 ``S[i]`` (where i = 0, 1, ... , N-1)
859 32 bit integers for the size in bytes
860 of the i-th FormatString argument of
861 the printf function call
862
863 FormatString
864 The format string passed to the
865 printf function call.
866 "Kernels" sequence of Required Sequence of the mappings for each
867 mapping kernel in the code object. See
868 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
869 for the definition of the mapping.
870 ========== ============== ========= =======================================
871
872..
873
874 .. table:: AMDHSA Code Object Kernel Metadata Mapping
875 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
876
877 ================= ============== ========= ================================
878 String Key Value Type Required? Description
879 ================= ============== ========= ================================
880 "Name" string Required Source name of the kernel.
881 "SymbolName" string Required Name of the kernel
882 descriptor ELF symbol.
883 "Language" string Source language of the kernel.
884 Values include:
885
886 - "OpenCL C"
887 - "OpenCL C++"
888 - "HCC"
889 - "OpenMP"
890
891 "LanguageVersion" sequence of - The first integer is the major
892 2 integers version.
893 - The second integer is the
894 minor version.
895 "Attrs" mapping Mapping of kernel attributes.
896 See
897 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
898 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +0000899 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +0000900 mapping kernel arguments. See
901 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
902 for the definition of the mapping.
903 "CodeProps" mapping Mapping of properties related to
904 the kernel code. See
905 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
906 for the mapping definition.
907 "DebugProps" mapping Mapping of properties related to
908 the kernel debugging. See
909 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
910 for the mapping definition.
911 ================= ============== ========= ================================
912
913..
914
915 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
916 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
917
918 =================== ============== ========= ==============================
919 String Key Value Type Required? Description
920 =================== ============== ========= ==============================
921 "ReqdWorkGroupSize" sequence of The dispatch work-group size
922 3 integers X, Y, Z must correspond to the
923 specified values.
924
925 Corresponds to the OpenCL
926 ``reqd_work_group_size``
927 attribute.
928 "WorkGroupSizeHint" sequence of The dispatch work-group size
929 3 integers X, Y, Z is likely to be the
930 specified values.
931
932 Corresponds to the OpenCL
933 ``work_group_size_hint``
934 attribute.
935 "VecTypeHint" string The name of a scalar or vector
936 type.
937
938 Corresponds to the OpenCL
939 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +0000940
941 "RuntimeHandle" string The external symbol name
942 associated with a kernel.
943 OpenCL runtime allocates a
944 global buffer for the symbol
945 and saves the kernel's address
946 to it, which is used for
947 device side enqueueing. Only
948 available for device side
949 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +0000950 =================== ============== ========= ==============================
951
952..
953
954 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
955 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
956
957 ================= ============== ========= ================================
958 String Key Value Type Required? Description
959 ================= ============== ========= ================================
960 "Name" string Kernel argument name.
961 "TypeName" string Kernel argument type name.
962 "Size" integer Required Kernel argument size in bytes.
963 "Align" integer Required Kernel argument alignment in
964 bytes. Must be a power of two.
965 "ValueKind" string Required Kernel argument kind that
966 specifies how to set up the
967 corresponding argument.
968 Values include:
969
970 "ByValue"
971 The argument is copied
972 directly into the kernarg.
973
974 "GlobalBuffer"
975 A global address space pointer
976 to the buffer data is passed
977 in the kernarg.
978
979 "DynamicSharedPointer"
980 A group address space pointer
981 to dynamically allocated LDS
982 is passed in the kernarg.
983
984 "Sampler"
985 A global address space
986 pointer to a S# is passed in
987 the kernarg.
988
989 "Image"
990 A global address space
991 pointer to a T# is passed in
992 the kernarg.
993
994 "Pipe"
995 A global address space pointer
996 to an OpenCL pipe is passed in
997 the kernarg.
998
999 "Queue"
1000 A global address space pointer
1001 to an OpenCL device enqueue
1002 queue is passed in the
1003 kernarg.
1004
1005 "HiddenGlobalOffsetX"
1006 The OpenCL grid dispatch
1007 global offset for the X
1008 dimension is passed in the
1009 kernarg.
1010
1011 "HiddenGlobalOffsetY"
1012 The OpenCL grid dispatch
1013 global offset for the Y
1014 dimension is passed in the
1015 kernarg.
1016
1017 "HiddenGlobalOffsetZ"
1018 The OpenCL grid dispatch
1019 global offset for the Z
1020 dimension is passed in the
1021 kernarg.
1022
1023 "HiddenNone"
1024 An argument that is not used
1025 by the kernel. Space needs to
1026 be left for it, but it does
1027 not need to be set up.
1028
1029 "HiddenPrintfBuffer"
1030 A global address space pointer
1031 to the runtime printf buffer
1032 is passed in kernarg.
1033
1034 "HiddenDefaultQueue"
1035 A global address space pointer
1036 to the OpenCL device enqueue
1037 queue that should be used by
1038 the kernel by default is
1039 passed in the kernarg.
1040
1041 "HiddenCompletionAction"
1042 *TBD*
1043
1044 .. TODO
1045 Add description.
1046
1047 "ValueType" string Required Kernel argument value type. Only
1048 present if "ValueKind" is
1049 "ByValue". For vector data
1050 types, the value is for the
1051 element type. Values include:
1052
1053 - "Struct"
1054 - "I8"
1055 - "U8"
1056 - "I16"
1057 - "U16"
1058 - "F16"
1059 - "I32"
1060 - "U32"
1061 - "F32"
1062 - "I64"
1063 - "U64"
1064 - "F64"
1065
1066 .. TODO
1067 How can it be determined if a
1068 vector type, and what size
1069 vector?
1070 "PointeeAlign" integer Alignment in bytes of pointee
1071 type for pointer type kernel
1072 argument. Must be a power
1073 of 2. Only present if
1074 "ValueKind" is
1075 "DynamicSharedPointer".
1076 "AddrSpaceQual" string Kernel argument address space
1077 qualifier. Only present if
1078 "ValueKind" is "GlobalBuffer" or
1079 "DynamicSharedPointer". Values
1080 are:
1081
1082 - "Private"
1083 - "Global"
1084 - "Constant"
1085 - "Local"
1086 - "Generic"
1087 - "Region"
1088
1089 .. TODO
1090 Is GlobalBuffer only Global
1091 or Constant? Is
1092 DynamicSharedPointer always
1093 Local? Can HCC allow Generic?
1094 How can Private or Region
1095 ever happen?
1096 "AccQual" string Kernel argument access
1097 qualifier. Only present if
1098 "ValueKind" is "Image" or
1099 "Pipe". Values
1100 are:
1101
1102 - "ReadOnly"
1103 - "WriteOnly"
1104 - "ReadWrite"
1105
1106 .. TODO
1107 Does this apply to
1108 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001109 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001110 performed by the kernel on the
1111 kernel argument. Only present if
1112 "ValueKind" is "GlobalBuffer",
1113 "Image", or "Pipe". This may be
1114 more restrictive than indicated
1115 by "AccQual" to reflect what the
1116 kernel actual does. If not
1117 present then the runtime must
1118 assume what is implied by
1119 "AccQual" and "IsConst". Values
1120 are:
1121
1122 - "ReadOnly"
1123 - "WriteOnly"
1124 - "ReadWrite"
1125
1126 "IsConst" boolean Indicates if the kernel argument
1127 is const qualified. Only present
1128 if "ValueKind" is
1129 "GlobalBuffer".
1130
1131 "IsRestrict" boolean Indicates if the kernel argument
1132 is restrict qualified. Only
1133 present if "ValueKind" is
1134 "GlobalBuffer".
1135
1136 "IsVolatile" boolean Indicates if the kernel argument
1137 is volatile qualified. Only
1138 present if "ValueKind" is
1139 "GlobalBuffer".
1140
1141 "IsPipe" boolean Indicates if the kernel argument
1142 is pipe qualified. Only present
1143 if "ValueKind" is "Pipe".
1144
1145 .. TODO
1146 Can GlobalBuffer be pipe
1147 qualified?
1148 ================= ============== ========= ================================
1149
1150..
1151
1152 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1153 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1154
1155 ============================ ============== ========= =====================
1156 String Key Value Type Required? Description
1157 ============================ ============== ========= =====================
1158 "KernargSegmentSize" integer Required The size in bytes of
1159 the kernarg segment
1160 that holds the values
1161 of the arguments to
1162 the kernel.
1163 "GroupSegmentFixedSize" integer Required The amount of group
1164 segment memory
1165 required by a
1166 work-group in
1167 bytes. This does not
1168 include any
1169 dynamically allocated
1170 group segment memory
1171 that may be added
1172 when the kernel is
1173 dispatched.
1174 "PrivateSegmentFixedSize" integer Required The amount of fixed
1175 private address space
1176 memory required for a
1177 work-item in
1178 bytes. If
1179 IsDynamicCallstack
1180 is 1 then additional
1181 space must be added
1182 to this value for the
1183 call stack.
1184 "KernargSegmentAlign" integer Required The maximum byte
1185 alignment of
1186 arguments in the
1187 kernarg segment. Must
1188 be a power of 2.
1189 "WavefrontSize" integer Required Wavefront size. Must
1190 be a power of 2.
1191 "NumSGPRs" integer Number of scalar
1192 registers used by a
1193 wavefront for
1194 GFX6-GFX9. This
1195 includes the special
1196 SGPRs for VCC, Flat
1197 Scratch (GFX7-GFX9)
1198 and XNACK (for
1199 GFX8-GFX9). It does
1200 not include the 16
1201 SGPR added if a trap
1202 handler is
1203 enabled. It is not
1204 rounded up to the
1205 allocation
1206 granularity.
1207 "NumVGPRs" integer Number of vector
1208 registers used by
1209 each work-item for
1210 GFX6-GFX9
Konstantin Zhuravlyov8d5e9e12017-10-18 17:31:09 +00001211 "MaxFlatWorkGroupSize" integer Maximum flat
Tony Tyef16a45e2017-06-06 20:31:59 +00001212 work-group size
1213 supported by the
1214 kernel in work-items.
1215 "IsDynamicCallStack" boolean Indicates if the
1216 generated machine
1217 code is using a
1218 dynamically sized
1219 call stack.
1220 "IsXNACKEnabled" boolean Indicates if the
1221 generated machine
1222 code is capable of
1223 supporting XNACK.
1224 ============================ ============== ========= =====================
1225
1226..
1227
1228 .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
1229 :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
1230
1231 =================================== ============== ========= ==============
1232 String Key Value Type Required? Description
1233 =================================== ============== ========= ==============
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001234 "DebuggerABIVersion" sequence of
1235 2 integers
Tony Tyef16a45e2017-06-06 20:31:59 +00001236 "ReservedNumVGPRs" integer
1237 "ReservedFirstVGPR" integer
1238 "PrivateSegmentBufferSGPR" integer
1239 "WavefrontPrivateSegmentOffsetSGPR" integer
1240 =================================== ============== ========= ==============
1241
1242.. TODO
1243 Plan to remove the debug properties metadata.
1244
Tony Tyef16a45e2017-06-06 20:31:59 +00001245Kernel Dispatch
1246~~~~~~~~~~~~~~~
1247
1248The HSA architected queuing language (AQL) defines a user space memory interface
1249that can be used to control the dispatch of kernels, in an agent independent
1250way. An agent can have zero or more AQL queues created for it using the ROCm
1251runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1252*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1253mechanics and packet layouts.
1254
1255The packet processor of a kernel agent is responsible for detecting and
1256dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1257packet processor is implemented by the hardware command processor (CP),
1258asynchronous dispatch controller (ADC) and shader processor input controller
1259(SPI).
1260
1261The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1262mode driver to initialize and register the AQL queue with CP.
1263
1264To dispatch a kernel the following actions are performed. This can occur in the
1265CPU host program, or from an HSA kernel executing on a GPU.
1266
12671. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1268 executed is obtained.
12692. A pointer to the kernel descriptor (see
1270 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1271 obtained. It must be for a kernel that is contained in a code object that that
1272 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1273 associated.
12743. Space is allocated for the kernel arguments using the ROCm runtime allocator
1275 for a memory region with the kernarg property for the kernel agent that will
1276 execute the kernel. It must be at least 16 byte aligned.
12774. Kernel argument values are assigned to the kernel argument memory
1278 allocation. The layout is defined in the *HSA Programmer’s Language Reference*
1279 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1280 memory in the same way constant memory is accessed. (Note that the HSA
1281 specification allows an implementation to copy the kernel argument contents to
1282 another location that is accessed by the kernel.)
12835. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1284 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1285 packet. The packet must be set up, and the final write must use an atomic
1286 store release to set the packet kind to ensure the packet contents are
1287 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1288 notify the kernel agent that the AQL queue has been updated. These rules, and
1289 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1290 System Architecture Specification* [HSA]_.
12916. A kernel dispatch packet includes information about the actual dispatch,
1292 such as grid and work-group size, together with information from the code
1293 object about the kernel, such as segment sizes. The ROCm runtime queries on
1294 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001295 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000012967. CP executes micro-code and is responsible for detecting and setting up the
1297 GPU to execute the wavefronts of a kernel dispatch.
12988. CP ensures that when the a wavefront starts executing the kernel machine
1299 code, the scalar general purpose registers (SGPR) and vector general purpose
1300 registers (VGPR) are set up as required by the machine code. The required
1301 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1302 register state is defined in
1303 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13049. The prolog of the kernel machine code (see
1305 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1306 before continuing executing the machine code that corresponds to the kernel.
130710. When the kernel dispatch has completed execution, CP signals the completion
1308 signal specified in the kernel dispatch packet if not 0.
1309
1310.. _amdgpu-amdhsa-memory-spaces:
1311
1312Memory Spaces
1313~~~~~~~~~~~~~
1314
1315The memory space properties are:
1316
1317 .. table:: AMDHSA Memory Spaces
1318 :name: amdgpu-amdhsa-memory-spaces-table
1319
1320 ================= =========== ======== ======= ==================
1321 Memory Space Name HSA Segment Hardware Address NULL Value
1322 Name Name Size
1323 ================= =========== ======== ======= ==================
1324 Private private scratch 32 0x00000000
1325 Local group LDS 32 0xFFFFFFFF
1326 Global global global 64 0x0000000000000000
1327 Constant constant *same as 64 0x0000000000000000
1328 global*
1329 Generic flat flat 64 0x0000000000000000
1330 Region N/A GDS 32 *not implemented
1331 for AMDHSA*
1332 ================= =========== ======== ======= ==================
1333
1334The global and constant memory spaces both use global virtual addresses, which
1335are the same virtual address space used by the CPU. However, some virtual
1336addresses may only be accessible to the CPU, some only accessible by the GPU,
1337and some by both.
1338
1339Using the constant memory space indicates that the data will not change during
1340the execution of the kernel. This allows scalar read instructions to be
1341used. The vector and scalar L1 caches are invalidated of volatile data before
1342each kernel dispatch execution to allow constant memory to change values between
1343kernel dispatches.
1344
1345The local memory space uses the hardware Local Data Store (LDS) which is
1346automatically allocated when the hardware creates work-groups of wavefronts, and
1347freed when all the wavefronts of a work-group have terminated. The data store
1348(DS) instructions can be used to access it.
1349
1350The private memory space uses the hardware scratch memory support. If the kernel
1351uses scratch, then the hardware allocates memory that is accessed using
1352wavefront lane dword (4 byte) interleaving. The mapping used from private
1353address to physical address is:
1354
1355 ``wavefront-scratch-base +
1356 (private-address * wavefront-size * 4) +
1357 (wavefront-lane-id * 4)``
1358
1359There are different ways that the wavefront scratch base address is determined
1360by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1361memory can be accessed in an interleaved manner using buffer instruction with
1362the scratch buffer descriptor and per wave scratch offset, by the scratch
1363instructions, or by flat instructions. If each lane of a wavefront accesses the
1364same private address, the interleaving results in adjacent dwords being accessed
1365and hence requires fewer cache lines to be fetched. Multi-dword access is not
1366supported except by flat and scratch instructions in GFX9.
1367
1368The generic address space uses the hardware flat address support available in
1369GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1370local appertures), that are outside the range of addressible global memory, to
1371map from a flat address to a private or local address.
1372
1373FLAT instructions can take a flat address and access global, private (scratch)
1374and group (LDS) memory depending in if the address is within one of the
1375apperture ranges. Flat access to scratch requires hardware aperture setup and
1376setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1377access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1378(see :ref:`amdgpu-amdhsa-m0`).
1379
1380To convert between a segment address and a flat address the base address of the
1381appertures address can be used. For GFX7-GFX8 these are available in the
1382:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1383Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1384GFX9 the appature base addresses are directly available as inline constant
1385registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1386address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1387which makes it easier to convert from flat to segment or segment to flat.
1388
Tony Tye46d35762017-08-15 20:47:41 +00001389Image and Samplers
1390~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001391
1392Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1393hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1394HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1395enumeration values for the queries that are not trivially deducible from the S#
1396representation.
1397
1398HSA Signals
1399~~~~~~~~~~~
1400
Tony Tye46d35762017-08-15 20:47:41 +00001401HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1402structure allocated in memory accessible from both the CPU and GPU. The
1403structure is defined by the ROCm runtime and subject to change between releases
1404(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001405
1406.. _amdgpu-amdhsa-hsa-aql-queue:
1407
1408HSA AQL Queue
1409~~~~~~~~~~~~~
1410
Tony Tye46d35762017-08-15 20:47:41 +00001411The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001412between releases (see [AMD-ROCm-github]_). For some processors it contains
1413fields needed to implement certain language features such as the flat address
1414aperture bases. It also contains fields used by CP such as managing the
1415allocation of scratch memory.
1416
1417.. _amdgpu-amdhsa-kernel-descriptor:
1418
1419Kernel Descriptor
1420~~~~~~~~~~~~~~~~~
1421
1422A kernel descriptor consists of the information needed by CP to initiate the
1423execution of a kernel, including the entry point address of the machine code
1424that implements the kernel.
1425
1426Kernel Descriptor for GFX6-GFX9
1427+++++++++++++++++++++++++++++++
1428
1429CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1430
1431 .. table:: Kernel Descriptor for GFX6-GFX9
1432 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1433
1434 ======= ======= =============================== ===========================
1435 Bits Size Field Name Description
1436 ======= ======= =============================== ===========================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001437 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tyef16a45e2017-06-06 20:31:59 +00001438 address space memory
1439 required for a work-group
1440 in bytes. This does not
1441 include any dynamically
1442 allocated local address
1443 space memory that may be
1444 added when the kernel is
1445 dispatched.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001446 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tyef16a45e2017-06-06 20:31:59 +00001447 private address space
1448 memory required for a
1449 work-item in bytes. If
1450 is_dynamic_callstack is 1
1451 then additional space must
1452 be added to this value for
1453 the call stack.
Konstantin Zhuravlyov8d5e9e12017-10-18 17:31:09 +00001454 95:64 4 bytes MaxFlatWorkGroupSize Maximum flat work-group
Tony Tyef16a45e2017-06-06 20:31:59 +00001455 size supported by the
1456 kernel in work-items.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001457 96 1 bit IsDynamicCallStack Indicates if the generated
Tony Tyef16a45e2017-06-06 20:31:59 +00001458 machine code is using a
1459 dynamically sized call
1460 stack.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001461 97 1 bit IsXNACKEnabled Indicates if the generated
Tony Tyef16a45e2017-06-06 20:31:59 +00001462 machine code is capable of
1463 suppoting XNACK.
1464 127:98 30 bits Reserved. Must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001465 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tyef16a45e2017-06-06 20:31:59 +00001466 negative) from base
1467 address of kernel
1468 descriptor to kernel's
1469 entry point instruction
1470 which must be 256 byte
1471 aligned.
1472 383:192 24 Reserved. Must be 0.
1473 bytes
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001474 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001475 program settings used by
1476 CP to set up
1477 ``COMPUTE_PGM_RSRC1``
1478 configuration
1479 register. See
1480 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001481 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001482 program settings used by
1483 CP to set up
1484 ``COMPUTE_PGM_RSRC2``
1485 configuration
1486 register. See
1487 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001488 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1489 SGPR user data registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001490 (see
1491 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1492
1493 The total number of SGPR
1494 user data registers
1495 requested must not exceed
1496 16 and match value in
1497 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1498 Any requests beyond 16
1499 will be ignored.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001500 449 1 bit EnableSGPRDispatchPtr *see above*
1501 450 1 bit EnableSGPRQueuePtr *see above*
1502 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1503 452 1 bit EnableSGPRDispatchID *see above*
1504 453 1 bit EnableSGPRFlatScratchInit *see above*
1505 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1506 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1507 should always be 0.
1508 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1509 should always be 0.
1510 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1511 should always be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001512 463:458 6 bits Reserved. Must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001513 511:464 6 Reserved. Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001514 bytes
1515 512 **Total size 64 bytes.**
1516 ======= ===================================================================
1517
1518..
1519
1520 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1521 :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
1522
Tony Tye3b340612017-06-07 00:46:08 +00001523 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001524 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001525 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001526 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001527 used by each work-item,
1528 granularity is device
1529 specific:
1530
1531 GFX6-9
1532 roundup((max-vgpg + 1)
1533 / 4) - 1
1534
1535 Used by CP to set up
1536 ``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001537 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001538 used by a wavefront,
1539 granularity is device
1540 specific:
1541
1542 GFX6-8
1543 roundup((max-sgpg + 1)
1544 / 8) - 1
1545 GFX9
1546 roundup((max-sgpg + 1)
1547 / 16) - 1
1548
1549 Includes the special SGPRs
1550 for VCC, Flat Scratch (for
1551 GFX7 onwards) and XNACK
1552 (for GFX8 onwards). It does
1553 not include the 16 SGPR
1554 added if a trap handler is
1555 enabled.
1556
1557 Used by CP to set up
1558 ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001559 11:10 2 bits PRIORITY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001560
1561 Start executing wavefront
1562 at the specified priority.
1563
1564 CP is responsible for
1565 filling in
1566 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001567 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001568 with specified rounding
1569 mode for single (32
1570 bit) floating point
1571 precision floating point
1572 operations.
1573
1574 Floating point rounding
1575 mode values are defined in
1576 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1577
1578 Used by CP to set up
1579 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001580 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001581 with specified rounding
1582 denorm mode for half/double (16
1583 and 64 bit) floating point
1584 precision floating point
1585 operations.
1586
1587 Floating point rounding
1588 mode values are defined in
1589 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1590
1591 Used by CP to set up
1592 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001593 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001594 with specified denorm mode
1595 for single (32
1596 bit) floating point
1597 precision floating point
1598 operations.
1599
1600 Floating point denorm mode
1601 values are defined in
1602 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1603
1604 Used by CP to set up
1605 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001606 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001607 with specified denorm mode
1608 for half/double (16
1609 and 64 bit) floating point
1610 precision floating point
1611 operations.
1612
1613 Floating point denorm mode
1614 values are defined in
1615 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1616
1617 Used by CP to set up
1618 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001619 20 1 bit PRIV Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001620
1621 Start executing wavefront
1622 in privilege trap handler
1623 mode.
1624
1625 CP is responsible for
1626 filling in
1627 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001628 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001629 with DX10 clamp mode
1630 enabled. Used by the vector
1631 ALU to force DX-10 style
1632 treatment of NaN's (when
1633 set, clamp NaN to zero,
1634 otherwise pass NaN
1635 through).
1636
1637 Used by CP to set up
1638 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001639 22 1 bit DEBUG_MODE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001640
1641 Start executing wavefront
1642 in single step mode.
1643
1644 CP is responsible for
1645 filling in
1646 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001647 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001648 with IEEE mode
1649 enabled. Floating point
1650 opcodes that support
1651 exception flag gathering
1652 will quiet and propagate
1653 signaling-NaN inputs per
1654 IEEE 754-2008. Min_dx10 and
1655 max_dx10 become IEEE
1656 754-2008 compliant due to
1657 signaling-NaN propagation
1658 and quieting.
1659
1660 Used by CP to set up
1661 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001662 24 1 bit BULKY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001663
1664 Only one work-group allowed
1665 to execute on a compute
1666 unit.
1667
1668 CP is responsible for
1669 filling in
1670 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001671 25 1 bit CDBG_USER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001672
1673 Flag that can be used to
1674 control debugging code.
1675
1676 CP is responsible for
1677 filling in
1678 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001679 26 1 bit FP16_OVFL GFX6-8:
1680 Reserved. Must be 0.
1681 GFX9:
1682 Wavefront starts
1683 execution with specified
1684 fp16 overflow mode.
1685
1686 - If 0, then fp16
1687 overflow generates
1688 +/-INF values.
1689 - If 1, then fp16
1690 overflow that is the
1691 result of an +/-INF
1692 input value or divide
1693 by 0 generates a
1694 +/-INF, otherwise
1695 clamps computed
1696 overflow to +/-MAX_FP16
1697 as appropriate.
1698
1699 Used by CP to set up
1700 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1701 31:27 5 bits Reserved. Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001702 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001703 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001704
1705..
1706
1707 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1708 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1709
Tony Tye3b340612017-06-07 00:46:08 +00001710 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001711 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001712 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001713 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1714 _WAVE_OFFSET SGPR wave scratch offset
Tony Tyef16a45e2017-06-06 20:31:59 +00001715 system register (see
1716 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1717
1718 Used by CP to set up
1719 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001720 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tyef16a45e2017-06-06 20:31:59 +00001721 user data registers
1722 requested. This number must
1723 match the number of user
1724 data registers enabled.
1725
1726 Used by CP to set up
1727 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001728 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tyef16a45e2017-06-06 20:31:59 +00001729 TRAP instruction which
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00001730 requires a trap handler to
Tony Tyef16a45e2017-06-06 20:31:59 +00001731 be enabled.
1732
1733 CP sets
1734 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1735 if the runtime has
1736 installed a trap handler
1737 regardless of the setting
1738 of this field.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001739 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001740 system SGPR register for
1741 the work-group id in the X
1742 dimension (see
1743 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1744
1745 Used by CP to set up
1746 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001747 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001748 system SGPR register for
1749 the work-group id in the Y
1750 dimension (see
1751 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1752
1753 Used by CP to set up
1754 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001755 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001756 system SGPR register for
1757 the work-group id in the Z
1758 dimension (see
1759 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1760
1761 Used by CP to set up
1762 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001763 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001764 system SGPR register for
1765 work-group information (see
1766 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1767
1768 Used by CP to set up
1769 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001770 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001771 VGPR system registers used
1772 for the work-item ID.
1773 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1774 defines the values.
1775
1776 Used by CP to set up
1777 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001778 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001779
1780 Wavefront starts execution
1781 with address watch
1782 exceptions enabled which
1783 are generated when L1 has
1784 witnessed a thread access
1785 an *address of
1786 interest*.
1787
1788 CP is responsible for
1789 filling in the address
1790 watch bit in
1791 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1792 according to what the
1793 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001794 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001795
1796 Wavefront starts execution
1797 with memory violation
1798 exceptions exceptions
1799 enabled which are generated
1800 when a memory violation has
1801 occurred for this wave from
1802 L1 or LDS
1803 (write-to-read-only-memory,
1804 mis-aligned atomic, LDS
1805 address out of range,
1806 illegal address, etc.).
1807
1808 CP sets the memory
1809 violation bit in
1810 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1811 according to what the
1812 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001813 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001814
1815 CP uses the rounded value
1816 from the dispatch packet,
1817 not this value, as the
1818 dispatch may contain
1819 dynamically allocated group
1820 segment memory. CP writes
1821 directly to
1822 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1823
1824 Amount of group segment
1825 (LDS) to allocate for each
1826 work-group. Granularity is
1827 device specific:
1828
1829 GFX6:
1830 roundup(lds-size / (64 * 4))
1831 GFX7-GFX9:
1832 roundup(lds-size / (128 * 4))
1833
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001834 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1835 _INVALID_OPERATION with specified exceptions
Tony Tyef16a45e2017-06-06 20:31:59 +00001836 enabled.
1837
1838 Used by CP to set up
1839 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1840 (set from bits 0..6).
1841
1842 IEEE 754 FP Invalid
1843 Operation
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001844 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1845 _SOURCE input operands is a
Tony Tyef16a45e2017-06-06 20:31:59 +00001846 denormal number
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001847 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1848 _DIVISION_BY_ZERO Zero
1849 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1850 _OVERFLOW
1851 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1852 _UNDERFLOW
1853 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1854 _INEXACT
1855 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1856 _ZERO (rcp_iflag_f32 instruction
Tony Tyef16a45e2017-06-06 20:31:59 +00001857 only)
1858 31 1 bit Reserved. Must be 0.
1859 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001860 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001861
1862..
1863
1864 .. table:: Floating Point Rounding Mode Enumeration Values
1865 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1866
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001867 ====================================== ===== ==============================
1868 Enumeration Name Value Description
1869 ====================================== ===== ==============================
1870 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1871 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1872 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1873 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1874 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001875
1876..
1877
1878 .. table:: Floating Point Denorm Mode Enumeration Values
1879 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1880
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001881 ====================================== ===== ==============================
1882 Enumeration Name Value Description
1883 ====================================== ===== ==============================
1884 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1885 Denorms
1886 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1887 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1888 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1889 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001890
1891..
1892
1893 .. table:: System VGPR Work-Item ID Enumeration Values
1894 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1895
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001896 ======================================== ===== ============================
1897 Enumeration Name Value Description
1898 ======================================== ===== ============================
1899 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1900 ID.
1901 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1902 dimensions ID.
1903 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1904 dimensions ID.
1905 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1906 ======================================== ===== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001907
1908.. _amdgpu-amdhsa-initial-kernel-execution-state:
1909
1910Initial Kernel Execution State
1911~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1912
1913This section defines the register state that will be set up by the packet
1914processor prior to the start of execution of every wavefront. This is limited by
1915the constraints of the hardware controllers of CP/ADC/SPI.
1916
1917The order of the SGPR registers is defined, but the compiler can specify which
1918ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1919fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1920for enabled registers are dense starting at SGPR0: the first enabled register is
1921SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1922an SGPR number.
1923
1924The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1925all waves of the grid. It is possible to specify more than 16 User SGPRs using
1926the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1927initialized. These are then immediately followed by the System SGPRs that are
1928set up by ADC/SPI and can have different values for each wave of the grid
1929dispatch.
1930
1931SGPR register initial state is defined in
1932:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1933
1934 .. table:: SGPR Register Set Up Order
1935 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1936
1937 ========== ========================== ====== ==============================
1938 SGPR Order Name Number Description
1939 (kernel descriptor enable of
1940 field) SGPRs
1941 ========== ========================== ====== ==============================
1942 First Private Segment Buffer 4 V# that can be used, together
1943 (enable_sgpr_private with Scratch Wave Offset as an
1944 _segment_buffer) offset, to access the private
1945 memory space using a segment
1946 address.
1947
1948 CP uses the value provided by
1949 the runtime.
1950 then Dispatch Ptr 2 64 bit address of AQL dispatch
1951 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
1952 actually executing.
1953 then Queue Ptr 2 64 bit address of amd_queue_t
1954 (enable_sgpr_queue_ptr) object for AQL queue on which
1955 the dispatch packet was
1956 queued.
1957 then Kernarg Segment Ptr 2 64 bit address of Kernarg
1958 (enable_sgpr_kernarg segment. This is directly
1959 _segment_ptr) copied from the
1960 kernarg_address in the kernel
1961 dispatch packet.
1962
1963 Having CP load it once avoids
1964 loading it at the beginning of
1965 every wavefront.
1966 then Dispatch Id 2 64 bit Dispatch ID of the
1967 (enable_sgpr_dispatch_id) dispatch packet being
1968 executed.
1969 then Flat Scratch Init 2 This is 2 SGPRs:
1970 (enable_sgpr_flat_scratch
1971 _init) GFX6
1972 Not supported.
1973 GFX7-GFX8
1974 The first SGPR is a 32 bit
1975 byte offset from
1976 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1977 to per SPI base of memory
1978 for scratch for the queue
1979 executing the kernel
1980 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00001981 from the runtime. (The
1982 Scratch Segment Buffer base
1983 address is
1984 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1985 plus this offset.) The value
1986 of Scratch Wave Offset must
1987 be added to this offset by
1988 the kernel machine code,
1989 right shifted by 8, and
1990 moved to the FLAT_SCRATCH_HI
1991 SGPR register.
1992 FLAT_SCRATCH_HI corresponds
1993 to SGPRn-4 on GFX7, and
1994 SGPRn-6 on GFX8 (where SGPRn
1995 is the highest numbered SGPR
1996 allocated to the wave).
1997 FLAT_SCRATCH_HI is
1998 multiplied by 256 (as it is
1999 in units of 256 bytes) and
2000 added to
2001 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2002 to calculate the per wave
2003 FLAT SCRATCH BASE in flat
2004 memory instructions that
2005 access the scratch
2006 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002007
2008 The second SGPR is 32 bit
2009 byte size of a single
2010 work-item’s scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002011 usage. CP obtains this from
2012 the runtime, and it is
2013 always a multiple of DWORD.
2014 CP checks that the value in
2015 the kernel dispatch packet
2016 Private Segment Byte Size is
2017 not larger, and requests the
2018 runtime to increase the
2019 queue's scratch size if
2020 necessary. The kernel code
2021 must move it to
2022 FLAT_SCRATCH_LO which is
2023 SGPRn-3 on GFX7 and SGPRn-5
2024 on GFX8. FLAT_SCRATCH_LO is
2025 used as the FLAT SCRATCH
2026 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002027 instructions. Having CP load
2028 it once avoids loading it at
2029 the beginning of every
Tony Tye46d35762017-08-15 20:47:41 +00002030 wavefront. GFX9 This is the
2031 64 bit base address of the
2032 per SPI scratch backing
2033 memory managed by SPI for
2034 the queue executing the
2035 kernel dispatch. CP obtains
2036 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002037 divides it if there are
2038 multiple Shader Arrays each
2039 with its own SPI). The value
2040 of Scratch Wave Offset must
2041 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002042 machine code and the result
2043 moved to the FLAT_SCRATCH
2044 SGPR which is SGPRn-6 and
2045 SGPRn-5. It is used as the
2046 FLAT SCRATCH BASE in flat
2047 memory instructions. then
2048 Private Segment Size 1 The
2049 32 bit byte size of a
2050 (enable_sgpr_private single
2051 work-item's
2052 scratch_segment_size) memory
2053 allocation. This is the
2054 value from the kernel
2055 dispatch packet Private
2056 Segment Byte Size rounded up
2057 by CP to a multiple of
2058 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002059
2060 Having CP load it once avoids
2061 loading it at the beginning of
2062 every wavefront.
2063
2064 This is not used for
2065 GFX7-GFX8 since it is the same
2066 value as the second SGPR of
2067 Flat Scratch Init. However, it
2068 may be needed for GFX9 which
2069 changes the meaning of the
2070 Flat Scratch Init value.
2071 then Grid Work-Group Count X 1 32 bit count of the number of
2072 (enable_sgpr_grid work-groups in the X dimension
2073 _workgroup_count_X) for the grid being
2074 executed. Computed from the
2075 fields in the kernel dispatch
2076 packet as ((grid_size.x +
2077 workgroup_size.x - 1) /
2078 workgroup_size.x).
2079 then Grid Work-Group Count Y 1 32 bit count of the number of
2080 (enable_sgpr_grid work-groups in the Y dimension
2081 _workgroup_count_Y && for the grid being
2082 less than 16 previous executed. Computed from the
2083 SGPRs) fields in the kernel dispatch
2084 packet as ((grid_size.y +
2085 workgroup_size.y - 1) /
2086 workgroupSize.y).
2087
2088 Only initialized if <16
2089 previous SGPRs initialized.
2090 then Grid Work-Group Count Z 1 32 bit count of the number of
2091 (enable_sgpr_grid work-groups in the Z dimension
2092 _workgroup_count_Z && for the grid being
2093 less than 16 previous executed. Computed from the
2094 SGPRs) fields in the kernel dispatch
2095 packet as ((grid_size.z +
2096 workgroup_size.z - 1) /
2097 workgroupSize.z).
2098
2099 Only initialized if <16
2100 previous SGPRs initialized.
2101 then Work-Group Id X 1 32 bit work-group id in X
2102 (enable_sgpr_workgroup_id dimension of grid for
2103 _X) wavefront.
2104 then Work-Group Id Y 1 32 bit work-group id in Y
2105 (enable_sgpr_workgroup_id dimension of grid for
2106 _Y) wavefront.
2107 then Work-Group Id Z 1 32 bit work-group id in Z
2108 (enable_sgpr_workgroup_id dimension of grid for
2109 _Z) wavefront.
2110 then Work-Group Info 1 {first_wave, 14’b0000,
2111 (enable_sgpr_workgroup ordered_append_term[10:0],
2112 _info) threadgroup_size_in_waves[5:0]}
2113 then Scratch Wave Offset 1 32 bit byte offset from base
2114 (enable_sgpr_private of scratch base of queue
2115 _segment_wave_offset) executing the kernel
2116 dispatch. Must be used as an
2117 offset with Private
2118 segment address when using
2119 Scratch Segment Buffer. It
2120 must be used to set up FLAT
2121 SCRATCH for flat addressing
2122 (see
2123 :ref:`amdgpu-amdhsa-flat-scratch`).
2124 ========== ========================== ====== ==============================
2125
2126The order of the VGPR registers is defined, but the compiler can specify which
2127ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2128fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2129for enabled registers are dense starting at VGPR0: the first enabled register is
2130VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2131VGPR number.
2132
2133VGPR register initial state is defined in
2134:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2135
2136 .. table:: VGPR Register Set Up Order
2137 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2138
2139 ========== ========================== ====== ==============================
2140 VGPR Order Name Number Description
2141 (kernel descriptor enable of
2142 field) VGPRs
2143 ========== ========================== ====== ==============================
2144 First Work-Item Id X 1 32 bit work item id in X
2145 (Always initialized) dimension of work-group for
2146 wavefront lane.
2147 then Work-Item Id Y 1 32 bit work item id in Y
2148 (enable_vgpr_workitem_id dimension of work-group for
2149 > 0) wavefront lane.
2150 then Work-Item Id Z 1 32 bit work item id in Z
2151 (enable_vgpr_workitem_id dimension of work-group for
2152 > 1) wavefront lane.
2153 ========== ========================== ====== ==============================
2154
2155The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2156
21571. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2158 registers.
21592. Work-group Id registers X, Y, Z are set by ADC which supports any
2160 combination including none.
21613. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2162 cannot included with the flat scratch init value which is per queue.
21634. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2164 or (X, Y, Z).
2165
2166Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2167value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2168
2169The global segment can be accessed either using buffer instructions (GFX6 which
2170has V# 64 bit address support), flat instructions (GFX7-9), or global
2171instructions (GFX9).
2172
2173If buffer operations are used then the compiler can generate a V# with the
2174following properties:
2175
2176* base address of 0
2177* no swizzle
2178* ATC: 1 if IOMMU present (such as APU)
2179* ptr64: 1
2180* MTYPE set to support memory coherence that matches the runtime (such as CC for
2181 APU and NC for dGPU).
2182
2183.. _amdgpu-amdhsa-kernel-prolog:
2184
2185Kernel Prolog
2186~~~~~~~~~~~~~
2187
2188.. _amdgpu-amdhsa-m0:
2189
2190M0
2191++
2192
2193GFX6-GFX8
2194 The M0 register must be initialized with a value at least the total LDS size
2195 if the kernel may access LDS via DS or flat operations. Total LDS size is
2196 available in dispatch packet. For M0, it is also possible to use maximum
2197 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2198 GFX7-GFX8).
2199GFX9
2200 The M0 register is not used for range checking LDS accesses and so does not
2201 need to be initialized in the prolog.
2202
2203.. _amdgpu-amdhsa-flat-scratch:
2204
2205Flat Scratch
2206++++++++++++
2207
2208If the kernel may use flat operations to access scratch memory, the prolog code
2209must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2210are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2211Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2212
2213GFX6
2214 Flat scratch is not supported.
2215
2216GFX7-8
2217 1. The low word of Flat Scratch Init is 32 bit byte offset from
2218 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2219 being managed by SPI for the queue executing the kernel dispatch. This is
2220 the same value used in the Scratch Segment Buffer V# base address. The
2221 prolog must add the value of Scratch Wave Offset to get the wave's byte
2222 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2223 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2224 by 8 before moving into FLAT_SCRATCH_LO.
2225 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2226 work-items scratch memory usage. This is directly loaded from the kernel
2227 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2228 DWORD. Having CP load it once avoids loading it at the beginning of every
2229 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2230 SIZE.
2231GFX9
2232 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2233 memory being managed by SPI for the queue executing the kernel dispatch. The
2234 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2235 pair for use as the flat scratch base in flat memory instructions.
2236
2237.. _amdgpu-amdhsa-memory-model:
2238
2239Memory Model
2240~~~~~~~~~~~~
2241
2242This section describes the mapping of LLVM memory model onto AMDGPU machine code
2243(see :ref:`memmodel`). *The implementation is WIP.*
2244
2245.. TODO
2246 Update when implementation complete.
2247
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002248 Support more relaxed OpenCL memory model to be controlled by environment
Tony Tyef16a45e2017-06-06 20:31:59 +00002249 component of target triple.
2250
2251The AMDGPU backend supports the memory synchronization scopes specified in
2252:ref:`amdgpu-memory-scopes`.
2253
2254The code sequences used to implement the memory model are defined in table
2255:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2256
2257The sequences specify the order of instructions that a single thread must
2258execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2259to other memory instructions executed by the same thread. This allows them to be
2260moved earlier or later which can allow them to be combined with other instances
2261of the same instruction, or hoisted/sunk out of loops to improve
2262performance. Only the instructions related to the memory model are given;
2263additional ``s_waitcnt`` instructions are required to ensure registers are
2264defined before being used. These may be able to be combined with the memory
2265model ``s_waitcnt`` instructions as described above.
2266
2267The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
2268OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
2269relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
2270memory model which has separate happens-before relations for the global and
2271local address spaces, and only a fence specifying both global and local address
2272space joins the relationships. Since the LLVM ``memfence`` instruction does not
2273allow an address space to be specified the OpenCL fence has to convervatively
2274assume both local and global address space was specified. However, optimizations
2275can often be done to eliminate the additional ``s_waitcnt``instructions when
2276there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
2277instructions. The code sequences in the table indicate what can be omitted for
2278the OpenCL memory. The target triple environment is used to determine if the
2279source language is OpenCL (see :ref:`amdgpu-opencl`).
2280
2281``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2282operations.
2283
2284``buffer/global/flat_load/store/atomic`` instructions to global memory are
2285termed vector memory operations.
2286
2287For GFX6-GFX9:
2288
2289* Each agent has multiple compute units (CU).
2290* Each CU has multiple SIMDs that execute wavefronts.
2291* The wavefronts for a single work-group are executed in the same CU but may be
2292 executed by different SIMDs.
2293* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2294 executing on it.
2295* All LDS operations of a CU are performed as wavefront wide operations in a
2296 global order and involve no caching. Completion is reported to a wavefront in
2297 execution order.
2298* The LDS memory has multiple request queues shared by the SIMDs of a
2299 CU. Therefore, the LDS operations performed by different waves of a work-group
2300 can be reordered relative to each other, which can result in reordering the
2301 visibility of vector memory operations with respect to LDS operations of other
2302 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002303 ensure synchronization between LDS operations and vector memory operations
Tony Tyef16a45e2017-06-06 20:31:59 +00002304 between waves of a work-group, but not between operations performed by the
2305 same wavefront.
2306* The vector memory operations are performed as wavefront wide operations and
2307 completion is reported to a wavefront in execution order. The exception is
2308 that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2309 vector memory order if they access LDS memory, and out of LDS operation order
2310 if they access global memory.
2311* The vector memory operations access a vector L1 cache shared by all wavefronts
2312 on a CU. Therefore, no special action is required for coherence between
2313 wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
2314 coherence between waves executing in different work-groups as they may be
2315 executing on different CUs.
2316* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2317 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2318 scalar operations are used in a restricted way so do not impact the memory
2319 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2320* The vector and scalar memory operations use an L2 cache shared by all CUs on
2321 the same agent.
2322* The L2 cache has independent channels to service disjoint ranges of virtual
2323 addresses.
2324* Each CU has a separate request queue per channel. Therefore, the vector and
2325 scalar memory operations performed by waves executing in different work-groups
2326 (which may be executing on different CUs) of an agent can be reordered
2327 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002328 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002329 previous vector memory operation has completed before executing a subsequent
2330 vector memory or LDS operation and so can be used to meet the requirements of
2331 acquire and release.
2332* The L2 cache can be kept coherent with other agents on some targets, or ranges
2333 of virtual addresses can be set up to bypass it to ensure system coherence.
2334
2335Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2336or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2337memory, atomic memory orderings are not meaningful and all accesses are treated
2338as non-atomic.
2339
2340Constant address space uses ``buffer/global_load`` instructions (or equivalent
2341scalar memory instructions). Since the constant address space contents do not
2342change during the execution of a kernel dispatch it is not legal to perform
2343stores, and atomic memory orderings are not meaningful and all access are
2344treated as non-atomic.
2345
2346A memory synchronization scope wider than work-group is not meaningful for the
2347group (LDS) address space and is treated as work-group.
2348
2349The memory model does not support the region address space which is treated as
2350non-atomic.
2351
2352Acquire memory ordering is not meaningful on store atomic instructions and is
2353treated as non-atomic.
2354
2355Release memory ordering is not meaningful on load atomic instructions and is
2356treated a non-atomic.
2357
2358Acquire-release memory ordering is not meaningful on load or store atomic
2359instructions and is treated as acquire and release respectively.
2360
2361AMDGPU backend only uses scalar memory operations to access memory that is
2362proven to not change during the execution of the kernel dispatch. This includes
2363constant address space and global address space for program scope const
2364variables. Therefore the kernel machine code does not have to maintain the
2365scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2366and vector L1 caches are invalidated between kernel dispatches by CP since
2367constant address space data may change between kernel dispatch executions. See
2368:ref:`amdgpu-amdhsa-memory-spaces`.
2369
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002370The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002371case the AMDGPU backend ensures the memory location used to spill is never
2372accessed by vector memory operations at the same time. If scalar writes are used
2373then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2374return since the locations may be used for vector memory instructions by a
2375future wave that uses the same scratch area, or a function call that creates a
2376frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2377as all scalar writes are write-before-read in the same thread.
2378
2379Scratch backing memory (which is used for the private address space) is accessed
2380with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
2381is only accessed by a single thread, and is always write-before-read,
2382there is never a need to invalidate these entries from the L1 cache. Hence all
2383cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
2384lines.
2385
2386On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2387to invalidate the L2 cache. This also causes it to be treated as non-volatile
2388and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
2389coherent) and so the L2 cache will coherent with the CPU and other agents.
2390
2391 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2392 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2393
2394 ============ ============ ============== ========== =======================
2395 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2396 Ordering Sync Scope Address
2397 Space
2398 ============ ============ ============== ========== =======================
2399 **Non-Atomic**
2400 ---------------------------------------------------------------------------
2401 load *none* *none* - global non-volatile
2402 - generic 1. buffer/global/flat_load
2403 volatile
2404 1. buffer/global/flat_load
2405 glc=1
2406 load *none* *none* - local 1. ds_load
2407 store *none* *none* - global 1. buffer/global/flat_store
2408 - generic
2409 store *none* *none* - local 1. ds_store
2410 **Unordered Atomic**
2411 ---------------------------------------------------------------------------
2412 load atomic unordered *any* *any* *Same as non-atomic*.
2413 store atomic unordered *any* *any* *Same as non-atomic*.
2414 atomicrmw unordered *any* *any* *Same as monotonic
2415 atomic*.
2416 **Monotonic Atomic**
2417 ---------------------------------------------------------------------------
2418 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2419 - wavefront - generic
2420 - workgroup
2421 load atomic monotonic - singlethread - local 1. ds_load
2422 - wavefront
2423 - workgroup
2424 load atomic monotonic - agent - global 1. buffer/global/flat_load
2425 - system - generic glc=1
2426 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2427 - wavefront - generic
2428 - workgroup
2429 - agent
2430 - system
2431 store atomic monotonic - singlethread - local 1. ds_store
2432 - wavefront
2433 - workgroup
2434 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2435 - wavefront - generic
2436 - workgroup
2437 - agent
2438 - system
2439 atomicrmw monotonic - singlethread - local 1. ds_atomic
2440 - wavefront
2441 - workgroup
2442 **Acquire Atomic**
2443 ---------------------------------------------------------------------------
2444 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2445 - wavefront - local
2446 - generic
2447 load atomic acquire - workgroup - global 1. buffer/global_load
2448 load atomic acquire - workgroup - local 1. ds/flat_load
2449 - generic 2. s_waitcnt lgkmcnt(0)
2450
2451 - If OpenCL, omit
2452 waitcnt.
2453 - Must happen before
2454 any following
2455 global/generic
2456 load/load
2457 atomic/store/store
2458 atomic/atomicrmw.
2459 - Ensures any
2460 following global
2461 data read is no
2462 older than the load
2463 atomic value being
2464 acquired.
2465
2466 load atomic acquire - agent - global 1. buffer/global_load
2467 - system glc=1
2468 2. s_waitcnt vmcnt(0)
2469
2470 - Must happen before
2471 following
2472 buffer_wbinvl1_vol.
2473 - Ensures the load
2474 has completed
2475 before invalidating
2476 the cache.
2477
2478 3. buffer_wbinvl1_vol
2479
2480 - Must happen before
2481 any following
2482 global/generic
2483 load/load
2484 atomic/atomicrmw.
2485 - Ensures that
2486 following
2487 loads will not see
2488 stale global data.
2489
2490 load atomic acquire - agent - generic 1. flat_load glc=1
2491 - system 2. s_waitcnt vmcnt(0) &
2492 lgkmcnt(0)
2493
2494 - If OpenCL omit
2495 lgkmcnt(0).
2496 - Must happen before
2497 following
2498 buffer_wbinvl1_vol.
2499 - Ensures the flat_load
2500 has completed
2501 before invalidating
2502 the cache.
2503
2504 3. buffer_wbinvl1_vol
2505
2506 - Must happen before
2507 any following
2508 global/generic
2509 load/load
2510 atomic/atomicrmw.
2511 - Ensures that
2512 following loads
2513 will not see stale
2514 global data.
2515
2516 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2517 - wavefront - local
2518 - generic
2519 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
2520 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
2521 - generic 2. waitcnt lgkmcnt(0)
2522
2523 - If OpenCL, omit
2524 waitcnt.
2525 - Must happen before
2526 any following
2527 global/generic
2528 load/load
2529 atomic/store/store
2530 atomic/atomicrmw.
2531 - Ensures any
2532 following global
2533 data read is no
2534 older than the
2535 atomicrmw value
2536 being acquired.
2537
2538 atomicrmw acquire - agent - global 1. buffer/global_atomic
2539 - system 2. s_waitcnt vmcnt(0)
2540
2541 - Must happen before
2542 following
2543 buffer_wbinvl1_vol.
2544 - Ensures the
2545 atomicrmw has
2546 completed before
2547 invalidating the
2548 cache.
2549
2550 3. buffer_wbinvl1_vol
2551
2552 - Must happen before
2553 any following
2554 global/generic
2555 load/load
2556 atomic/atomicrmw.
2557 - Ensures that
2558 following loads
2559 will not see stale
2560 global data.
2561
2562 atomicrmw acquire - agent - generic 1. flat_atomic
2563 - system 2. s_waitcnt vmcnt(0) &
2564 lgkmcnt(0)
2565
2566 - If OpenCL, omit
2567 lgkmcnt(0).
2568 - Must happen before
2569 following
2570 buffer_wbinvl1_vol.
2571 - Ensures the
2572 atomicrmw has
2573 completed before
2574 invalidating the
2575 cache.
2576
2577 3. buffer_wbinvl1_vol
2578
2579 - Must happen before
2580 any following
2581 global/generic
2582 load/load
2583 atomic/atomicrmw.
2584 - Ensures that
2585 following loads
2586 will not see stale
2587 global data.
2588
2589 fence acquire - singlethread *none* *none*
2590 - wavefront
2591 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2592
2593 - If OpenCL and
2594 address space is
2595 not generic, omit
2596 waitcnt. However,
2597 since LLVM
2598 currently has no
2599 address space on
2600 the fence need to
2601 conservatively
2602 always generate. If
2603 fence had an
2604 address space then
2605 set to address
2606 space of OpenCL
2607 fence flag, or to
2608 generic if both
2609 local and global
2610 flags are
2611 specified.
2612 - Must happen after
2613 any preceding
2614 local/generic load
2615 atomic/atomicrmw
2616 with an equal or
2617 wider sync scope
2618 and memory ordering
2619 stronger than
2620 unordered (this is
2621 termed the
2622 fence-paired-atomic).
2623 - Must happen before
2624 any following
2625 global/generic
2626 load/load
2627 atomic/store/store
2628 atomic/atomicrmw.
2629 - Ensures any
2630 following global
2631 data read is no
2632 older than the
2633 value read by the
2634 fence-paired-atomic.
2635
2636 fence acquire - agent *none* 1. s_waitcnt vmcnt(0) &
2637 - system lgkmcnt(0)
2638
2639 - If OpenCL and
2640 address space is
2641 not generic, omit
2642 lgkmcnt(0).
2643 However, since LLVM
2644 currently has no
2645 address space on
2646 the fence need to
2647 conservatively
2648 always generate
2649 (see comment for
2650 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002651 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002652 separate s_waitcnt
2653 vmcnt(0) and
2654 s_waitcnt
2655 lgkmcnt(0) to allow
2656 them to be
2657 independently moved
2658 according to the
2659 following rules.
2660 - s_waitcnt vmcnt(0)
2661 must happen after
2662 any preceding
2663 global/generic load
2664 atomic/atomicrmw
2665 with an equal or
2666 wider sync scope
2667 and memory ordering
2668 stronger than
2669 unordered (this is
2670 termed the
2671 fence-paired-atomic).
2672 - s_waitcnt lgkmcnt(0)
2673 must happen after
2674 any preceding
2675 group/generic load
2676 atomic/atomicrmw
2677 with an equal or
2678 wider sync scope
2679 and memory ordering
2680 stronger than
2681 unordered (this is
2682 termed the
2683 fence-paired-atomic).
2684 - Must happen before
2685 the following
2686 buffer_wbinvl1_vol.
2687 - Ensures that the
2688 fence-paired atomic
2689 has completed
2690 before invalidating
2691 the
2692 cache. Therefore
2693 any following
2694 locations read must
2695 be no older than
2696 the value read by
2697 the
2698 fence-paired-atomic.
2699
2700 2. buffer_wbinvl1_vol
2701
2702 - Must happen before
2703 any following global/generic
2704 load/load
2705 atomic/store/store
2706 atomic/atomicrmw.
2707 - Ensures that
2708 following loads
2709 will not see stale
2710 global data.
2711
2712 **Release Atomic**
2713 ---------------------------------------------------------------------------
2714 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2715 - wavefront - local
2716 - generic
2717 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2718 - generic
2719 - If OpenCL, omit
2720 waitcnt.
2721 - Must happen after
2722 any preceding
2723 local/generic
2724 load/store/load
2725 atomic/store
2726 atomic/atomicrmw.
2727 - Must happen before
2728 the following
2729 store.
2730 - Ensures that all
2731 memory operations
2732 to local have
2733 completed before
2734 performing the
2735 store that is being
2736 released.
2737
2738 2. buffer/global/flat_store
2739 store atomic release - workgroup - local 1. ds_store
2740 store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
2741 - system - generic lgkmcnt(0)
2742
2743 - If OpenCL, omit
2744 lgkmcnt(0).
2745 - Could be split into
2746 separate s_waitcnt
2747 vmcnt(0) and
2748 s_waitcnt
2749 lgkmcnt(0) to allow
2750 them to be
2751 independently moved
2752 according to the
2753 following rules.
2754 - s_waitcnt vmcnt(0)
2755 must happen after
2756 any preceding
2757 global/generic
2758 load/store/load
2759 atomic/store
2760 atomic/atomicrmw.
2761 - s_waitcnt lgkmcnt(0)
2762 must happen after
2763 any preceding
2764 local/generic
2765 load/store/load
2766 atomic/store
2767 atomic/atomicrmw.
2768 - Must happen before
2769 the following
2770 store.
2771 - Ensures that all
2772 memory operations
2773 to global have
2774 completed before
2775 performing the
2776 store that is being
2777 released.
2778
2779 2. buffer/global/ds/flat_store
2780 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2781 - wavefront - local
2782 - generic
2783 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2784 - generic
2785 - If OpenCL, omit
2786 waitcnt.
2787 - Must happen after
2788 any preceding
2789 local/generic
2790 load/store/load
2791 atomic/store
2792 atomic/atomicrmw.
2793 - Must happen before
2794 the following
2795 atomicrmw.
2796 - Ensures that all
2797 memory operations
2798 to local have
2799 completed before
2800 performing the
2801 atomicrmw that is
2802 being released.
2803
2804 2. buffer/global/flat_atomic
2805 atomicrmw release - workgroup - local 1. ds_atomic
2806 atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
2807 - system - generic lgkmcnt(0)
2808
2809 - If OpenCL, omit
2810 lgkmcnt(0).
2811 - Could be split into
2812 separate s_waitcnt
2813 vmcnt(0) and
2814 s_waitcnt
2815 lgkmcnt(0) to allow
2816 them to be
2817 independently moved
2818 according to the
2819 following rules.
2820 - s_waitcnt vmcnt(0)
2821 must happen after
2822 any preceding
2823 global/generic
2824 load/store/load
2825 atomic/store
2826 atomic/atomicrmw.
2827 - s_waitcnt lgkmcnt(0)
2828 must happen after
2829 any preceding
2830 local/generic
2831 load/store/load
2832 atomic/store
2833 atomic/atomicrmw.
2834 - Must happen before
2835 the following
2836 atomicrmw.
2837 - Ensures that all
2838 memory operations
2839 to global and local
2840 have completed
2841 before performing
2842 the atomicrmw that
2843 is being released.
2844
2845 2. buffer/global/ds/flat_atomic*
2846 fence release - singlethread *none* *none*
2847 - wavefront
2848 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2849
2850 - If OpenCL and
2851 address space is
2852 not generic, omit
2853 waitcnt. However,
2854 since LLVM
2855 currently has no
2856 address space on
2857 the fence need to
2858 conservatively
2859 always generate
2860 (see comment for
2861 previous fence).
2862 - Must happen after
2863 any preceding
2864 local/generic
2865 load/load
2866 atomic/store/store
2867 atomic/atomicrmw.
2868 - Must happen before
2869 any following store
2870 atomic/atomicrmw
2871 with an equal or
2872 wider sync scope
2873 and memory ordering
2874 stronger than
2875 unordered (this is
2876 termed the
2877 fence-paired-atomic).
2878 - Ensures that all
2879 memory operations
2880 to local have
2881 completed before
2882 performing the
2883 following
2884 fence-paired-atomic.
2885
2886 fence release - agent *none* 1. s_waitcnt vmcnt(0) &
2887 - system lgkmcnt(0)
2888
2889 - If OpenCL and
2890 address space is
2891 not generic, omit
2892 lgkmcnt(0).
2893 However, since LLVM
2894 currently has no
2895 address space on
2896 the fence need to
2897 conservatively
2898 always generate
2899 (see comment for
2900 previous fence).
2901 - Could be split into
2902 separate s_waitcnt
2903 vmcnt(0) and
2904 s_waitcnt
2905 lgkmcnt(0) to allow
2906 them to be
2907 independently moved
2908 according to the
2909 following rules.
2910 - s_waitcnt vmcnt(0)
2911 must happen after
2912 any preceding
2913 global/generic
2914 load/store/load
2915 atomic/store
2916 atomic/atomicrmw.
2917 - s_waitcnt lgkmcnt(0)
2918 must happen after
2919 any preceding
2920 local/generic
2921 load/store/load
2922 atomic/store
2923 atomic/atomicrmw.
2924 - Must happen before
2925 any following store
2926 atomic/atomicrmw
2927 with an equal or
2928 wider sync scope
2929 and memory ordering
2930 stronger than
2931 unordered (this is
2932 termed the
2933 fence-paired-atomic).
2934 - Ensures that all
2935 memory operations
2936 to global have
2937 completed before
2938 performing the
2939 following
2940 fence-paired-atomic.
2941
2942 **Acquire-Release Atomic**
2943 ---------------------------------------------------------------------------
2944 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
2945 - wavefront - local
2946 - generic
2947 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
2948
2949 - If OpenCL, omit
2950 waitcnt.
2951 - Must happen after
2952 any preceding
2953 local/generic
2954 load/store/load
2955 atomic/store
2956 atomic/atomicrmw.
2957 - Must happen before
2958 the following
2959 atomicrmw.
2960 - Ensures that all
2961 memory operations
2962 to local have
2963 completed before
2964 performing the
2965 atomicrmw that is
2966 being released.
2967
2968 2. buffer/global_atomic
2969 atomicrmw acq_rel - workgroup - local 1. ds_atomic
2970 2. s_waitcnt lgkmcnt(0)
2971
2972 - If OpenCL, omit
2973 waitcnt.
2974 - Must happen before
2975 any following
2976 global/generic
2977 load/load
2978 atomic/store/store
2979 atomic/atomicrmw.
2980 - Ensures any
2981 following global
2982 data read is no
2983 older than the load
2984 atomic value being
2985 acquired.
2986
2987 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2988
2989 - If OpenCL, omit
2990 waitcnt.
2991 - Must happen after
2992 any preceding
2993 local/generic
2994 load/store/load
2995 atomic/store
2996 atomic/atomicrmw.
2997 - Must happen before
2998 the following
2999 atomicrmw.
3000 - Ensures that all
3001 memory operations
3002 to local have
3003 completed before
3004 performing the
3005 atomicrmw that is
3006 being released.
3007
3008 2. flat_atomic
3009 3. s_waitcnt lgkmcnt(0)
3010
3011 - If OpenCL, omit
3012 waitcnt.
3013 - Must happen before
3014 any following
3015 global/generic
3016 load/load
3017 atomic/store/store
3018 atomic/atomicrmw.
3019 - Ensures any
3020 following global
3021 data read is no
3022 older than the load
3023 atomic value being
3024 acquired.
3025 atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
3026 - system lgkmcnt(0)
3027
3028 - If OpenCL, omit
3029 lgkmcnt(0).
3030 - Could be split into
3031 separate s_waitcnt
3032 vmcnt(0) and
3033 s_waitcnt
3034 lgkmcnt(0) to allow
3035 them to be
3036 independently moved
3037 according to the
3038 following rules.
3039 - s_waitcnt vmcnt(0)
3040 must happen after
3041 any preceding
3042 global/generic
3043 load/store/load
3044 atomic/store
3045 atomic/atomicrmw.
3046 - s_waitcnt lgkmcnt(0)
3047 must happen after
3048 any preceding
3049 local/generic
3050 load/store/load
3051 atomic/store
3052 atomic/atomicrmw.
3053 - Must happen before
3054 the following
3055 atomicrmw.
3056 - Ensures that all
3057 memory operations
3058 to global have
3059 completed before
3060 performing the
3061 atomicrmw that is
3062 being released.
3063
3064 2. buffer/global_atomic
3065 3. s_waitcnt vmcnt(0)
3066
3067 - Must happen before
3068 following
3069 buffer_wbinvl1_vol.
3070 - Ensures the
3071 atomicrmw has
3072 completed before
3073 invalidating the
3074 cache.
3075
3076 4. buffer_wbinvl1_vol
3077
3078 - Must happen before
3079 any following
3080 global/generic
3081 load/load
3082 atomic/atomicrmw.
3083 - Ensures that
3084 following loads
3085 will not see stale
3086 global data.
3087
3088 atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
3089 - system lgkmcnt(0)
3090
3091 - If OpenCL, omit
3092 lgkmcnt(0).
3093 - Could be split into
3094 separate s_waitcnt
3095 vmcnt(0) and
3096 s_waitcnt
3097 lgkmcnt(0) to allow
3098 them to be
3099 independently moved
3100 according to the
3101 following rules.
3102 - s_waitcnt vmcnt(0)
3103 must happen after
3104 any preceding
3105 global/generic
3106 load/store/load
3107 atomic/store
3108 atomic/atomicrmw.
3109 - s_waitcnt lgkmcnt(0)
3110 must happen after
3111 any preceding
3112 local/generic
3113 load/store/load
3114 atomic/store
3115 atomic/atomicrmw.
3116 - Must happen before
3117 the following
3118 atomicrmw.
3119 - Ensures that all
3120 memory operations
3121 to global have
3122 completed before
3123 performing the
3124 atomicrmw that is
3125 being released.
3126
3127 2. flat_atomic
3128 3. s_waitcnt vmcnt(0) &
3129 lgkmcnt(0)
3130
3131 - If OpenCL, omit
3132 lgkmcnt(0).
3133 - Must happen before
3134 following
3135 buffer_wbinvl1_vol.
3136 - Ensures the
3137 atomicrmw has
3138 completed before
3139 invalidating the
3140 cache.
3141
3142 4. buffer_wbinvl1_vol
3143
3144 - Must happen before
3145 any following
3146 global/generic
3147 load/load
3148 atomic/atomicrmw.
3149 - Ensures that
3150 following loads
3151 will not see stale
3152 global data.
3153
3154 fence acq_rel - singlethread *none* *none*
3155 - wavefront
3156 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3157
3158 - If OpenCL and
3159 address space is
3160 not generic, omit
3161 waitcnt. However,
3162 since LLVM
3163 currently has no
3164 address space on
3165 the fence need to
3166 conservatively
3167 always generate
3168 (see comment for
3169 previous fence).
3170 - Must happen after
3171 any preceding
3172 local/generic
3173 load/load
3174 atomic/store/store
3175 atomic/atomicrmw.
3176 - Must happen before
3177 any following
3178 global/generic
3179 load/load
3180 atomic/store/store
3181 atomic/atomicrmw.
3182 - Ensures that all
3183 memory operations
3184 to local have
3185 completed before
3186 performing any
3187 following global
3188 memory operations.
3189 - Ensures that the
3190 preceding
3191 local/generic load
3192 atomic/atomicrmw
3193 with an equal or
3194 wider sync scope
3195 and memory ordering
3196 stronger than
3197 unordered (this is
3198 termed the
3199 fence-paired-atomic)
3200 has completed
3201 before following
3202 global memory
3203 operations. This
3204 satisfies the
3205 requirements of
3206 acquire.
3207 - Ensures that all
3208 previous memory
3209 operations have
3210 completed before a
3211 following
3212 local/generic store
3213 atomic/atomicrmw
3214 with an equal or
3215 wider sync scope
3216 and memory ordering
3217 stronger than
3218 unordered (this is
3219 termed the
3220 fence-paired-atomic).
3221 This satisfies the
3222 requirements of
3223 release.
3224
3225 fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) &
3226 - system lgkmcnt(0)
3227
3228 - If OpenCL and
3229 address space is
3230 not generic, omit
3231 lgkmcnt(0).
3232 However, since LLVM
3233 currently has no
3234 address space on
3235 the fence need to
3236 conservatively
3237 always generate
3238 (see comment for
3239 previous fence).
3240 - Could be split into
3241 separate s_waitcnt
3242 vmcnt(0) and
3243 s_waitcnt
3244 lgkmcnt(0) to allow
3245 them to be
3246 independently moved
3247 according to the
3248 following rules.
3249 - s_waitcnt vmcnt(0)
3250 must happen after
3251 any preceding
3252 global/generic
3253 load/store/load
3254 atomic/store
3255 atomic/atomicrmw.
3256 - s_waitcnt lgkmcnt(0)
3257 must happen after
3258 any preceding
3259 local/generic
3260 load/store/load
3261 atomic/store
3262 atomic/atomicrmw.
3263 - Must happen before
3264 the following
3265 buffer_wbinvl1_vol.
3266 - Ensures that the
3267 preceding
3268 global/local/generic
3269 load
3270 atomic/atomicrmw
3271 with an equal or
3272 wider sync scope
3273 and memory ordering
3274 stronger than
3275 unordered (this is
3276 termed the
3277 fence-paired-atomic)
3278 has completed
3279 before invalidating
3280 the cache. This
3281 satisfies the
3282 requirements of
3283 acquire.
3284 - Ensures that all
3285 previous memory
3286 operations have
3287 completed before a
3288 following
3289 global/local/generic
3290 store
3291 atomic/atomicrmw
3292 with an equal or
3293 wider sync scope
3294 and memory ordering
3295 stronger than
3296 unordered (this is
3297 termed the
3298 fence-paired-atomic).
3299 This satisfies the
3300 requirements of
3301 release.
3302
3303 2. buffer_wbinvl1_vol
3304
3305 - Must happen before
3306 any following
3307 global/generic
3308 load/load
3309 atomic/store/store
3310 atomic/atomicrmw.
3311 - Ensures that
3312 following loads
3313 will not see stale
3314 global data. This
3315 satisfies the
3316 requirements of
3317 acquire.
3318
3319 **Sequential Consistent Atomic**
3320 ---------------------------------------------------------------------------
3321 load atomic seq_cst - singlethread - global *Same as corresponding
3322 - wavefront - local load atomic acquire*.
3323 - workgroup - generic
3324 load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
3325 - system - local
3326 - generic - Must happen after
3327 preceding
3328 global/generic load
3329 atomic/store
3330 atomic/atomicrmw
3331 with memory
3332 ordering of seq_cst
3333 and with equal or
3334 wider sync scope.
3335 (Note that seq_cst
3336 fences have their
3337 own s_waitcnt
3338 vmcnt(0) and so do
3339 not need to be
3340 considered.)
3341 - Ensures any
3342 preceding
3343 sequential
3344 consistent global
3345 memory instructions
3346 have completed
3347 before executing
3348 this sequentially
3349 consistent
3350 instruction. This
3351 prevents reordering
3352 a seq_cst store
3353 followed by a
3354 seq_cst load (Note
3355 that seq_cst is
3356 stronger than
3357 acquire/release as
3358 the reordering of
3359 load acquire
3360 followed by a store
3361 release is
3362 prevented by the
3363 waitcnt vmcnt(0) of
3364 the release, but
3365 there is nothing
3366 preventing a store
3367 release followed by
3368 load acquire from
3369 competing out of
3370 order.)
3371
3372 2. *Following
3373 instructions same as
3374 corresponding load
3375 atomic acquire*.
3376
3377 store atomic seq_cst - singlethread - global *Same as corresponding
3378 - wavefront - local store atomic release*.
3379 - workgroup - generic
3380 store atomic seq_cst - agent - global *Same as corresponding
3381 - system - generic store atomic release*.
3382 atomicrmw seq_cst - singlethread - global *Same as corresponding
3383 - wavefront - local atomicrmw acq_rel*.
3384 - workgroup - generic
3385 atomicrmw seq_cst - agent - global *Same as corresponding
3386 - system - generic atomicrmw acq_rel*.
3387 fence seq_cst - singlethread *none* *Same as corresponding
3388 - wavefront fence acq_rel*.
3389 - workgroup
3390 - agent
3391 - system
3392 ============ ============ ============== ========== =======================
3393
3394The memory order also adds the single thread optimization constrains defined in
3395table
3396:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3397
3398 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3399 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3400
3401 ============ ==============================================================
3402 LLVM Memory Optimization Constraints
3403 Ordering
3404 ============ ==============================================================
3405 unordered *none*
3406 monotonic *none*
3407 acquire - If a load atomic/atomicrmw then no following load/load
3408 atomic/store/ store atomic/atomicrmw/fence instruction can
3409 be moved before the acquire.
3410 - If a fence then same as load atomic, plus no preceding
3411 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003412 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003413 atomic/store/ store atomic/atomicrmw/fence instruction can
3414 be moved after the release.
3415 - If a fence then same as store atomic, plus no following
3416 associated fence-paired-atomic can be moved before the
3417 fence.
3418 acq_rel Same constraints as both acquire and release.
3419 seq_cst - If a load atomic then same constraints as acquire, plus no
3420 preceding sequentially consistent load atomic/store
3421 atomic/atomicrmw/fence instruction can be moved after the
3422 seq_cst.
3423 - If a store atomic then the same constraints as release, plus
3424 no following sequentially consistent load atomic/store
3425 atomic/atomicrmw/fence instruction can be moved before the
3426 seq_cst.
3427 - If an atomicrmw/fence then same constraints as acq_rel.
3428 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003429
Wei Ding16289cf2017-02-21 18:48:01 +00003430Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003431~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003432
Tony Tyef16a45e2017-06-06 20:31:59 +00003433For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3434(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3435the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003436
Tony Tyef16a45e2017-06-06 20:31:59 +00003437 .. table:: AMDGPU Trap Handler for AMDHSA OS
3438 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003439
Tony Tyef16a45e2017-06-06 20:31:59 +00003440 =================== =============== =============== =======================
3441 Usage Code Sequence Trap Handler Description
3442 Inputs
3443 =================== =============== =============== =======================
3444 reserved ``s_trap 0x00`` Reserved by hardware.
3445 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3446 ``queue_ptr`` ``debugtrap``
3447 ``VGPR0``: intrinsic (not
3448 ``arg`` implemented).
3449 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3450 ``queue_ptr`` terminated and its
3451 associated queue put
3452 into the error state.
3453 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3454 ``queue_ptr`` installed handled
3455 same as ``llvm.trap``.
3456 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3457 breakpoints.
3458 debugger ``s_trap 0x08`` Reserved for debugger.
3459 debugger ``s_trap 0xfe`` Reserved for debugger.
3460 debugger ``s_trap 0xff`` Reserved for debugger.
3461 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003462
Tony Tye46d35762017-08-15 20:47:41 +00003463Unspecified OS
3464--------------
3465
3466This section provides code conventions used when the target triple OS is
3467empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003468
3469Trap Handler ABI
3470~~~~~~~~~~~~~~~~
3471
3472For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3473not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3474instructions are handled as follows:
3475
3476 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3477 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3478
3479 =============== =============== ===========================================
3480 Usage Code Sequence Description
3481 =============== =============== ===========================================
3482 llvm.trap s_endpgm Causes wavefront to be terminated.
3483 llvm.debugtrap *none* Compiler warning given that there is no
3484 trap handler installed.
3485 =============== =============== ===========================================
3486
3487Source Languages
3488================
3489
3490.. _amdgpu-opencl:
3491
3492OpenCL
3493------
3494
3495When generating code for the OpenCL language the target triple environment
3496should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3497
3498When the language is OpenCL the following differences occur:
3499
35001. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
35012. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye46d35762017-08-15 20:47:41 +000035023. Additional metadata is generated
3503 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003504
3505.. TODO
3506 Specify what affect this has. Hidden arguments added. Additional metadata
3507 generated.
3508
3509.. _amdgpu-hcc:
3510
3511HCC
3512---
3513
3514When generating code for the OpenCL language the target triple environment
3515should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3516
3517When the language is OpenCL the following differences occur:
3518
35191. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3520
3521.. TODO
3522 Specify what affect this has.
Tom Stellard3ec09e62016-04-06 01:29:19 +00003523
Tom Stellard45bb48e2015-06-13 03:28:10 +00003524Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00003525---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00003526
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003527AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef16a45e2017-06-06 20:31:59 +00003528It supports AMDGCN GFX6-GFX8.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003529
Tony Tyef16a45e2017-06-06 20:31:59 +00003530This section describes general syntax for instructions and operands. For more
3531information about instructions, their semantics and supported combinations of
3532operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00003533[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003534
Tony Tyef16a45e2017-06-06 20:31:59 +00003535An instruction has the following syntax (register operands are normally
3536comma-separated while extra operands are space-separated):
Tom Stellard45bb48e2015-06-13 03:28:10 +00003537
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003538*<opcode> <register_operand0>, ... <extra_operand0> ...*
Tom Stellard45bb48e2015-06-13 03:28:10 +00003539
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003540Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00003541~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00003542
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003543The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003544
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003545* SGPR registers: s0, ... or s[0], ...
3546* VGPR registers: v0, ... or v[0], ...
3547* TTMP registers: ttmp0, ... or ttmp[0], ...
3548* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3549* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3550* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3551* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3552* Register index expressions: v[2*2], s[1-1:2-1]
3553* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00003554
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003555The following extra operands are supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003556
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003557* offset, offset0, offset1
3558* idxen, offen bits
3559* glc, slc, tfe bits
3560* waitcnt: integer or combination of counter values
3561* VOP3 modifiers:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003562
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003563 - abs (\| \|), neg (\-)
Tom Stellard45bb48e2015-06-13 03:28:10 +00003564
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003565* DPP modifiers:
3566
3567 - row_shl, row_shr, row_ror, row_rol
3568 - row_mirror, row_half_mirror, row_bcast
3569 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3570 - row_mask, bank_mask, bound_ctrl
3571
3572* SDWA modifiers:
3573
3574 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3575 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3576 - abs, neg, sext
3577
Tony Tyef16a45e2017-06-06 20:31:59 +00003578Instruction Examples
3579~~~~~~~~~~~~~~~~~~~~
3580
3581DS
3582~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003583
3584.. code-block:: nasm
3585
3586 ds_add_u32 v2, v4 offset:16
3587 ds_write_src2_b64 v2 offset0:4 offset1:8
3588 ds_cmpst_f32 v2, v4, v6
3589 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3590
3591
3592For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3593
Tony Tyef16a45e2017-06-06 20:31:59 +00003594FLAT
3595++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003596
3597.. code-block:: nasm
3598
3599 flat_load_dword v1, v[3:4]
3600 flat_store_dwordx3 v[3:4], v[5:7]
3601 flat_atomic_swap v1, v[3:4], v5 glc
3602 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3603 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3604
3605For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3606
Tony Tyef16a45e2017-06-06 20:31:59 +00003607MUBUF
3608+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003609
3610.. code-block:: nasm
3611
3612 buffer_load_dword v1, off, s[4:7], s1
3613 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3614 buffer_store_format_xy v[1:2], off, s[4:7], s1
3615 buffer_wbinvl1
3616 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3617
3618For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3619
Tony Tyef16a45e2017-06-06 20:31:59 +00003620SMRD/SMEM
3621+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003622
3623.. code-block:: nasm
3624
3625 s_load_dword s1, s[2:3], 0xfc
3626 s_load_dwordx8 s[8:15], s[2:3], s4
3627 s_load_dwordx16 s[88:103], s[2:3], s4
3628 s_dcache_inv_vol
3629 s_memtime s[4:5]
3630
3631For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3632
Tony Tyef16a45e2017-06-06 20:31:59 +00003633SOP1
3634++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003635
3636.. code-block:: nasm
3637
3638 s_mov_b32 s1, s2
3639 s_mov_b64 s[0:1], 0x80000000
3640 s_cmov_b32 s1, 200
3641 s_wqm_b64 s[2:3], s[4:5]
3642 s_bcnt0_i32_b64 s1, s[2:3]
3643 s_swappc_b64 s[2:3], s[4:5]
3644 s_cbranch_join s[4:5]
3645
3646For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3647
Tony Tyef16a45e2017-06-06 20:31:59 +00003648SOP2
3649++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003650
3651.. code-block:: nasm
3652
3653 s_add_u32 s1, s2, s3
3654 s_and_b64 s[2:3], s[4:5], s[6:7]
3655 s_cselect_b32 s1, s2, s3
3656 s_andn2_b32 s2, s4, s6
3657 s_lshr_b64 s[2:3], s[4:5], s6
3658 s_ashr_i32 s2, s4, s6
3659 s_bfm_b64 s[2:3], s4, s6
3660 s_bfe_i64 s[2:3], s[4:5], s6
3661 s_cbranch_g_fork s[4:5], s[6:7]
3662
3663For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3664
Tony Tyef16a45e2017-06-06 20:31:59 +00003665SOPC
3666++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003667
3668.. code-block:: nasm
3669
3670 s_cmp_eq_i32 s1, s2
3671 s_bitcmp1_b32 s1, s2
3672 s_bitcmp0_b64 s[2:3], s4
3673 s_setvskip s3, s5
3674
3675For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3676
Tony Tyef16a45e2017-06-06 20:31:59 +00003677SOPP
3678++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003679
3680.. code-block:: nasm
3681
3682 s_barrier
3683 s_nop 2
3684 s_endpgm
3685 s_waitcnt 0 ; Wait for all counters to be 0
3686 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3687 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3688 s_sethalt 9
3689 s_sleep 10
3690 s_sendmsg 0x1
3691 s_sendmsg sendmsg(MSG_INTERRUPT)
3692 s_trap 1
3693
3694For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3695
3696Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00003697of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00003698range or acceptable values.
3699
Tony Tyef16a45e2017-06-06 20:31:59 +00003700VALU
3701++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00003702
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003703For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3704the assembler will automatically use optimal encoding based on its operands.
3705To force specific encoding, one can add a suffix to the opcode of the instruction:
3706
3707* _e32 for 32-bit VOP1/VOP2/VOPC
3708* _e64 for 64-bit VOP3
3709* _dpp for VOP_DPP
3710* _sdwa for VOP_SDWA
3711
3712VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003713
3714.. code-block:: nasm
3715
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003716 v_mov_b32 v1, v2
3717 v_mov_b32_e32 v1, v2
3718 v_nop
3719 v_cvt_f64_i32_e32 v[1:2], v2
3720 v_floor_f32_e32 v1, v2
3721 v_bfrev_b32_e32 v1, v2
3722 v_add_f32_e32 v1, v2, v3
3723 v_mul_i32_i24_e64 v1, v2, 3
3724 v_mul_i32_i24_e32 v1, -3, v3
3725 v_mul_i32_i24_e32 v1, -100, v3
3726 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3727 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00003728
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003729VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003730
3731.. code-block:: nasm
3732
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003733 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3734 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3735 v_mov_b32 v0, v0 wave_shl:1
3736 v_mov_b32 v0, v0 row_mirror
3737 v_mov_b32 v0, v0 row_bcast:31
3738 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3739 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3740 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00003741
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003742VOP_SDWA examples:
3743
3744.. code-block:: nasm
3745
3746 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
3747 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
3748 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
3749 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
3750 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
3751
3752For full list of supported instructions, refer to "Vector ALU instructions".
3753
3754HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00003755~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003756
3757AMDGPU ABI defines auxiliary data in output code object. In assembly source,
3758one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00003759
3760.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00003761+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00003762
3763*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003764object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00003765
3766.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00003767+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3768
Tom Stellard347ac792015-06-26 21:15:07 +00003769
3770*major*, *minor*, and *stepping* are all integers that describe the instruction
3771set architecture (ISA) version of the assembly program.
3772
3773*vendor* and *arch* are quoted strings. *vendor* should always be equal to
3774"AMD" and *arch* should always be equal to "AMDGPU".
3775
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003776By default, the assembler will derive the ISA version, *vendor*, and *arch*
3777from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00003778
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003779.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00003780+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003781
3782This directives specifies that the symbol with given name is a kernel entry point
3783(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00003784
3785.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00003786++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00003787
3788This directive marks the beginning of a list of key / value pairs that are used
3789to specify the amd_kernel_code_t object that will be emitted by the assembler.
3790The list must be terminated by the *.end_amd_kernel_code_t* directive. For
3791any amd_kernel_code_t values that are unspecified a default value will be
3792used. The default value for all keys is 0, with the following exceptions:
3793
3794- *kernel_code_version_major* defaults to 1.
3795- *machine_kind* defaults to 1.
3796- *machine_version_major*, *machine_version_minor*, and
3797 *machine_version_stepping* are derived from the value of the -mcpu option
3798 that is passed to the assembler.
3799- *kernel_code_entry_byte_offset* defaults to 256.
3800- *wavefront_size* defaults to 6.
3801- *kernarg_segment_alignment*, *group_segment_alignment*, and
3802 *private_segment_alignment* default to 4. Note that alignments are specified
3803 as a power of two, so a value of **n** means an alignment of 2^ **n**.
3804
3805The *.amd_kernel_code_t* directive must be placed immediately after the
3806function label and before any instructions.
3807
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003808For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
3809comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00003810
3811Here is an example of a minimal amd_kernel_code_t specification:
3812
Aaron Ballman887ad0e2016-07-19 17:46:55 +00003813.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00003814
3815 .hsa_code_object_version 1,0
3816 .hsa_code_object_isa
3817
Tom Stellardb8a91bb2016-02-22 18:36:00 +00003818 .hsatext
3819 .globl hello_world
3820 .p2align 8
3821 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00003822
3823 hello_world:
3824
3825 .amd_kernel_code_t
3826 enable_sgpr_kernarg_segment_ptr = 1
3827 is_ptr64 = 1
3828 compute_pgm_rsrc1_vgprs = 0
3829 compute_pgm_rsrc1_sgprs = 0
3830 compute_pgm_rsrc2_user_sgpr = 2
3831 kernarg_segment_byte_size = 8
3832 wavefront_sgpr_count = 2
3833 workitem_vgpr_count = 3
3834 .end_amd_kernel_code_t
3835
3836 s_load_dwordx2 s[0:1], s[0:1] 0x0
3837 v_mov_b32 v0, 3.14159
3838 s_waitcnt lgkmcnt(0)
3839 v_mov_b32 v1, s0
3840 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00003841 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00003842 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00003843 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00003844 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00003845
3846Additional Documentation
3847========================
3848
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00003849.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
3850.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
3851.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
3852.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
3853.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
3854.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
3855.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
3856.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00003857.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
3858.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
3859.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
3860.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
3861.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
3862.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
3863.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
3864.. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
3865.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
3866.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
3867.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__