blob: 8b94a3e6a5a57a9a8f55b7bcfe6f82007230fa6a [file] [log] [blame]
Tony Tyef16a45e2017-06-06 20:31:59 +00001=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
26 .. table:: AMDGPU Target Triples
27 :name: amdgpu-target-triples-table
28
29 ============ ======== ========= ===========
30 Architecture Vendor OS Environment
31 ============ ======== ========= ===========
32 r600 amd <empty> <empty>
33 amdgcn amd <empty> <empty>
34 amdgcn amd amdhsa <empty>
35 amdgcn amd amdhsa opencl
36 amdgcn amd amdhsa amdgizcl
37 amdgcn amd amdhsa amdgiz
38 amdgcn amd amdhsa hcc
39 ============ ======== ========= ===========
40
41``r600-amd--``
42 Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
43 the MESA runtime.
44
45``amdgcn-amd--``
Tony Tye46d35762017-08-15 20:47:41 +000046 Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
Tony Tyef16a45e2017-06-06 20:31:59 +000047 the MESA runtime.
48
49``amdgcn-amd-amdhsa-``
50 Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
51 compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
52
53``amdgcn-amd-amdhsa-opencl``
54 Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
55 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
56 :ref:`amdgpu-opencl`.
57
58``amdgcn-amd-amdhsa-amdgizcl``
59 Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
60 is used (see :ref:`amdgpu-address-spaces`).
61
62``amdgcn-amd-amdhsa-amdgiz``
63 Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
64 used (see :ref:`amdgpu-address-spaces`).
65
66``amdgcn-amd-amdhsa-hcc``
67 Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
68 executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
69 :ref:`amdgpu-hcc`.
70
71.. _amdgpu-processors:
72
73Processors
74----------
75
76Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
77names from both the *Processor* and *Alternative Processor* can be used.
78
79 .. table:: AMDGPU Processors
80 :name: amdgpu-processors-table
81
82 ========== =========== ============ ===== ======= ==================
83 Processor Alternative Target dGPU/ Runtime Example
84 Processor Triple APU Support Products
85 Architecture
86 ========== =========== ============ ===== ======= ==================
87 **R600** [AMD-R6xx]_
88 --------------------------------------------------------------------
89 r600 r600 dGPU
90 r630 r600 dGPU
91 rs880 r600 dGPU
92 rv670 r600 dGPU
93 **R700** [AMD-R7xx]_
94 --------------------------------------------------------------------
95 rv710 r600 dGPU
96 rv730 r600 dGPU
97 rv770 r600 dGPU
98 **Evergreen** [AMD-Evergreen]_
99 --------------------------------------------------------------------
100 cedar r600 dGPU
101 redwood r600 dGPU
102 sumo r600 dGPU
103 juniper r600 dGPU
104 cypress r600 dGPU
105 **Northern Islands** [AMD-Cayman-Trinity]_
106 --------------------------------------------------------------------
107 barts r600 dGPU
108 turks r600 dGPU
109 caicos r600 dGPU
110 cayman r600 dGPU
111 **GCN GFX6 (Southern Islands (SI))** [AMD-Souther-Islands]_
112 --------------------------------------------------------------------
Konstantin Zhuravlyov6cbcb272017-08-08 04:28:31 +0000113 gfx600 - tahiti amdgcn dGPU
Tony Tyef16a45e2017-06-06 20:31:59 +0000114 gfx601 - pitcairn amdgcn dGPU
115 - verde
116 - oland
117 - hainan
118 **GCN GFX7 (Sea Islands (CI))** [AMD-Sea-Islands]_
119 --------------------------------------------------------------------
120 gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
121 - Radeon HD 8770
122 - R7 260
123 - R7 260X
124 \ - kaveri amdgcn APU - A6-7000
125 - A6 Pro-7050B
126 - A8-7100
127 - A8 Pro-7150B
128 - A10-7300
129 - A10 Pro-7350B
130 - FX-7500
131 - A8-7200P
132 - A10-7400P
133 - FX-7600P
134 gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
135 - FirePro W9100
136 - FirePro S9150
137 - FirePro S9170
Tony Tye1fd77b22017-06-08 01:47:25 +0000138 gfx702 dGPU ROCm - Radeon R9 290
Tony Tyef16a45e2017-06-06 20:31:59 +0000139 - Radeon R9 290x
140 - Radeon R390
Tony Tye1fd77b22017-06-08 01:47:25 +0000141 - Radeon R390x
142 gfx703 - kabini amdgcn APU - E1-2100
Tony Tyef16a45e2017-06-06 20:31:59 +0000143 - mullins - E1-2200
144 - E1-2500
145 - E2-3000
146 - E2-3800
147 - A4-5000
148 - A4-5100
149 - A6-5200
150 - A4 Pro-3340B
151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-Volcanic-Islands]_
152 --------------------------------------------------------------------
153 gfx800 - iceland amdgcn dGPU - FirePro S7150
154 - FirePro S7100
155 - FirePro W7100
156 - Radeon R285
157 - Radeon R9 380
158 - Radeon R9 385
159 - Mobile FirePro
160 M7170
161 gfx801 - carrizo amdgcn APU - A6-8500P
162 - Pro A6-8500B
163 - A8-8600P
164 - Pro A8-8600B
165 - FX-8800P
166 - Pro A12-8800B
167 \ amdgcn APU ROCm - A10-8700P
168 - Pro A10-8700B
169 - A10-8780P
170 \ amdgcn APU - A10-9600P
171 - A10-9630P
172 - A12-9700P
173 - A12-9730P
174 - FX-9800P
175 - FX-9830P
176 \ amdgcn APU - E2-9010
177 - A6-9210
178 - A9-9410
179 gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
180 gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
181 - Radeon R9 Fury
182 - Radeon R9 FuryX
183 - Radeon Pro Duo
184 - FirePro S9300x2
Tony Tye46d35762017-08-15 20:47:41 +0000185 - Radeon Instinct MI8
Tony Tyef16a45e2017-06-06 20:31:59 +0000186 \ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
187 - Radeon RX 480
Tony Tye46d35762017-08-15 20:47:41 +0000188 - Radeon Instinct MI6
Tony Tyef16a45e2017-06-06 20:31:59 +0000189 \ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
190 gfx804 amdgcn dGPU Same as gfx803
191 gfx810 - stoney amdgcn APU
Tony Tye46d35762017-08-15 20:47:41 +0000192 **GCN GFX9** [AMD-Vega]_
Tony Tyef16a45e2017-06-06 20:31:59 +0000193 --------------------------------------------------------------------
Tony Tye46d35762017-08-15 20:47:41 +0000194 gfx900 amdgcn dGPU - Radeon Vega
195 Frontier Edition
196 - Radeon RX Vega 56
197 - Radeon RX Vega 64
198 - Radeon RX Vega 64
199 Liquid
200 - Radeon Instinct MI25
Tony Tyef16a45e2017-06-06 20:31:59 +0000201 gfx901 amdgcn dGPU ROCm Same as gfx900
202 except XNACK is
203 enabled
204 gfx902 amdgcn APU *TBA*
205
206 .. TODO
207 Add product
208 names.
209 gfx903 amdgcn APU Same as gfx902
210 except XNACK is
211 enabled
212 ========== =========== ============ ===== ======= ==================
213
214.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000215
216Address Spaces
217--------------
218
Tony Tyef16a45e2017-06-06 20:31:59 +0000219The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000220
Tony Tyef16a45e2017-06-06 20:31:59 +0000221The memory space names used in the table, aside from the region memory space, is
222from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000223
Tony Tyef16a45e2017-06-06 20:31:59 +0000224LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000225
Tony Tyef16a45e2017-06-06 20:31:59 +0000226 .. table:: Address Space Mapping
227 :name: amdgpu-address-space-mapping-table
228
229 ================== ================= ================= ================= =================
230 LLVM Address Space Memory Space
231 ------------------ -----------------------------------------------------------------------
232 \ Current Default amdgiz/amdgizcl hcc Future Default
233 ================== ================= ================= ================= =================
234 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
235 1 Global Global Global Global
236 2 Constant Constant Constant Region (GDS)
237 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
238 4 Generic (Flat) Region (GDS) Region (GDS) Constant
239 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
240 ================== ================= ================= ================= =================
241
242Current Default
243 This is the current default address space mapping used for all languages
244 except hcc. This will shortly be deprecated.
245
246amdgiz/amdgizcl
247 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
248 is specified as the target triple environment value.
249
250hcc
251 This is the current address space mapping used when ``hcc`` is specified as
252 the target triple environment value.This will shortly be deprecated.
253
254Future Default
255 This will shortly be the only address space mapping for all languages using
256 AMDGPU backend.
257
258.. _amdgpu-memory-scopes:
259
260Memory Scopes
261-------------
262
263This section provides LLVM memory synchronization scopes supported by the AMDGPU
264backend memory model when the target triple OS is ``amdhsa`` (see
265:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
266
267The memory model supported is based on the HSA memory model [HSA]_ which is
268based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
269relation is transitive over the synchonizes-with relation independent of scope,
270and synchonizes-with allows the memory scope instances to be inclusive (see
271table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
272
273This is different to the OpenCL [OpenCL]_ memory model which does not have scope
274inclusion and requires the memory scopes to exactly match. However, this
275is conservatively correct for OpenCL.
276
277 .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
278 :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
279
280 ================ ==========================================================
281 LLVM Sync Scope Description
282 ================ ==========================================================
283 *none* The default: ``system``.
284
285 Synchronizes with, and participates in modification and
286 seq_cst total orderings with, other operations (except
287 image operations) for all address spaces (except private,
288 or generic that accesses private) provided the other
289 operation's sync scope is:
290
291 - ``system``.
292 - ``agent`` and executed by a thread on the same agent.
293 - ``workgroup`` and executed by a thread in the same
294 workgroup.
295 - ``wavefront`` and executed by a thread in the same
296 wavefront.
297
298 ``agent`` Synchronizes with, and participates in modification and
299 seq_cst total orderings with, other operations (except
300 image operations) for all address spaces (except private,
301 or generic that accesses private) provided the other
302 operation's sync scope is:
303
304 - ``system`` or ``agent`` and executed by a thread on the
305 same agent.
306 - ``workgroup`` and executed by a thread in the same
307 workgroup.
308 - ``wavefront`` and executed by a thread in the same
309 wavefront.
310
311 ``workgroup`` Synchronizes with, and participates in modification and
312 seq_cst total orderings with, other operations (except
313 image operations) for all address spaces (except private,
314 or generic that accesses private) provided the other
315 operation's sync scope is:
316
317 - ``system``, ``agent`` or ``workgroup`` and executed by a
318 thread in the same workgroup.
319 - ``wavefront`` and executed by a thread in the same
320 wavefront.
321
322 ``wavefront`` Synchronizes with, and participates in modification and
323 seq_cst total orderings with, other operations (except
324 image operations) for all address spaces (except private,
325 or generic that accesses private) provided the other
326 operation's sync scope is:
327
328 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
329 and executed by a thread in the same wavefront.
330
331 ``singlethread`` Only synchronizes with, and participates in modification
332 and seq_cst total orderings with, other operations (except
333 image operations) running in the same thread for all
334 address spaces (for example, in signal handlers).
335 ================ ==========================================================
336
337AMDGPU Intrinsics
338-----------------
339
340The AMDGPU backend implements the following intrinsics.
341
342*This section is WIP.*
343
344.. TODO
345 List AMDGPU intrinsics
346
347Code Object
348===========
349
350The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
351can be linked by ``lld`` to produce a standard ELF shared code object which can
352be loaded and executed on an AMDGPU target.
353
354Header
355------
356
357The AMDGPU backend uses the following ELF header:
358
359 .. table:: AMDGPU ELF Header
360 :name: amdgpu-elf-header-table
361
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000362 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000363 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000364 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000365 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
366 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Konstantin Zhuravlyov36963522017-10-03 21:18:03 +0000367 ``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``,
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000368 ``ELFOSABI_AMDGPU_PAL`` or
369 ``ELFOSABI_AMDGPU_MESA3D``
Konstantin Zhuravlyov36963522017-10-03 21:18:03 +0000370 ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``,
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000371 ``ELFABIVERSION_AMDGPU_PAL`` or
372 ``ELFABIVERSION_AMDGPU_MESA3D``
Tony Tyef16a45e2017-06-06 20:31:59 +0000373 ``e_type`` ``ET_REL`` or ``ET_DYN``
374 ``e_machine`` ``EM_AMDGPU``
375 ``e_entry`` 0
376 ``e_flags`` 0
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000377 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000378
379..
380
381 .. table:: AMDGPU ELF Header Enumeration Values
382 :name: amdgpu-elf-header-enumeration-values-table
383
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000384 =============================== =====
385 Name Value
386 =============================== =====
387 ``EM_AMDGPU`` 224
388 ``ELFOSABI_AMDGPU_HSA`` 64
389 ``ELFOSABI_AMDGPU_PAL`` 65
390 ``ELFOSABI_AMDGPU_MESA3D`` 66
391 ``ELFABIVERSION_AMDGPU_HSA`` 1
392 ``ELFABIVERSION_AMDGPU_PAL`` 0
393 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
394 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000395
396``e_ident[EI_CLASS]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000397 The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64
398 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000399
400``e_ident[EI_DATA]``
401 All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
402
403``e_ident[EI_OSABI]``
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000404 One of the following AMD GPU architecture specific OS ABIs:
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000405
406 * ``ELFOSABI_AMDGPU_HSA`` is used to specify that the code object conforms to
407 the AMD HSA runtime ABI [HSA]_.
408
409 * ``ELFOSABI_AMDGPU_PAL`` is used to specify that the code object conforms to
410 the AMD PAL runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000411
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000412 * ``ELFOSABI_AMDGPU_MESA3D`` is used to specify that the code object conforms
413 to the AMD MESA runtime ABI.
414
Tony Tyef16a45e2017-06-06 20:31:59 +0000415``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000416 The ABI version of the AMD GPU architecture specific OS ABI to which the code
417 object conforms:
418
419 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
420 runtime ABI.
421
422 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
423 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000424
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000425 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
426 runtime ABI.
427
Tony Tyef16a45e2017-06-06 20:31:59 +0000428``e_type``
429 Can be one of the following values:
430
431
432 ``ET_REL``
433 The type produced by the AMD GPU backend compiler as it is relocatable code
434 object.
435
436 ``ET_DYN``
437 The type produced by the linker as it is a shared code object.
438
439 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
440
441``e_machine``
442 The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
443 architecture family. The specific member is specified in the
444 ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
445 :ref:`amdgpu-note-records`).
446
447``e_entry``
448 The entry point is 0 as the entry points for individual kernels must be
449 selected in order to invoke them through AQL packets.
450
451``e_flags``
452 The value is 0 as no flags are used.
453
454Sections
455--------
456
457An AMDGPU target ELF code object has the standard ELF sections which include:
458
459 .. table:: AMDGPU ELF Sections
460 :name: amdgpu-elf-sections-table
461
462 ================== ================ =================================
463 Name Type Attributes
464 ================== ================ =================================
465 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
466 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
467 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
468 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
469 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
470 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
471 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
472 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
473 ``.note`` ``SHT_NOTE`` *none*
474 ``.rela``\ *name* ``SHT_RELA`` *none*
475 ``.rela.dyn`` ``SHT_RELA`` *none*
476 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
477 ``.shstrtab`` ``SHT_STRTAB`` *none*
478 ``.strtab`` ``SHT_STRTAB`` *none*
479 ``.symtab`` ``SHT_SYMTAB`` *none*
480 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
481 ================== ================ =================================
482
483These sections have their standard meanings (see [ELF]_) and are only generated
484if needed.
485
486``.debug``\ *\**
487 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
488 DWARF produced by the AMDGPU backend.
489
Tony Tye46d35762017-08-15 20:47:41 +0000490``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000491 The standard sections used by a dynamic loader.
492
493``.note``
494 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
495 backend.
496
497``.rela``\ *name*, ``.rela.dyn``
498 For relocatable code objects, *name* is the name of the section that the
499 relocation records apply. For example, ``.rela.text`` is the section name for
500 relocation records associated with the ``.text`` section.
501
502 For linked shared code objects, ``.rela.dyn`` contains all the relocation
503 records from each of the relocatable code object's ``.rela``\ *name* sections.
504
505 See :ref:`amdgpu-relocation-records` for the relocation records supported by
506 the AMDGPU backend.
507
508``.text``
509 The executable machine code for the kernels and functions they call. Generated
510 as position independent code. See :ref:`amdgpu-code-conventions` for
511 information on conventions used in the isa generation.
512
513.. _amdgpu-note-records:
514
515Note Records
516------------
517
518As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
519the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
520minimal zero byte padding must be generated to ensure the ``desc`` field size is
521a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
522be at least 4 to indicate at least 8 byte alignment.
523
524The AMDGPU backend code object uses the following ELF note records in the
525``.note`` section. The *Description* column specifies the layout of the note
526record’s ``desc`` field. All fields are consecutive bytes. Note records with
527variable size strings have a corresponding ``*_size`` field that specifies the
528number of bytes, including the terminating null character, in the string. The
529string(s) come immediately after the preceding fields.
530
531Additional note records can be present.
532
533 .. table:: AMDGPU ELF Note Records
534 :name: amdgpu-elf-note-records-table
535
Tony Tye46d35762017-08-15 20:47:41 +0000536 ===== ============================== ======================================
537 Name Type Description
538 ===== ============================== ======================================
539 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
540 "AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
541 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000542
543..
544
545 .. table:: AMDGPU ELF Note Record Enumeration Values
546 :name: amdgpu-elf-note-record-enumeration-values-table
547
Tony Tye46d35762017-08-15 20:47:41 +0000548 ============================== =====
549 Name Value
550 ============================== =====
551 *reserved* 0-9
552 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
553 ``NT_AMD_AMDGPU_ISA`` 11
554 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000555
556``NT_AMD_AMDGPU_ISA``
557 Specifies the instruction set architecture used by the machine code contained
558 in the code object.
559
560 This note record is required for code objects containing machine code for
561 processors matching the ``amdgcn`` architecture in table
562 :ref:`amdgpu-processors`.
563
564 The null terminated string has the following syntax:
565
566 *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
567
568 where:
569
570 *architecture*
571 The architecture from table :ref:`amdgpu-target-triples-table`.
572
573 This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
574 :ref:`amdgpu-target-triples`).
575
576 *vendor*
577 The vendor from table :ref:`amdgpu-target-triples-table`.
578
579 For the AMDGPU backend this is always ``amd``.
580
581 *os*
582 The OS from table :ref:`amdgpu-target-triples-table`.
583
584 *environment*
585 An environment from table :ref:`amdgpu-target-triples-table`, or blank if
586 the environment has no affect on the execution of the code object.
587
588 For the AMDGPU backend this is currently always blank.
589 *processor*
590 The processor from table :ref:`amdgpu-processors-table`.
591
592 For example:
593
594 ``amdgcn-amd-amdhsa--gfx901``
595
Tony Tye46d35762017-08-15 20:47:41 +0000596``NT_AMD_AMDGPU_HSA_METADATA``
597 Specifies extensible metadata associated with the code objects executed on HSA
598 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
599 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
600 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
601 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000602
Tony Tye46d35762017-08-15 20:47:41 +0000603.. _amdgpu-symbols:
604
605Symbols
606-------
607
608Symbols include the following:
609
610 .. table:: AMDGPU ELF Symbols
611 :name: amdgpu-elf-symbols-table
612
613 ===================== ============== ============= ==================
614 Name Type Section Description
615 ===================== ============== ============= ==================
616 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
617 - ``.rodata``
618 - ``.bss``
619 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
620 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
621 ===================== ============== ============= ==================
622
623Global variable
624 Global variables both used and defined by the compilation unit.
625
626 If the symbol is defined in the compilation unit then it is allocated in the
627 appropriate section according to if it has initialized data or is readonly.
628
629 If the symbol is external then its section is ``STN_UNDEF`` and the loader
630 will resolve relocations using the definition provided by another code object
631 or explicitly defined by the runtime.
632
633 All global symbols, whether defined in the compilation unit or external, are
634 accessed by the machine code indirectly through a GOT table entry. This
635 allows them to be preemptable. The GOT table is only supported when the target
636 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000637
638 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000639 Add description of linked shared object symbols. Seems undefined symbols
640 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000641
Tony Tye46d35762017-08-15 20:47:41 +0000642Kernel descriptor
643 Every HSA kernel has an associated kernel descriptor. It is the address of the
644 kernel descriptor that is used in the AQL dispatch packet used to invoke the
645 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
646 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
647
648Kernel entry point
649 Every HSA kernel also has a symbol for its machine code entry point.
650
651.. _amdgpu-relocation-records:
652
653Relocation Records
654------------------
655
656AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
657relocatable fields are:
658
659``word32``
660 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
661 alignment. These values use the same byte order as other word values in the
662 AMD GPU architecture.
663
664``word64``
665 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
666 alignment. These values use the same byte order as other word values in the
667 AMD GPU architecture.
668
669Following notations are used for specifying relocation calculations:
670
671**A**
672 Represents the addend used to compute the value of the relocatable field.
673
674**G**
675 Represents the offset into the global offset table at which the relocation
676 entry’s symbol will reside during execution.
677
678**GOT**
679 Represents the address of the global offset table.
680
681**P**
682 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
683 of the storage unit being relocated (computed using ``r_offset``).
684
685**S**
686 Represents the value of the symbol whose index resides in the relocation
687 entry.
688
689The following relocation types are supported:
690
691 .. table:: AMDGPU ELF Relocation Records
692 :name: amdgpu-elf-relocation-records-table
693
694 ========================== ===== ========== ==============================
695 Relocation Type Value Field Calculation
696 ========================== ===== ========== ==============================
697 ``R_AMDGPU_NONE`` 0 *none* *none*
698 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
699 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
700 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
701 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
702 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
703 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
704 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
705 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
706 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
707 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
708 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
709 ========================== ===== ========== ==============================
710
711.. _amdgpu-dwarf:
712
713DWARF
714-----
715
716Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
717information that maps the code object executable code and data to the source
718language constructs. It can be used by tools such as debuggers and profilers.
719
720Address Space Mapping
721~~~~~~~~~~~~~~~~~~~~~
722
723The following address space mapping is used:
724
725 .. table:: AMDGPU DWARF Address Space Mapping
726 :name: amdgpu-dwarf-address-space-mapping-table
727
728 =================== =================
729 DWARF Address Space Memory Space
730 =================== =================
731 1 Private (Scratch)
732 2 Local (group/LDS)
733 *omitted* Global
734 *omitted* Constant
735 *omitted* Generic (Flat)
736 *not supported* Region (GDS)
737 =================== =================
738
739See :ref:`amdgpu-address-spaces` for information on the memory space terminology
740used in the table.
741
742An ``address_class`` attribute is generated on pointer type DIEs to specify the
743DWARF address space of the value of the pointer when it is in the *private* or
744*local* address space. Otherwise the attribute is omitted.
745
746An ``XDEREF`` operation is generated in location list expressions for variables
747that are allocated in the *private* and *local* address space. Otherwise no
748``XDREF`` is omitted.
749
750Register Mapping
751~~~~~~~~~~~~~~~~
752
753*This section is WIP.*
754
755.. TODO
756 Define DWARF register enumeration.
757
758 If want to present a wavefront state then should expose vector registers as
759 64 wide (rather than per work-item view that LLVM uses). Either as separate
760 registers, or a 64x4 byte single register. In either case use a new LANE op
761 (akin to XDREF) to select the current lane usage in a location
762 expression. This would also allow scalar register spilling to vector register
763 lanes to be expressed (currently no debug information is being generated for
764 spilling). If choose a wide single register approach then use LANE in
765 conjunction with PIECE operation to select the dword part of the register for
766 the current lane. If the separate register approach then use LANE to select
767 the register.
768
769Source Text
770~~~~~~~~~~~
771
772*This section is WIP.*
773
774.. TODO
775 DWARF extension to include runtime generated source text.
776
777.. _amdgpu-code-conventions:
778
779Code Conventions
780================
781
782This section provides code conventions used for each supported target triple OS
783(see :ref:`amdgpu-target-triples`).
784
785AMDHSA
786------
787
788This section provides code conventions used when the target triple OS is
789``amdhsa`` (see :ref:`amdgpu-target-triples`).
790
791.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000792
793Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000794~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000795
Tony Tye46d35762017-08-15 20:47:41 +0000796The code object metadata specifies extensible metadata associated with the code
797objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
798[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
799(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
800``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
801information necessary to support the ROCM kernel queries. For example, the
802segment sizes needed in a dispatch packet. In addition, a high level language
803runtime may require other information to be included. For example, the AMD
804OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000805
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000806The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000807:doc:`YamlIO`).
808
Tony Tye46d35762017-08-15 20:47:41 +0000809.. TODO
810 Is the string null terminated? It probably should not if YAML allows it to
811 contain null characters, otherwise it should be.
812
Tony Tyef16a45e2017-06-06 20:31:59 +0000813The metadata is represented as a single YAML document comprised of the mapping
814defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
815referenced tables.
816
817For boolean values, the string values of ``false`` and ``true`` are used for
818false and true respectively.
819
820Additional information can be added to the mappings. To avoid conflicts, any
821non-AMD key names should be prefixed by "*vendor-name*.".
822
823 .. table:: AMDHSA Code Object Metadata Mapping
824 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
825
826 ========== ============== ========= =======================================
827 String Key Value Type Required? Description
828 ========== ============== ========= =======================================
829 "Version" sequence of Required - The first integer is the major
830 2 integers version. Currently 1.
831 - The second integer is the minor
832 version. Currently 0.
833 "Printf" sequence of Each string is encoded information
834 strings about a printf function call. The
835 encoded information is organized as
836 fields separated by colon (':'):
837
838 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
839
840 where:
841
842 ``ID``
843 A 32 bit integer as a unique id for
844 each printf function call
845
846 ``N``
847 A 32 bit integer equal to the number
848 of arguments of printf function call
849 minus 1
850
851 ``S[i]`` (where i = 0, 1, ... , N-1)
852 32 bit integers for the size in bytes
853 of the i-th FormatString argument of
854 the printf function call
855
856 FormatString
857 The format string passed to the
858 printf function call.
859 "Kernels" sequence of Required Sequence of the mappings for each
860 mapping kernel in the code object. See
861 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
862 for the definition of the mapping.
863 ========== ============== ========= =======================================
864
865..
866
867 .. table:: AMDHSA Code Object Kernel Metadata Mapping
868 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
869
870 ================= ============== ========= ================================
871 String Key Value Type Required? Description
872 ================= ============== ========= ================================
873 "Name" string Required Source name of the kernel.
874 "SymbolName" string Required Name of the kernel
875 descriptor ELF symbol.
876 "Language" string Source language of the kernel.
877 Values include:
878
879 - "OpenCL C"
880 - "OpenCL C++"
881 - "HCC"
882 - "OpenMP"
883
884 "LanguageVersion" sequence of - The first integer is the major
885 2 integers version.
886 - The second integer is the
887 minor version.
888 "Attrs" mapping Mapping of kernel attributes.
889 See
890 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
891 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +0000892 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +0000893 mapping kernel arguments. See
894 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
895 for the definition of the mapping.
896 "CodeProps" mapping Mapping of properties related to
897 the kernel code. See
898 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
899 for the mapping definition.
900 "DebugProps" mapping Mapping of properties related to
901 the kernel debugging. See
902 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
903 for the mapping definition.
904 ================= ============== ========= ================================
905
906..
907
908 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
909 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
910
911 =================== ============== ========= ==============================
912 String Key Value Type Required? Description
913 =================== ============== ========= ==============================
914 "ReqdWorkGroupSize" sequence of The dispatch work-group size
915 3 integers X, Y, Z must correspond to the
916 specified values.
917
918 Corresponds to the OpenCL
919 ``reqd_work_group_size``
920 attribute.
921 "WorkGroupSizeHint" sequence of The dispatch work-group size
922 3 integers X, Y, Z is likely to be the
923 specified values.
924
925 Corresponds to the OpenCL
926 ``work_group_size_hint``
927 attribute.
928 "VecTypeHint" string The name of a scalar or vector
929 type.
930
931 Corresponds to the OpenCL
932 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +0000933
934 "RuntimeHandle" string The external symbol name
935 associated with a kernel.
936 OpenCL runtime allocates a
937 global buffer for the symbol
938 and saves the kernel's address
939 to it, which is used for
940 device side enqueueing. Only
941 available for device side
942 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +0000943 =================== ============== ========= ==============================
944
945..
946
947 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
948 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
949
950 ================= ============== ========= ================================
951 String Key Value Type Required? Description
952 ================= ============== ========= ================================
953 "Name" string Kernel argument name.
954 "TypeName" string Kernel argument type name.
955 "Size" integer Required Kernel argument size in bytes.
956 "Align" integer Required Kernel argument alignment in
957 bytes. Must be a power of two.
958 "ValueKind" string Required Kernel argument kind that
959 specifies how to set up the
960 corresponding argument.
961 Values include:
962
963 "ByValue"
964 The argument is copied
965 directly into the kernarg.
966
967 "GlobalBuffer"
968 A global address space pointer
969 to the buffer data is passed
970 in the kernarg.
971
972 "DynamicSharedPointer"
973 A group address space pointer
974 to dynamically allocated LDS
975 is passed in the kernarg.
976
977 "Sampler"
978 A global address space
979 pointer to a S# is passed in
980 the kernarg.
981
982 "Image"
983 A global address space
984 pointer to a T# is passed in
985 the kernarg.
986
987 "Pipe"
988 A global address space pointer
989 to an OpenCL pipe is passed in
990 the kernarg.
991
992 "Queue"
993 A global address space pointer
994 to an OpenCL device enqueue
995 queue is passed in the
996 kernarg.
997
998 "HiddenGlobalOffsetX"
999 The OpenCL grid dispatch
1000 global offset for the X
1001 dimension is passed in the
1002 kernarg.
1003
1004 "HiddenGlobalOffsetY"
1005 The OpenCL grid dispatch
1006 global offset for the Y
1007 dimension is passed in the
1008 kernarg.
1009
1010 "HiddenGlobalOffsetZ"
1011 The OpenCL grid dispatch
1012 global offset for the Z
1013 dimension is passed in the
1014 kernarg.
1015
1016 "HiddenNone"
1017 An argument that is not used
1018 by the kernel. Space needs to
1019 be left for it, but it does
1020 not need to be set up.
1021
1022 "HiddenPrintfBuffer"
1023 A global address space pointer
1024 to the runtime printf buffer
1025 is passed in kernarg.
1026
1027 "HiddenDefaultQueue"
1028 A global address space pointer
1029 to the OpenCL device enqueue
1030 queue that should be used by
1031 the kernel by default is
1032 passed in the kernarg.
1033
1034 "HiddenCompletionAction"
1035 *TBD*
1036
1037 .. TODO
1038 Add description.
1039
1040 "ValueType" string Required Kernel argument value type. Only
1041 present if "ValueKind" is
1042 "ByValue". For vector data
1043 types, the value is for the
1044 element type. Values include:
1045
1046 - "Struct"
1047 - "I8"
1048 - "U8"
1049 - "I16"
1050 - "U16"
1051 - "F16"
1052 - "I32"
1053 - "U32"
1054 - "F32"
1055 - "I64"
1056 - "U64"
1057 - "F64"
1058
1059 .. TODO
1060 How can it be determined if a
1061 vector type, and what size
1062 vector?
1063 "PointeeAlign" integer Alignment in bytes of pointee
1064 type for pointer type kernel
1065 argument. Must be a power
1066 of 2. Only present if
1067 "ValueKind" is
1068 "DynamicSharedPointer".
1069 "AddrSpaceQual" string Kernel argument address space
1070 qualifier. Only present if
1071 "ValueKind" is "GlobalBuffer" or
1072 "DynamicSharedPointer". Values
1073 are:
1074
1075 - "Private"
1076 - "Global"
1077 - "Constant"
1078 - "Local"
1079 - "Generic"
1080 - "Region"
1081
1082 .. TODO
1083 Is GlobalBuffer only Global
1084 or Constant? Is
1085 DynamicSharedPointer always
1086 Local? Can HCC allow Generic?
1087 How can Private or Region
1088 ever happen?
1089 "AccQual" string Kernel argument access
1090 qualifier. Only present if
1091 "ValueKind" is "Image" or
1092 "Pipe". Values
1093 are:
1094
1095 - "ReadOnly"
1096 - "WriteOnly"
1097 - "ReadWrite"
1098
1099 .. TODO
1100 Does this apply to
1101 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001102 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001103 performed by the kernel on the
1104 kernel argument. Only present if
1105 "ValueKind" is "GlobalBuffer",
1106 "Image", or "Pipe". This may be
1107 more restrictive than indicated
1108 by "AccQual" to reflect what the
1109 kernel actual does. If not
1110 present then the runtime must
1111 assume what is implied by
1112 "AccQual" and "IsConst". Values
1113 are:
1114
1115 - "ReadOnly"
1116 - "WriteOnly"
1117 - "ReadWrite"
1118
1119 "IsConst" boolean Indicates if the kernel argument
1120 is const qualified. Only present
1121 if "ValueKind" is
1122 "GlobalBuffer".
1123
1124 "IsRestrict" boolean Indicates if the kernel argument
1125 is restrict qualified. Only
1126 present if "ValueKind" is
1127 "GlobalBuffer".
1128
1129 "IsVolatile" boolean Indicates if the kernel argument
1130 is volatile qualified. Only
1131 present if "ValueKind" is
1132 "GlobalBuffer".
1133
1134 "IsPipe" boolean Indicates if the kernel argument
1135 is pipe qualified. Only present
1136 if "ValueKind" is "Pipe".
1137
1138 .. TODO
1139 Can GlobalBuffer be pipe
1140 qualified?
1141 ================= ============== ========= ================================
1142
1143..
1144
1145 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1146 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1147
1148 ============================ ============== ========= =====================
1149 String Key Value Type Required? Description
1150 ============================ ============== ========= =====================
1151 "KernargSegmentSize" integer Required The size in bytes of
1152 the kernarg segment
1153 that holds the values
1154 of the arguments to
1155 the kernel.
1156 "GroupSegmentFixedSize" integer Required The amount of group
1157 segment memory
1158 required by a
1159 work-group in
1160 bytes. This does not
1161 include any
1162 dynamically allocated
1163 group segment memory
1164 that may be added
1165 when the kernel is
1166 dispatched.
1167 "PrivateSegmentFixedSize" integer Required The amount of fixed
1168 private address space
1169 memory required for a
1170 work-item in
1171 bytes. If
1172 IsDynamicCallstack
1173 is 1 then additional
1174 space must be added
1175 to this value for the
1176 call stack.
1177 "KernargSegmentAlign" integer Required The maximum byte
1178 alignment of
1179 arguments in the
1180 kernarg segment. Must
1181 be a power of 2.
1182 "WavefrontSize" integer Required Wavefront size. Must
1183 be a power of 2.
1184 "NumSGPRs" integer Number of scalar
1185 registers used by a
1186 wavefront for
1187 GFX6-GFX9. This
1188 includes the special
1189 SGPRs for VCC, Flat
1190 Scratch (GFX7-GFX9)
1191 and XNACK (for
1192 GFX8-GFX9). It does
1193 not include the 16
1194 SGPR added if a trap
1195 handler is
1196 enabled. It is not
1197 rounded up to the
1198 allocation
1199 granularity.
1200 "NumVGPRs" integer Number of vector
1201 registers used by
1202 each work-item for
1203 GFX6-GFX9
1204 "MaxFlatWorkgroupSize" integer Maximum flat
1205 work-group size
1206 supported by the
1207 kernel in work-items.
1208 "IsDynamicCallStack" boolean Indicates if the
1209 generated machine
1210 code is using a
1211 dynamically sized
1212 call stack.
1213 "IsXNACKEnabled" boolean Indicates if the
1214 generated machine
1215 code is capable of
1216 supporting XNACK.
1217 ============================ ============== ========= =====================
1218
1219..
1220
1221 .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
1222 :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
1223
1224 =================================== ============== ========= ==============
1225 String Key Value Type Required? Description
1226 =================================== ============== ========= ==============
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001227 "DebuggerABIVersion" sequence of
1228 2 integers
Tony Tyef16a45e2017-06-06 20:31:59 +00001229 "ReservedNumVGPRs" integer
1230 "ReservedFirstVGPR" integer
1231 "PrivateSegmentBufferSGPR" integer
1232 "WavefrontPrivateSegmentOffsetSGPR" integer
1233 =================================== ============== ========= ==============
1234
1235.. TODO
1236 Plan to remove the debug properties metadata.
1237
Tony Tyef16a45e2017-06-06 20:31:59 +00001238Kernel Dispatch
1239~~~~~~~~~~~~~~~
1240
1241The HSA architected queuing language (AQL) defines a user space memory interface
1242that can be used to control the dispatch of kernels, in an agent independent
1243way. An agent can have zero or more AQL queues created for it using the ROCm
1244runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1245*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1246mechanics and packet layouts.
1247
1248The packet processor of a kernel agent is responsible for detecting and
1249dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1250packet processor is implemented by the hardware command processor (CP),
1251asynchronous dispatch controller (ADC) and shader processor input controller
1252(SPI).
1253
1254The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1255mode driver to initialize and register the AQL queue with CP.
1256
1257To dispatch a kernel the following actions are performed. This can occur in the
1258CPU host program, or from an HSA kernel executing on a GPU.
1259
12601. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1261 executed is obtained.
12622. A pointer to the kernel descriptor (see
1263 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1264 obtained. It must be for a kernel that is contained in a code object that that
1265 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1266 associated.
12673. Space is allocated for the kernel arguments using the ROCm runtime allocator
1268 for a memory region with the kernarg property for the kernel agent that will
1269 execute the kernel. It must be at least 16 byte aligned.
12704. Kernel argument values are assigned to the kernel argument memory
1271 allocation. The layout is defined in the *HSA Programmer’s Language Reference*
1272 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1273 memory in the same way constant memory is accessed. (Note that the HSA
1274 specification allows an implementation to copy the kernel argument contents to
1275 another location that is accessed by the kernel.)
12765. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1277 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1278 packet. The packet must be set up, and the final write must use an atomic
1279 store release to set the packet kind to ensure the packet contents are
1280 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1281 notify the kernel agent that the AQL queue has been updated. These rules, and
1282 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1283 System Architecture Specification* [HSA]_.
12846. A kernel dispatch packet includes information about the actual dispatch,
1285 such as grid and work-group size, together with information from the code
1286 object about the kernel, such as segment sizes. The ROCm runtime queries on
1287 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001288 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000012897. CP executes micro-code and is responsible for detecting and setting up the
1290 GPU to execute the wavefronts of a kernel dispatch.
12918. CP ensures that when the a wavefront starts executing the kernel machine
1292 code, the scalar general purpose registers (SGPR) and vector general purpose
1293 registers (VGPR) are set up as required by the machine code. The required
1294 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1295 register state is defined in
1296 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
12979. The prolog of the kernel machine code (see
1298 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1299 before continuing executing the machine code that corresponds to the kernel.
130010. When the kernel dispatch has completed execution, CP signals the completion
1301 signal specified in the kernel dispatch packet if not 0.
1302
1303.. _amdgpu-amdhsa-memory-spaces:
1304
1305Memory Spaces
1306~~~~~~~~~~~~~
1307
1308The memory space properties are:
1309
1310 .. table:: AMDHSA Memory Spaces
1311 :name: amdgpu-amdhsa-memory-spaces-table
1312
1313 ================= =========== ======== ======= ==================
1314 Memory Space Name HSA Segment Hardware Address NULL Value
1315 Name Name Size
1316 ================= =========== ======== ======= ==================
1317 Private private scratch 32 0x00000000
1318 Local group LDS 32 0xFFFFFFFF
1319 Global global global 64 0x0000000000000000
1320 Constant constant *same as 64 0x0000000000000000
1321 global*
1322 Generic flat flat 64 0x0000000000000000
1323 Region N/A GDS 32 *not implemented
1324 for AMDHSA*
1325 ================= =========== ======== ======= ==================
1326
1327The global and constant memory spaces both use global virtual addresses, which
1328are the same virtual address space used by the CPU. However, some virtual
1329addresses may only be accessible to the CPU, some only accessible by the GPU,
1330and some by both.
1331
1332Using the constant memory space indicates that the data will not change during
1333the execution of the kernel. This allows scalar read instructions to be
1334used. The vector and scalar L1 caches are invalidated of volatile data before
1335each kernel dispatch execution to allow constant memory to change values between
1336kernel dispatches.
1337
1338The local memory space uses the hardware Local Data Store (LDS) which is
1339automatically allocated when the hardware creates work-groups of wavefronts, and
1340freed when all the wavefronts of a work-group have terminated. The data store
1341(DS) instructions can be used to access it.
1342
1343The private memory space uses the hardware scratch memory support. If the kernel
1344uses scratch, then the hardware allocates memory that is accessed using
1345wavefront lane dword (4 byte) interleaving. The mapping used from private
1346address to physical address is:
1347
1348 ``wavefront-scratch-base +
1349 (private-address * wavefront-size * 4) +
1350 (wavefront-lane-id * 4)``
1351
1352There are different ways that the wavefront scratch base address is determined
1353by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1354memory can be accessed in an interleaved manner using buffer instruction with
1355the scratch buffer descriptor and per wave scratch offset, by the scratch
1356instructions, or by flat instructions. If each lane of a wavefront accesses the
1357same private address, the interleaving results in adjacent dwords being accessed
1358and hence requires fewer cache lines to be fetched. Multi-dword access is not
1359supported except by flat and scratch instructions in GFX9.
1360
1361The generic address space uses the hardware flat address support available in
1362GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1363local appertures), that are outside the range of addressible global memory, to
1364map from a flat address to a private or local address.
1365
1366FLAT instructions can take a flat address and access global, private (scratch)
1367and group (LDS) memory depending in if the address is within one of the
1368apperture ranges. Flat access to scratch requires hardware aperture setup and
1369setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1370access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1371(see :ref:`amdgpu-amdhsa-m0`).
1372
1373To convert between a segment address and a flat address the base address of the
1374appertures address can be used. For GFX7-GFX8 these are available in the
1375:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1376Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1377GFX9 the appature base addresses are directly available as inline constant
1378registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1379address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1380which makes it easier to convert from flat to segment or segment to flat.
1381
Tony Tye46d35762017-08-15 20:47:41 +00001382Image and Samplers
1383~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001384
1385Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1386hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1387HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1388enumeration values for the queries that are not trivially deducible from the S#
1389representation.
1390
1391HSA Signals
1392~~~~~~~~~~~
1393
Tony Tye46d35762017-08-15 20:47:41 +00001394HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1395structure allocated in memory accessible from both the CPU and GPU. The
1396structure is defined by the ROCm runtime and subject to change between releases
1397(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001398
1399.. _amdgpu-amdhsa-hsa-aql-queue:
1400
1401HSA AQL Queue
1402~~~~~~~~~~~~~
1403
Tony Tye46d35762017-08-15 20:47:41 +00001404The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001405between releases (see [AMD-ROCm-github]_). For some processors it contains
1406fields needed to implement certain language features such as the flat address
1407aperture bases. It also contains fields used by CP such as managing the
1408allocation of scratch memory.
1409
1410.. _amdgpu-amdhsa-kernel-descriptor:
1411
1412Kernel Descriptor
1413~~~~~~~~~~~~~~~~~
1414
1415A kernel descriptor consists of the information needed by CP to initiate the
1416execution of a kernel, including the entry point address of the machine code
1417that implements the kernel.
1418
1419Kernel Descriptor for GFX6-GFX9
1420+++++++++++++++++++++++++++++++
1421
1422CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1423
1424 .. table:: Kernel Descriptor for GFX6-GFX9
1425 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1426
1427 ======= ======= =============================== ===========================
1428 Bits Size Field Name Description
1429 ======= ======= =============================== ===========================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001430 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tyef16a45e2017-06-06 20:31:59 +00001431 address space memory
1432 required for a work-group
1433 in bytes. This does not
1434 include any dynamically
1435 allocated local address
1436 space memory that may be
1437 added when the kernel is
1438 dispatched.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001439 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tyef16a45e2017-06-06 20:31:59 +00001440 private address space
1441 memory required for a
1442 work-item in bytes. If
1443 is_dynamic_callstack is 1
1444 then additional space must
1445 be added to this value for
1446 the call stack.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001447 95:64 4 bytes MaxFlatWorkgroupSize Maximum flat work-group
Tony Tyef16a45e2017-06-06 20:31:59 +00001448 size supported by the
1449 kernel in work-items.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001450 96 1 bit IsDynamicCallStack Indicates if the generated
Tony Tyef16a45e2017-06-06 20:31:59 +00001451 machine code is using a
1452 dynamically sized call
1453 stack.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001454 97 1 bit IsXNACKEnabled Indicates if the generated
Tony Tyef16a45e2017-06-06 20:31:59 +00001455 machine code is capable of
1456 suppoting XNACK.
1457 127:98 30 bits Reserved. Must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001458 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tyef16a45e2017-06-06 20:31:59 +00001459 negative) from base
1460 address of kernel
1461 descriptor to kernel's
1462 entry point instruction
1463 which must be 256 byte
1464 aligned.
1465 383:192 24 Reserved. Must be 0.
1466 bytes
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001467 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001468 program settings used by
1469 CP to set up
1470 ``COMPUTE_PGM_RSRC1``
1471 configuration
1472 register. See
1473 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001474 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001475 program settings used by
1476 CP to set up
1477 ``COMPUTE_PGM_RSRC2``
1478 configuration
1479 register. See
1480 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001481 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1482 SGPR user data registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001483 (see
1484 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1485
1486 The total number of SGPR
1487 user data registers
1488 requested must not exceed
1489 16 and match value in
1490 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1491 Any requests beyond 16
1492 will be ignored.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001493 449 1 bit EnableSGPRDispatchPtr *see above*
1494 450 1 bit EnableSGPRQueuePtr *see above*
1495 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1496 452 1 bit EnableSGPRDispatchID *see above*
1497 453 1 bit EnableSGPRFlatScratchInit *see above*
1498 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1499 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1500 should always be 0.
1501 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1502 should always be 0.
1503 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1504 should always be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001505 463:458 6 bits Reserved. Must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001506 511:464 6 Reserved. Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001507 bytes
1508 512 **Total size 64 bytes.**
1509 ======= ===================================================================
1510
1511..
1512
1513 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1514 :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
1515
Tony Tye3b340612017-06-07 00:46:08 +00001516 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001517 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001518 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001519 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001520 used by each work-item,
1521 granularity is device
1522 specific:
1523
1524 GFX6-9
1525 roundup((max-vgpg + 1)
1526 / 4) - 1
1527
1528 Used by CP to set up
1529 ``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001530 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001531 used by a wavefront,
1532 granularity is device
1533 specific:
1534
1535 GFX6-8
1536 roundup((max-sgpg + 1)
1537 / 8) - 1
1538 GFX9
1539 roundup((max-sgpg + 1)
1540 / 16) - 1
1541
1542 Includes the special SGPRs
1543 for VCC, Flat Scratch (for
1544 GFX7 onwards) and XNACK
1545 (for GFX8 onwards). It does
1546 not include the 16 SGPR
1547 added if a trap handler is
1548 enabled.
1549
1550 Used by CP to set up
1551 ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001552 11:10 2 bits PRIORITY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001553
1554 Start executing wavefront
1555 at the specified priority.
1556
1557 CP is responsible for
1558 filling in
1559 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001560 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001561 with specified rounding
1562 mode for single (32
1563 bit) floating point
1564 precision floating point
1565 operations.
1566
1567 Floating point rounding
1568 mode values are defined in
1569 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1570
1571 Used by CP to set up
1572 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001573 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001574 with specified rounding
1575 denorm mode for half/double (16
1576 and 64 bit) floating point
1577 precision floating point
1578 operations.
1579
1580 Floating point rounding
1581 mode values are defined in
1582 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1583
1584 Used by CP to set up
1585 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001586 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001587 with specified denorm mode
1588 for single (32
1589 bit) floating point
1590 precision floating point
1591 operations.
1592
1593 Floating point denorm mode
1594 values are defined in
1595 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1596
1597 Used by CP to set up
1598 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001599 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001600 with specified denorm mode
1601 for half/double (16
1602 and 64 bit) floating point
1603 precision floating point
1604 operations.
1605
1606 Floating point denorm mode
1607 values are defined in
1608 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1609
1610 Used by CP to set up
1611 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001612 20 1 bit PRIV Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001613
1614 Start executing wavefront
1615 in privilege trap handler
1616 mode.
1617
1618 CP is responsible for
1619 filling in
1620 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001621 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001622 with DX10 clamp mode
1623 enabled. Used by the vector
1624 ALU to force DX-10 style
1625 treatment of NaN's (when
1626 set, clamp NaN to zero,
1627 otherwise pass NaN
1628 through).
1629
1630 Used by CP to set up
1631 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001632 22 1 bit DEBUG_MODE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001633
1634 Start executing wavefront
1635 in single step mode.
1636
1637 CP is responsible for
1638 filling in
1639 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001640 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001641 with IEEE mode
1642 enabled. Floating point
1643 opcodes that support
1644 exception flag gathering
1645 will quiet and propagate
1646 signaling-NaN inputs per
1647 IEEE 754-2008. Min_dx10 and
1648 max_dx10 become IEEE
1649 754-2008 compliant due to
1650 signaling-NaN propagation
1651 and quieting.
1652
1653 Used by CP to set up
1654 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001655 24 1 bit BULKY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001656
1657 Only one work-group allowed
1658 to execute on a compute
1659 unit.
1660
1661 CP is responsible for
1662 filling in
1663 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001664 25 1 bit CDBG_USER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001665
1666 Flag that can be used to
1667 control debugging code.
1668
1669 CP is responsible for
1670 filling in
1671 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001672 26 1 bit FP16_OVFL GFX6-8:
1673 Reserved. Must be 0.
1674 GFX9:
1675 Wavefront starts
1676 execution with specified
1677 fp16 overflow mode.
1678
1679 - If 0, then fp16
1680 overflow generates
1681 +/-INF values.
1682 - If 1, then fp16
1683 overflow that is the
1684 result of an +/-INF
1685 input value or divide
1686 by 0 generates a
1687 +/-INF, otherwise
1688 clamps computed
1689 overflow to +/-MAX_FP16
1690 as appropriate.
1691
1692 Used by CP to set up
1693 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1694 31:27 5 bits Reserved. Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001695 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001696 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001697
1698..
1699
1700 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1701 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1702
Tony Tye3b340612017-06-07 00:46:08 +00001703 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001704 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001705 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001706 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1707 _WAVE_OFFSET SGPR wave scratch offset
Tony Tyef16a45e2017-06-06 20:31:59 +00001708 system register (see
1709 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1710
1711 Used by CP to set up
1712 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001713 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tyef16a45e2017-06-06 20:31:59 +00001714 user data registers
1715 requested. This number must
1716 match the number of user
1717 data registers enabled.
1718
1719 Used by CP to set up
1720 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001721 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tyef16a45e2017-06-06 20:31:59 +00001722 TRAP instruction which
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00001723 requires a trap handler to
Tony Tyef16a45e2017-06-06 20:31:59 +00001724 be enabled.
1725
1726 CP sets
1727 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1728 if the runtime has
1729 installed a trap handler
1730 regardless of the setting
1731 of this field.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001732 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001733 system SGPR register for
1734 the work-group id in the X
1735 dimension (see
1736 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1737
1738 Used by CP to set up
1739 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001740 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001741 system SGPR register for
1742 the work-group id in the Y
1743 dimension (see
1744 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1745
1746 Used by CP to set up
1747 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001748 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001749 system SGPR register for
1750 the work-group id in the Z
1751 dimension (see
1752 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1753
1754 Used by CP to set up
1755 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001756 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001757 system SGPR register for
1758 work-group information (see
1759 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1760
1761 Used by CP to set up
1762 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001763 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001764 VGPR system registers used
1765 for the work-item ID.
1766 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1767 defines the values.
1768
1769 Used by CP to set up
1770 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001771 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001772
1773 Wavefront starts execution
1774 with address watch
1775 exceptions enabled which
1776 are generated when L1 has
1777 witnessed a thread access
1778 an *address of
1779 interest*.
1780
1781 CP is responsible for
1782 filling in the address
1783 watch bit in
1784 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1785 according to what the
1786 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001787 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001788
1789 Wavefront starts execution
1790 with memory violation
1791 exceptions exceptions
1792 enabled which are generated
1793 when a memory violation has
1794 occurred for this wave from
1795 L1 or LDS
1796 (write-to-read-only-memory,
1797 mis-aligned atomic, LDS
1798 address out of range,
1799 illegal address, etc.).
1800
1801 CP sets the memory
1802 violation bit in
1803 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1804 according to what the
1805 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001806 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001807
1808 CP uses the rounded value
1809 from the dispatch packet,
1810 not this value, as the
1811 dispatch may contain
1812 dynamically allocated group
1813 segment memory. CP writes
1814 directly to
1815 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1816
1817 Amount of group segment
1818 (LDS) to allocate for each
1819 work-group. Granularity is
1820 device specific:
1821
1822 GFX6:
1823 roundup(lds-size / (64 * 4))
1824 GFX7-GFX9:
1825 roundup(lds-size / (128 * 4))
1826
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001827 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1828 _INVALID_OPERATION with specified exceptions
Tony Tyef16a45e2017-06-06 20:31:59 +00001829 enabled.
1830
1831 Used by CP to set up
1832 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1833 (set from bits 0..6).
1834
1835 IEEE 754 FP Invalid
1836 Operation
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001837 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1838 _SOURCE input operands is a
Tony Tyef16a45e2017-06-06 20:31:59 +00001839 denormal number
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001840 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1841 _DIVISION_BY_ZERO Zero
1842 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1843 _OVERFLOW
1844 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1845 _UNDERFLOW
1846 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1847 _INEXACT
1848 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1849 _ZERO (rcp_iflag_f32 instruction
Tony Tyef16a45e2017-06-06 20:31:59 +00001850 only)
1851 31 1 bit Reserved. Must be 0.
1852 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001853 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001854
1855..
1856
1857 .. table:: Floating Point Rounding Mode Enumeration Values
1858 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1859
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001860 ====================================== ===== ==============================
1861 Enumeration Name Value Description
1862 ====================================== ===== ==============================
1863 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1864 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1865 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1866 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1867 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001868
1869..
1870
1871 .. table:: Floating Point Denorm Mode Enumeration Values
1872 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1873
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001874 ====================================== ===== ==============================
1875 Enumeration Name Value Description
1876 ====================================== ===== ==============================
1877 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1878 Denorms
1879 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1880 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1881 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1882 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001883
1884..
1885
1886 .. table:: System VGPR Work-Item ID Enumeration Values
1887 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1888
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001889 ======================================== ===== ============================
1890 Enumeration Name Value Description
1891 ======================================== ===== ============================
1892 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1893 ID.
1894 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1895 dimensions ID.
1896 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1897 dimensions ID.
1898 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1899 ======================================== ===== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001900
1901.. _amdgpu-amdhsa-initial-kernel-execution-state:
1902
1903Initial Kernel Execution State
1904~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1905
1906This section defines the register state that will be set up by the packet
1907processor prior to the start of execution of every wavefront. This is limited by
1908the constraints of the hardware controllers of CP/ADC/SPI.
1909
1910The order of the SGPR registers is defined, but the compiler can specify which
1911ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1912fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1913for enabled registers are dense starting at SGPR0: the first enabled register is
1914SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1915an SGPR number.
1916
1917The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1918all waves of the grid. It is possible to specify more than 16 User SGPRs using
1919the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1920initialized. These are then immediately followed by the System SGPRs that are
1921set up by ADC/SPI and can have different values for each wave of the grid
1922dispatch.
1923
1924SGPR register initial state is defined in
1925:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1926
1927 .. table:: SGPR Register Set Up Order
1928 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1929
1930 ========== ========================== ====== ==============================
1931 SGPR Order Name Number Description
1932 (kernel descriptor enable of
1933 field) SGPRs
1934 ========== ========================== ====== ==============================
1935 First Private Segment Buffer 4 V# that can be used, together
1936 (enable_sgpr_private with Scratch Wave Offset as an
1937 _segment_buffer) offset, to access the private
1938 memory space using a segment
1939 address.
1940
1941 CP uses the value provided by
1942 the runtime.
1943 then Dispatch Ptr 2 64 bit address of AQL dispatch
1944 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
1945 actually executing.
1946 then Queue Ptr 2 64 bit address of amd_queue_t
1947 (enable_sgpr_queue_ptr) object for AQL queue on which
1948 the dispatch packet was
1949 queued.
1950 then Kernarg Segment Ptr 2 64 bit address of Kernarg
1951 (enable_sgpr_kernarg segment. This is directly
1952 _segment_ptr) copied from the
1953 kernarg_address in the kernel
1954 dispatch packet.
1955
1956 Having CP load it once avoids
1957 loading it at the beginning of
1958 every wavefront.
1959 then Dispatch Id 2 64 bit Dispatch ID of the
1960 (enable_sgpr_dispatch_id) dispatch packet being
1961 executed.
1962 then Flat Scratch Init 2 This is 2 SGPRs:
1963 (enable_sgpr_flat_scratch
1964 _init) GFX6
1965 Not supported.
1966 GFX7-GFX8
1967 The first SGPR is a 32 bit
1968 byte offset from
1969 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1970 to per SPI base of memory
1971 for scratch for the queue
1972 executing the kernel
1973 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00001974 from the runtime. (The
1975 Scratch Segment Buffer base
1976 address is
1977 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1978 plus this offset.) The value
1979 of Scratch Wave Offset must
1980 be added to this offset by
1981 the kernel machine code,
1982 right shifted by 8, and
1983 moved to the FLAT_SCRATCH_HI
1984 SGPR register.
1985 FLAT_SCRATCH_HI corresponds
1986 to SGPRn-4 on GFX7, and
1987 SGPRn-6 on GFX8 (where SGPRn
1988 is the highest numbered SGPR
1989 allocated to the wave).
1990 FLAT_SCRATCH_HI is
1991 multiplied by 256 (as it is
1992 in units of 256 bytes) and
1993 added to
1994 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1995 to calculate the per wave
1996 FLAT SCRATCH BASE in flat
1997 memory instructions that
1998 access the scratch
1999 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002000
2001 The second SGPR is 32 bit
2002 byte size of a single
2003 work-item’s scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002004 usage. CP obtains this from
2005 the runtime, and it is
2006 always a multiple of DWORD.
2007 CP checks that the value in
2008 the kernel dispatch packet
2009 Private Segment Byte Size is
2010 not larger, and requests the
2011 runtime to increase the
2012 queue's scratch size if
2013 necessary. The kernel code
2014 must move it to
2015 FLAT_SCRATCH_LO which is
2016 SGPRn-3 on GFX7 and SGPRn-5
2017 on GFX8. FLAT_SCRATCH_LO is
2018 used as the FLAT SCRATCH
2019 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002020 instructions. Having CP load
2021 it once avoids loading it at
2022 the beginning of every
Tony Tye46d35762017-08-15 20:47:41 +00002023 wavefront. GFX9 This is the
2024 64 bit base address of the
2025 per SPI scratch backing
2026 memory managed by SPI for
2027 the queue executing the
2028 kernel dispatch. CP obtains
2029 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002030 divides it if there are
2031 multiple Shader Arrays each
2032 with its own SPI). The value
2033 of Scratch Wave Offset must
2034 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002035 machine code and the result
2036 moved to the FLAT_SCRATCH
2037 SGPR which is SGPRn-6 and
2038 SGPRn-5. It is used as the
2039 FLAT SCRATCH BASE in flat
2040 memory instructions. then
2041 Private Segment Size 1 The
2042 32 bit byte size of a
2043 (enable_sgpr_private single
2044 work-item's
2045 scratch_segment_size) memory
2046 allocation. This is the
2047 value from the kernel
2048 dispatch packet Private
2049 Segment Byte Size rounded up
2050 by CP to a multiple of
2051 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002052
2053 Having CP load it once avoids
2054 loading it at the beginning of
2055 every wavefront.
2056
2057 This is not used for
2058 GFX7-GFX8 since it is the same
2059 value as the second SGPR of
2060 Flat Scratch Init. However, it
2061 may be needed for GFX9 which
2062 changes the meaning of the
2063 Flat Scratch Init value.
2064 then Grid Work-Group Count X 1 32 bit count of the number of
2065 (enable_sgpr_grid work-groups in the X dimension
2066 _workgroup_count_X) for the grid being
2067 executed. Computed from the
2068 fields in the kernel dispatch
2069 packet as ((grid_size.x +
2070 workgroup_size.x - 1) /
2071 workgroup_size.x).
2072 then Grid Work-Group Count Y 1 32 bit count of the number of
2073 (enable_sgpr_grid work-groups in the Y dimension
2074 _workgroup_count_Y && for the grid being
2075 less than 16 previous executed. Computed from the
2076 SGPRs) fields in the kernel dispatch
2077 packet as ((grid_size.y +
2078 workgroup_size.y - 1) /
2079 workgroupSize.y).
2080
2081 Only initialized if <16
2082 previous SGPRs initialized.
2083 then Grid Work-Group Count Z 1 32 bit count of the number of
2084 (enable_sgpr_grid work-groups in the Z dimension
2085 _workgroup_count_Z && for the grid being
2086 less than 16 previous executed. Computed from the
2087 SGPRs) fields in the kernel dispatch
2088 packet as ((grid_size.z +
2089 workgroup_size.z - 1) /
2090 workgroupSize.z).
2091
2092 Only initialized if <16
2093 previous SGPRs initialized.
2094 then Work-Group Id X 1 32 bit work-group id in X
2095 (enable_sgpr_workgroup_id dimension of grid for
2096 _X) wavefront.
2097 then Work-Group Id Y 1 32 bit work-group id in Y
2098 (enable_sgpr_workgroup_id dimension of grid for
2099 _Y) wavefront.
2100 then Work-Group Id Z 1 32 bit work-group id in Z
2101 (enable_sgpr_workgroup_id dimension of grid for
2102 _Z) wavefront.
2103 then Work-Group Info 1 {first_wave, 14’b0000,
2104 (enable_sgpr_workgroup ordered_append_term[10:0],
2105 _info) threadgroup_size_in_waves[5:0]}
2106 then Scratch Wave Offset 1 32 bit byte offset from base
2107 (enable_sgpr_private of scratch base of queue
2108 _segment_wave_offset) executing the kernel
2109 dispatch. Must be used as an
2110 offset with Private
2111 segment address when using
2112 Scratch Segment Buffer. It
2113 must be used to set up FLAT
2114 SCRATCH for flat addressing
2115 (see
2116 :ref:`amdgpu-amdhsa-flat-scratch`).
2117 ========== ========================== ====== ==============================
2118
2119The order of the VGPR registers is defined, but the compiler can specify which
2120ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2121fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2122for enabled registers are dense starting at VGPR0: the first enabled register is
2123VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2124VGPR number.
2125
2126VGPR register initial state is defined in
2127:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2128
2129 .. table:: VGPR Register Set Up Order
2130 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2131
2132 ========== ========================== ====== ==============================
2133 VGPR Order Name Number Description
2134 (kernel descriptor enable of
2135 field) VGPRs
2136 ========== ========================== ====== ==============================
2137 First Work-Item Id X 1 32 bit work item id in X
2138 (Always initialized) dimension of work-group for
2139 wavefront lane.
2140 then Work-Item Id Y 1 32 bit work item id in Y
2141 (enable_vgpr_workitem_id dimension of work-group for
2142 > 0) wavefront lane.
2143 then Work-Item Id Z 1 32 bit work item id in Z
2144 (enable_vgpr_workitem_id dimension of work-group for
2145 > 1) wavefront lane.
2146 ========== ========================== ====== ==============================
2147
2148The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2149
21501. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2151 registers.
21522. Work-group Id registers X, Y, Z are set by ADC which supports any
2153 combination including none.
21543. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2155 cannot included with the flat scratch init value which is per queue.
21564. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2157 or (X, Y, Z).
2158
2159Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2160value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2161
2162The global segment can be accessed either using buffer instructions (GFX6 which
2163has V# 64 bit address support), flat instructions (GFX7-9), or global
2164instructions (GFX9).
2165
2166If buffer operations are used then the compiler can generate a V# with the
2167following properties:
2168
2169* base address of 0
2170* no swizzle
2171* ATC: 1 if IOMMU present (such as APU)
2172* ptr64: 1
2173* MTYPE set to support memory coherence that matches the runtime (such as CC for
2174 APU and NC for dGPU).
2175
2176.. _amdgpu-amdhsa-kernel-prolog:
2177
2178Kernel Prolog
2179~~~~~~~~~~~~~
2180
2181.. _amdgpu-amdhsa-m0:
2182
2183M0
2184++
2185
2186GFX6-GFX8
2187 The M0 register must be initialized with a value at least the total LDS size
2188 if the kernel may access LDS via DS or flat operations. Total LDS size is
2189 available in dispatch packet. For M0, it is also possible to use maximum
2190 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2191 GFX7-GFX8).
2192GFX9
2193 The M0 register is not used for range checking LDS accesses and so does not
2194 need to be initialized in the prolog.
2195
2196.. _amdgpu-amdhsa-flat-scratch:
2197
2198Flat Scratch
2199++++++++++++
2200
2201If the kernel may use flat operations to access scratch memory, the prolog code
2202must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2203are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2204Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2205
2206GFX6
2207 Flat scratch is not supported.
2208
2209GFX7-8
2210 1. The low word of Flat Scratch Init is 32 bit byte offset from
2211 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2212 being managed by SPI for the queue executing the kernel dispatch. This is
2213 the same value used in the Scratch Segment Buffer V# base address. The
2214 prolog must add the value of Scratch Wave Offset to get the wave's byte
2215 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2216 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2217 by 8 before moving into FLAT_SCRATCH_LO.
2218 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2219 work-items scratch memory usage. This is directly loaded from the kernel
2220 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2221 DWORD. Having CP load it once avoids loading it at the beginning of every
2222 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2223 SIZE.
2224GFX9
2225 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2226 memory being managed by SPI for the queue executing the kernel dispatch. The
2227 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2228 pair for use as the flat scratch base in flat memory instructions.
2229
2230.. _amdgpu-amdhsa-memory-model:
2231
2232Memory Model
2233~~~~~~~~~~~~
2234
2235This section describes the mapping of LLVM memory model onto AMDGPU machine code
2236(see :ref:`memmodel`). *The implementation is WIP.*
2237
2238.. TODO
2239 Update when implementation complete.
2240
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002241 Support more relaxed OpenCL memory model to be controlled by environment
Tony Tyef16a45e2017-06-06 20:31:59 +00002242 component of target triple.
2243
2244The AMDGPU backend supports the memory synchronization scopes specified in
2245:ref:`amdgpu-memory-scopes`.
2246
2247The code sequences used to implement the memory model are defined in table
2248:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2249
2250The sequences specify the order of instructions that a single thread must
2251execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2252to other memory instructions executed by the same thread. This allows them to be
2253moved earlier or later which can allow them to be combined with other instances
2254of the same instruction, or hoisted/sunk out of loops to improve
2255performance. Only the instructions related to the memory model are given;
2256additional ``s_waitcnt`` instructions are required to ensure registers are
2257defined before being used. These may be able to be combined with the memory
2258model ``s_waitcnt`` instructions as described above.
2259
2260The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
2261OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
2262relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
2263memory model which has separate happens-before relations for the global and
2264local address spaces, and only a fence specifying both global and local address
2265space joins the relationships. Since the LLVM ``memfence`` instruction does not
2266allow an address space to be specified the OpenCL fence has to convervatively
2267assume both local and global address space was specified. However, optimizations
2268can often be done to eliminate the additional ``s_waitcnt``instructions when
2269there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
2270instructions. The code sequences in the table indicate what can be omitted for
2271the OpenCL memory. The target triple environment is used to determine if the
2272source language is OpenCL (see :ref:`amdgpu-opencl`).
2273
2274``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2275operations.
2276
2277``buffer/global/flat_load/store/atomic`` instructions to global memory are
2278termed vector memory operations.
2279
2280For GFX6-GFX9:
2281
2282* Each agent has multiple compute units (CU).
2283* Each CU has multiple SIMDs that execute wavefronts.
2284* The wavefronts for a single work-group are executed in the same CU but may be
2285 executed by different SIMDs.
2286* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2287 executing on it.
2288* All LDS operations of a CU are performed as wavefront wide operations in a
2289 global order and involve no caching. Completion is reported to a wavefront in
2290 execution order.
2291* The LDS memory has multiple request queues shared by the SIMDs of a
2292 CU. Therefore, the LDS operations performed by different waves of a work-group
2293 can be reordered relative to each other, which can result in reordering the
2294 visibility of vector memory operations with respect to LDS operations of other
2295 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002296 ensure synchronization between LDS operations and vector memory operations
Tony Tyef16a45e2017-06-06 20:31:59 +00002297 between waves of a work-group, but not between operations performed by the
2298 same wavefront.
2299* The vector memory operations are performed as wavefront wide operations and
2300 completion is reported to a wavefront in execution order. The exception is
2301 that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2302 vector memory order if they access LDS memory, and out of LDS operation order
2303 if they access global memory.
2304* The vector memory operations access a vector L1 cache shared by all wavefronts
2305 on a CU. Therefore, no special action is required for coherence between
2306 wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
2307 coherence between waves executing in different work-groups as they may be
2308 executing on different CUs.
2309* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2310 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2311 scalar operations are used in a restricted way so do not impact the memory
2312 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2313* The vector and scalar memory operations use an L2 cache shared by all CUs on
2314 the same agent.
2315* The L2 cache has independent channels to service disjoint ranges of virtual
2316 addresses.
2317* Each CU has a separate request queue per channel. Therefore, the vector and
2318 scalar memory operations performed by waves executing in different work-groups
2319 (which may be executing on different CUs) of an agent can be reordered
2320 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002321 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002322 previous vector memory operation has completed before executing a subsequent
2323 vector memory or LDS operation and so can be used to meet the requirements of
2324 acquire and release.
2325* The L2 cache can be kept coherent with other agents on some targets, or ranges
2326 of virtual addresses can be set up to bypass it to ensure system coherence.
2327
2328Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2329or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2330memory, atomic memory orderings are not meaningful and all accesses are treated
2331as non-atomic.
2332
2333Constant address space uses ``buffer/global_load`` instructions (or equivalent
2334scalar memory instructions). Since the constant address space contents do not
2335change during the execution of a kernel dispatch it is not legal to perform
2336stores, and atomic memory orderings are not meaningful and all access are
2337treated as non-atomic.
2338
2339A memory synchronization scope wider than work-group is not meaningful for the
2340group (LDS) address space and is treated as work-group.
2341
2342The memory model does not support the region address space which is treated as
2343non-atomic.
2344
2345Acquire memory ordering is not meaningful on store atomic instructions and is
2346treated as non-atomic.
2347
2348Release memory ordering is not meaningful on load atomic instructions and is
2349treated a non-atomic.
2350
2351Acquire-release memory ordering is not meaningful on load or store atomic
2352instructions and is treated as acquire and release respectively.
2353
2354AMDGPU backend only uses scalar memory operations to access memory that is
2355proven to not change during the execution of the kernel dispatch. This includes
2356constant address space and global address space for program scope const
2357variables. Therefore the kernel machine code does not have to maintain the
2358scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2359and vector L1 caches are invalidated between kernel dispatches by CP since
2360constant address space data may change between kernel dispatch executions. See
2361:ref:`amdgpu-amdhsa-memory-spaces`.
2362
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002363The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002364case the AMDGPU backend ensures the memory location used to spill is never
2365accessed by vector memory operations at the same time. If scalar writes are used
2366then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2367return since the locations may be used for vector memory instructions by a
2368future wave that uses the same scratch area, or a function call that creates a
2369frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2370as all scalar writes are write-before-read in the same thread.
2371
2372Scratch backing memory (which is used for the private address space) is accessed
2373with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
2374is only accessed by a single thread, and is always write-before-read,
2375there is never a need to invalidate these entries from the L1 cache. Hence all
2376cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
2377lines.
2378
2379On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2380to invalidate the L2 cache. This also causes it to be treated as non-volatile
2381and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
2382coherent) and so the L2 cache will coherent with the CPU and other agents.
2383
2384 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2385 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2386
2387 ============ ============ ============== ========== =======================
2388 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2389 Ordering Sync Scope Address
2390 Space
2391 ============ ============ ============== ========== =======================
2392 **Non-Atomic**
2393 ---------------------------------------------------------------------------
2394 load *none* *none* - global non-volatile
2395 - generic 1. buffer/global/flat_load
2396 volatile
2397 1. buffer/global/flat_load
2398 glc=1
2399 load *none* *none* - local 1. ds_load
2400 store *none* *none* - global 1. buffer/global/flat_store
2401 - generic
2402 store *none* *none* - local 1. ds_store
2403 **Unordered Atomic**
2404 ---------------------------------------------------------------------------
2405 load atomic unordered *any* *any* *Same as non-atomic*.
2406 store atomic unordered *any* *any* *Same as non-atomic*.
2407 atomicrmw unordered *any* *any* *Same as monotonic
2408 atomic*.
2409 **Monotonic Atomic**
2410 ---------------------------------------------------------------------------
2411 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2412 - wavefront - generic
2413 - workgroup
2414 load atomic monotonic - singlethread - local 1. ds_load
2415 - wavefront
2416 - workgroup
2417 load atomic monotonic - agent - global 1. buffer/global/flat_load
2418 - system - generic glc=1
2419 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2420 - wavefront - generic
2421 - workgroup
2422 - agent
2423 - system
2424 store atomic monotonic - singlethread - local 1. ds_store
2425 - wavefront
2426 - workgroup
2427 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2428 - wavefront - generic
2429 - workgroup
2430 - agent
2431 - system
2432 atomicrmw monotonic - singlethread - local 1. ds_atomic
2433 - wavefront
2434 - workgroup
2435 **Acquire Atomic**
2436 ---------------------------------------------------------------------------
2437 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2438 - wavefront - local
2439 - generic
2440 load atomic acquire - workgroup - global 1. buffer/global_load
2441 load atomic acquire - workgroup - local 1. ds/flat_load
2442 - generic 2. s_waitcnt lgkmcnt(0)
2443
2444 - If OpenCL, omit
2445 waitcnt.
2446 - Must happen before
2447 any following
2448 global/generic
2449 load/load
2450 atomic/store/store
2451 atomic/atomicrmw.
2452 - Ensures any
2453 following global
2454 data read is no
2455 older than the load
2456 atomic value being
2457 acquired.
2458
2459 load atomic acquire - agent - global 1. buffer/global_load
2460 - system glc=1
2461 2. s_waitcnt vmcnt(0)
2462
2463 - Must happen before
2464 following
2465 buffer_wbinvl1_vol.
2466 - Ensures the load
2467 has completed
2468 before invalidating
2469 the cache.
2470
2471 3. buffer_wbinvl1_vol
2472
2473 - Must happen before
2474 any following
2475 global/generic
2476 load/load
2477 atomic/atomicrmw.
2478 - Ensures that
2479 following
2480 loads will not see
2481 stale global data.
2482
2483 load atomic acquire - agent - generic 1. flat_load glc=1
2484 - system 2. s_waitcnt vmcnt(0) &
2485 lgkmcnt(0)
2486
2487 - If OpenCL omit
2488 lgkmcnt(0).
2489 - Must happen before
2490 following
2491 buffer_wbinvl1_vol.
2492 - Ensures the flat_load
2493 has completed
2494 before invalidating
2495 the cache.
2496
2497 3. buffer_wbinvl1_vol
2498
2499 - Must happen before
2500 any following
2501 global/generic
2502 load/load
2503 atomic/atomicrmw.
2504 - Ensures that
2505 following loads
2506 will not see stale
2507 global data.
2508
2509 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2510 - wavefront - local
2511 - generic
2512 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
2513 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
2514 - generic 2. waitcnt lgkmcnt(0)
2515
2516 - If OpenCL, omit
2517 waitcnt.
2518 - Must happen before
2519 any following
2520 global/generic
2521 load/load
2522 atomic/store/store
2523 atomic/atomicrmw.
2524 - Ensures any
2525 following global
2526 data read is no
2527 older than the
2528 atomicrmw value
2529 being acquired.
2530
2531 atomicrmw acquire - agent - global 1. buffer/global_atomic
2532 - system 2. s_waitcnt vmcnt(0)
2533
2534 - Must happen before
2535 following
2536 buffer_wbinvl1_vol.
2537 - Ensures the
2538 atomicrmw has
2539 completed before
2540 invalidating the
2541 cache.
2542
2543 3. buffer_wbinvl1_vol
2544
2545 - Must happen before
2546 any following
2547 global/generic
2548 load/load
2549 atomic/atomicrmw.
2550 - Ensures that
2551 following loads
2552 will not see stale
2553 global data.
2554
2555 atomicrmw acquire - agent - generic 1. flat_atomic
2556 - system 2. s_waitcnt vmcnt(0) &
2557 lgkmcnt(0)
2558
2559 - If OpenCL, omit
2560 lgkmcnt(0).
2561 - Must happen before
2562 following
2563 buffer_wbinvl1_vol.
2564 - Ensures the
2565 atomicrmw has
2566 completed before
2567 invalidating the
2568 cache.
2569
2570 3. buffer_wbinvl1_vol
2571
2572 - Must happen before
2573 any following
2574 global/generic
2575 load/load
2576 atomic/atomicrmw.
2577 - Ensures that
2578 following loads
2579 will not see stale
2580 global data.
2581
2582 fence acquire - singlethread *none* *none*
2583 - wavefront
2584 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2585
2586 - If OpenCL and
2587 address space is
2588 not generic, omit
2589 waitcnt. However,
2590 since LLVM
2591 currently has no
2592 address space on
2593 the fence need to
2594 conservatively
2595 always generate. If
2596 fence had an
2597 address space then
2598 set to address
2599 space of OpenCL
2600 fence flag, or to
2601 generic if both
2602 local and global
2603 flags are
2604 specified.
2605 - Must happen after
2606 any preceding
2607 local/generic load
2608 atomic/atomicrmw
2609 with an equal or
2610 wider sync scope
2611 and memory ordering
2612 stronger than
2613 unordered (this is
2614 termed the
2615 fence-paired-atomic).
2616 - Must happen before
2617 any following
2618 global/generic
2619 load/load
2620 atomic/store/store
2621 atomic/atomicrmw.
2622 - Ensures any
2623 following global
2624 data read is no
2625 older than the
2626 value read by the
2627 fence-paired-atomic.
2628
2629 fence acquire - agent *none* 1. s_waitcnt vmcnt(0) &
2630 - system lgkmcnt(0)
2631
2632 - If OpenCL and
2633 address space is
2634 not generic, omit
2635 lgkmcnt(0).
2636 However, since LLVM
2637 currently has no
2638 address space on
2639 the fence need to
2640 conservatively
2641 always generate
2642 (see comment for
2643 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002644 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002645 separate s_waitcnt
2646 vmcnt(0) and
2647 s_waitcnt
2648 lgkmcnt(0) to allow
2649 them to be
2650 independently moved
2651 according to the
2652 following rules.
2653 - s_waitcnt vmcnt(0)
2654 must happen after
2655 any preceding
2656 global/generic load
2657 atomic/atomicrmw
2658 with an equal or
2659 wider sync scope
2660 and memory ordering
2661 stronger than
2662 unordered (this is
2663 termed the
2664 fence-paired-atomic).
2665 - s_waitcnt lgkmcnt(0)
2666 must happen after
2667 any preceding
2668 group/generic load
2669 atomic/atomicrmw
2670 with an equal or
2671 wider sync scope
2672 and memory ordering
2673 stronger than
2674 unordered (this is
2675 termed the
2676 fence-paired-atomic).
2677 - Must happen before
2678 the following
2679 buffer_wbinvl1_vol.
2680 - Ensures that the
2681 fence-paired atomic
2682 has completed
2683 before invalidating
2684 the
2685 cache. Therefore
2686 any following
2687 locations read must
2688 be no older than
2689 the value read by
2690 the
2691 fence-paired-atomic.
2692
2693 2. buffer_wbinvl1_vol
2694
2695 - Must happen before
2696 any following global/generic
2697 load/load
2698 atomic/store/store
2699 atomic/atomicrmw.
2700 - Ensures that
2701 following loads
2702 will not see stale
2703 global data.
2704
2705 **Release Atomic**
2706 ---------------------------------------------------------------------------
2707 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2708 - wavefront - local
2709 - generic
2710 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2711 - generic
2712 - If OpenCL, omit
2713 waitcnt.
2714 - Must happen after
2715 any preceding
2716 local/generic
2717 load/store/load
2718 atomic/store
2719 atomic/atomicrmw.
2720 - Must happen before
2721 the following
2722 store.
2723 - Ensures that all
2724 memory operations
2725 to local have
2726 completed before
2727 performing the
2728 store that is being
2729 released.
2730
2731 2. buffer/global/flat_store
2732 store atomic release - workgroup - local 1. ds_store
2733 store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
2734 - system - generic lgkmcnt(0)
2735
2736 - If OpenCL, omit
2737 lgkmcnt(0).
2738 - Could be split into
2739 separate s_waitcnt
2740 vmcnt(0) and
2741 s_waitcnt
2742 lgkmcnt(0) to allow
2743 them to be
2744 independently moved
2745 according to the
2746 following rules.
2747 - s_waitcnt vmcnt(0)
2748 must happen after
2749 any preceding
2750 global/generic
2751 load/store/load
2752 atomic/store
2753 atomic/atomicrmw.
2754 - s_waitcnt lgkmcnt(0)
2755 must happen after
2756 any preceding
2757 local/generic
2758 load/store/load
2759 atomic/store
2760 atomic/atomicrmw.
2761 - Must happen before
2762 the following
2763 store.
2764 - Ensures that all
2765 memory operations
2766 to global have
2767 completed before
2768 performing the
2769 store that is being
2770 released.
2771
2772 2. buffer/global/ds/flat_store
2773 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2774 - wavefront - local
2775 - generic
2776 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2777 - generic
2778 - If OpenCL, omit
2779 waitcnt.
2780 - Must happen after
2781 any preceding
2782 local/generic
2783 load/store/load
2784 atomic/store
2785 atomic/atomicrmw.
2786 - Must happen before
2787 the following
2788 atomicrmw.
2789 - Ensures that all
2790 memory operations
2791 to local have
2792 completed before
2793 performing the
2794 atomicrmw that is
2795 being released.
2796
2797 2. buffer/global/flat_atomic
2798 atomicrmw release - workgroup - local 1. ds_atomic
2799 atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
2800 - system - generic lgkmcnt(0)
2801
2802 - If OpenCL, omit
2803 lgkmcnt(0).
2804 - Could be split into
2805 separate s_waitcnt
2806 vmcnt(0) and
2807 s_waitcnt
2808 lgkmcnt(0) to allow
2809 them to be
2810 independently moved
2811 according to the
2812 following rules.
2813 - s_waitcnt vmcnt(0)
2814 must happen after
2815 any preceding
2816 global/generic
2817 load/store/load
2818 atomic/store
2819 atomic/atomicrmw.
2820 - s_waitcnt lgkmcnt(0)
2821 must happen after
2822 any preceding
2823 local/generic
2824 load/store/load
2825 atomic/store
2826 atomic/atomicrmw.
2827 - Must happen before
2828 the following
2829 atomicrmw.
2830 - Ensures that all
2831 memory operations
2832 to global and local
2833 have completed
2834 before performing
2835 the atomicrmw that
2836 is being released.
2837
2838 2. buffer/global/ds/flat_atomic*
2839 fence release - singlethread *none* *none*
2840 - wavefront
2841 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2842
2843 - If OpenCL and
2844 address space is
2845 not generic, omit
2846 waitcnt. However,
2847 since LLVM
2848 currently has no
2849 address space on
2850 the fence need to
2851 conservatively
2852 always generate
2853 (see comment for
2854 previous fence).
2855 - Must happen after
2856 any preceding
2857 local/generic
2858 load/load
2859 atomic/store/store
2860 atomic/atomicrmw.
2861 - Must happen before
2862 any following store
2863 atomic/atomicrmw
2864 with an equal or
2865 wider sync scope
2866 and memory ordering
2867 stronger than
2868 unordered (this is
2869 termed the
2870 fence-paired-atomic).
2871 - Ensures that all
2872 memory operations
2873 to local have
2874 completed before
2875 performing the
2876 following
2877 fence-paired-atomic.
2878
2879 fence release - agent *none* 1. s_waitcnt vmcnt(0) &
2880 - system lgkmcnt(0)
2881
2882 - If OpenCL and
2883 address space is
2884 not generic, omit
2885 lgkmcnt(0).
2886 However, since LLVM
2887 currently has no
2888 address space on
2889 the fence need to
2890 conservatively
2891 always generate
2892 (see comment for
2893 previous fence).
2894 - Could be split into
2895 separate s_waitcnt
2896 vmcnt(0) and
2897 s_waitcnt
2898 lgkmcnt(0) to allow
2899 them to be
2900 independently moved
2901 according to the
2902 following rules.
2903 - s_waitcnt vmcnt(0)
2904 must happen after
2905 any preceding
2906 global/generic
2907 load/store/load
2908 atomic/store
2909 atomic/atomicrmw.
2910 - s_waitcnt lgkmcnt(0)
2911 must happen after
2912 any preceding
2913 local/generic
2914 load/store/load
2915 atomic/store
2916 atomic/atomicrmw.
2917 - Must happen before
2918 any following store
2919 atomic/atomicrmw
2920 with an equal or
2921 wider sync scope
2922 and memory ordering
2923 stronger than
2924 unordered (this is
2925 termed the
2926 fence-paired-atomic).
2927 - Ensures that all
2928 memory operations
2929 to global have
2930 completed before
2931 performing the
2932 following
2933 fence-paired-atomic.
2934
2935 **Acquire-Release Atomic**
2936 ---------------------------------------------------------------------------
2937 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
2938 - wavefront - local
2939 - generic
2940 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
2941
2942 - If OpenCL, omit
2943 waitcnt.
2944 - Must happen after
2945 any preceding
2946 local/generic
2947 load/store/load
2948 atomic/store
2949 atomic/atomicrmw.
2950 - Must happen before
2951 the following
2952 atomicrmw.
2953 - Ensures that all
2954 memory operations
2955 to local have
2956 completed before
2957 performing the
2958 atomicrmw that is
2959 being released.
2960
2961 2. buffer/global_atomic
2962 atomicrmw acq_rel - workgroup - local 1. ds_atomic
2963 2. s_waitcnt lgkmcnt(0)
2964
2965 - If OpenCL, omit
2966 waitcnt.
2967 - Must happen before
2968 any following
2969 global/generic
2970 load/load
2971 atomic/store/store
2972 atomic/atomicrmw.
2973 - Ensures any
2974 following global
2975 data read is no
2976 older than the load
2977 atomic value being
2978 acquired.
2979
2980 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2981
2982 - If OpenCL, omit
2983 waitcnt.
2984 - Must happen after
2985 any preceding
2986 local/generic
2987 load/store/load
2988 atomic/store
2989 atomic/atomicrmw.
2990 - Must happen before
2991 the following
2992 atomicrmw.
2993 - Ensures that all
2994 memory operations
2995 to local have
2996 completed before
2997 performing the
2998 atomicrmw that is
2999 being released.
3000
3001 2. flat_atomic
3002 3. s_waitcnt lgkmcnt(0)
3003
3004 - If OpenCL, omit
3005 waitcnt.
3006 - Must happen before
3007 any following
3008 global/generic
3009 load/load
3010 atomic/store/store
3011 atomic/atomicrmw.
3012 - Ensures any
3013 following global
3014 data read is no
3015 older than the load
3016 atomic value being
3017 acquired.
3018 atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
3019 - system lgkmcnt(0)
3020
3021 - If OpenCL, omit
3022 lgkmcnt(0).
3023 - Could be split into
3024 separate s_waitcnt
3025 vmcnt(0) and
3026 s_waitcnt
3027 lgkmcnt(0) to allow
3028 them to be
3029 independently moved
3030 according to the
3031 following rules.
3032 - s_waitcnt vmcnt(0)
3033 must happen after
3034 any preceding
3035 global/generic
3036 load/store/load
3037 atomic/store
3038 atomic/atomicrmw.
3039 - s_waitcnt lgkmcnt(0)
3040 must happen after
3041 any preceding
3042 local/generic
3043 load/store/load
3044 atomic/store
3045 atomic/atomicrmw.
3046 - Must happen before
3047 the following
3048 atomicrmw.
3049 - Ensures that all
3050 memory operations
3051 to global have
3052 completed before
3053 performing the
3054 atomicrmw that is
3055 being released.
3056
3057 2. buffer/global_atomic
3058 3. s_waitcnt vmcnt(0)
3059
3060 - Must happen before
3061 following
3062 buffer_wbinvl1_vol.
3063 - Ensures the
3064 atomicrmw has
3065 completed before
3066 invalidating the
3067 cache.
3068
3069 4. buffer_wbinvl1_vol
3070
3071 - Must happen before
3072 any following
3073 global/generic
3074 load/load
3075 atomic/atomicrmw.
3076 - Ensures that
3077 following loads
3078 will not see stale
3079 global data.
3080
3081 atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
3082 - system lgkmcnt(0)
3083
3084 - If OpenCL, omit
3085 lgkmcnt(0).
3086 - Could be split into
3087 separate s_waitcnt
3088 vmcnt(0) and
3089 s_waitcnt
3090 lgkmcnt(0) to allow
3091 them to be
3092 independently moved
3093 according to the
3094 following rules.
3095 - s_waitcnt vmcnt(0)
3096 must happen after
3097 any preceding
3098 global/generic
3099 load/store/load
3100 atomic/store
3101 atomic/atomicrmw.
3102 - s_waitcnt lgkmcnt(0)
3103 must happen after
3104 any preceding
3105 local/generic
3106 load/store/load
3107 atomic/store
3108 atomic/atomicrmw.
3109 - Must happen before
3110 the following
3111 atomicrmw.
3112 - Ensures that all
3113 memory operations
3114 to global have
3115 completed before
3116 performing the
3117 atomicrmw that is
3118 being released.
3119
3120 2. flat_atomic
3121 3. s_waitcnt vmcnt(0) &
3122 lgkmcnt(0)
3123
3124 - If OpenCL, omit
3125 lgkmcnt(0).
3126 - Must happen before
3127 following
3128 buffer_wbinvl1_vol.
3129 - Ensures the
3130 atomicrmw has
3131 completed before
3132 invalidating the
3133 cache.
3134
3135 4. buffer_wbinvl1_vol
3136
3137 - Must happen before
3138 any following
3139 global/generic
3140 load/load
3141 atomic/atomicrmw.
3142 - Ensures that
3143 following loads
3144 will not see stale
3145 global data.
3146
3147 fence acq_rel - singlethread *none* *none*
3148 - wavefront
3149 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3150
3151 - If OpenCL and
3152 address space is
3153 not generic, omit
3154 waitcnt. However,
3155 since LLVM
3156 currently has no
3157 address space on
3158 the fence need to
3159 conservatively
3160 always generate
3161 (see comment for
3162 previous fence).
3163 - Must happen after
3164 any preceding
3165 local/generic
3166 load/load
3167 atomic/store/store
3168 atomic/atomicrmw.
3169 - Must happen before
3170 any following
3171 global/generic
3172 load/load
3173 atomic/store/store
3174 atomic/atomicrmw.
3175 - Ensures that all
3176 memory operations
3177 to local have
3178 completed before
3179 performing any
3180 following global
3181 memory operations.
3182 - Ensures that the
3183 preceding
3184 local/generic load
3185 atomic/atomicrmw
3186 with an equal or
3187 wider sync scope
3188 and memory ordering
3189 stronger than
3190 unordered (this is
3191 termed the
3192 fence-paired-atomic)
3193 has completed
3194 before following
3195 global memory
3196 operations. This
3197 satisfies the
3198 requirements of
3199 acquire.
3200 - Ensures that all
3201 previous memory
3202 operations have
3203 completed before a
3204 following
3205 local/generic store
3206 atomic/atomicrmw
3207 with an equal or
3208 wider sync scope
3209 and memory ordering
3210 stronger than
3211 unordered (this is
3212 termed the
3213 fence-paired-atomic).
3214 This satisfies the
3215 requirements of
3216 release.
3217
3218 fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) &
3219 - system lgkmcnt(0)
3220
3221 - If OpenCL and
3222 address space is
3223 not generic, omit
3224 lgkmcnt(0).
3225 However, since LLVM
3226 currently has no
3227 address space on
3228 the fence need to
3229 conservatively
3230 always generate
3231 (see comment for
3232 previous fence).
3233 - Could be split into
3234 separate s_waitcnt
3235 vmcnt(0) and
3236 s_waitcnt
3237 lgkmcnt(0) to allow
3238 them to be
3239 independently moved
3240 according to the
3241 following rules.
3242 - s_waitcnt vmcnt(0)
3243 must happen after
3244 any preceding
3245 global/generic
3246 load/store/load
3247 atomic/store
3248 atomic/atomicrmw.
3249 - s_waitcnt lgkmcnt(0)
3250 must happen after
3251 any preceding
3252 local/generic
3253 load/store/load
3254 atomic/store
3255 atomic/atomicrmw.
3256 - Must happen before
3257 the following
3258 buffer_wbinvl1_vol.
3259 - Ensures that the
3260 preceding
3261 global/local/generic
3262 load
3263 atomic/atomicrmw
3264 with an equal or
3265 wider sync scope
3266 and memory ordering
3267 stronger than
3268 unordered (this is
3269 termed the
3270 fence-paired-atomic)
3271 has completed
3272 before invalidating
3273 the cache. This
3274 satisfies the
3275 requirements of
3276 acquire.
3277 - Ensures that all
3278 previous memory
3279 operations have
3280 completed before a
3281 following
3282 global/local/generic
3283 store
3284 atomic/atomicrmw
3285 with an equal or
3286 wider sync scope
3287 and memory ordering
3288 stronger than
3289 unordered (this is
3290 termed the
3291 fence-paired-atomic).
3292 This satisfies the
3293 requirements of
3294 release.
3295
3296 2. buffer_wbinvl1_vol
3297
3298 - Must happen before
3299 any following
3300 global/generic
3301 load/load
3302 atomic/store/store
3303 atomic/atomicrmw.
3304 - Ensures that
3305 following loads
3306 will not see stale
3307 global data. This
3308 satisfies the
3309 requirements of
3310 acquire.
3311
3312 **Sequential Consistent Atomic**
3313 ---------------------------------------------------------------------------
3314 load atomic seq_cst - singlethread - global *Same as corresponding
3315 - wavefront - local load atomic acquire*.
3316 - workgroup - generic
3317 load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
3318 - system - local
3319 - generic - Must happen after
3320 preceding
3321 global/generic load
3322 atomic/store
3323 atomic/atomicrmw
3324 with memory
3325 ordering of seq_cst
3326 and with equal or
3327 wider sync scope.
3328 (Note that seq_cst
3329 fences have their
3330 own s_waitcnt
3331 vmcnt(0) and so do
3332 not need to be
3333 considered.)
3334 - Ensures any
3335 preceding
3336 sequential
3337 consistent global
3338 memory instructions
3339 have completed
3340 before executing
3341 this sequentially
3342 consistent
3343 instruction. This
3344 prevents reordering
3345 a seq_cst store
3346 followed by a
3347 seq_cst load (Note
3348 that seq_cst is
3349 stronger than
3350 acquire/release as
3351 the reordering of
3352 load acquire
3353 followed by a store
3354 release is
3355 prevented by the
3356 waitcnt vmcnt(0) of
3357 the release, but
3358 there is nothing
3359 preventing a store
3360 release followed by
3361 load acquire from
3362 competing out of
3363 order.)
3364
3365 2. *Following
3366 instructions same as
3367 corresponding load
3368 atomic acquire*.
3369
3370 store atomic seq_cst - singlethread - global *Same as corresponding
3371 - wavefront - local store atomic release*.
3372 - workgroup - generic
3373 store atomic seq_cst - agent - global *Same as corresponding
3374 - system - generic store atomic release*.
3375 atomicrmw seq_cst - singlethread - global *Same as corresponding
3376 - wavefront - local atomicrmw acq_rel*.
3377 - workgroup - generic
3378 atomicrmw seq_cst - agent - global *Same as corresponding
3379 - system - generic atomicrmw acq_rel*.
3380 fence seq_cst - singlethread *none* *Same as corresponding
3381 - wavefront fence acq_rel*.
3382 - workgroup
3383 - agent
3384 - system
3385 ============ ============ ============== ========== =======================
3386
3387The memory order also adds the single thread optimization constrains defined in
3388table
3389:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3390
3391 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3392 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3393
3394 ============ ==============================================================
3395 LLVM Memory Optimization Constraints
3396 Ordering
3397 ============ ==============================================================
3398 unordered *none*
3399 monotonic *none*
3400 acquire - If a load atomic/atomicrmw then no following load/load
3401 atomic/store/ store atomic/atomicrmw/fence instruction can
3402 be moved before the acquire.
3403 - If a fence then same as load atomic, plus no preceding
3404 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003405 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003406 atomic/store/ store atomic/atomicrmw/fence instruction can
3407 be moved after the release.
3408 - If a fence then same as store atomic, plus no following
3409 associated fence-paired-atomic can be moved before the
3410 fence.
3411 acq_rel Same constraints as both acquire and release.
3412 seq_cst - If a load atomic then same constraints as acquire, plus no
3413 preceding sequentially consistent load atomic/store
3414 atomic/atomicrmw/fence instruction can be moved after the
3415 seq_cst.
3416 - If a store atomic then the same constraints as release, plus
3417 no following sequentially consistent load atomic/store
3418 atomic/atomicrmw/fence instruction can be moved before the
3419 seq_cst.
3420 - If an atomicrmw/fence then same constraints as acq_rel.
3421 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003422
Wei Ding16289cf2017-02-21 18:48:01 +00003423Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003424~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003425
Tony Tyef16a45e2017-06-06 20:31:59 +00003426For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3427(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3428the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003429
Tony Tyef16a45e2017-06-06 20:31:59 +00003430 .. table:: AMDGPU Trap Handler for AMDHSA OS
3431 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003432
Tony Tyef16a45e2017-06-06 20:31:59 +00003433 =================== =============== =============== =======================
3434 Usage Code Sequence Trap Handler Description
3435 Inputs
3436 =================== =============== =============== =======================
3437 reserved ``s_trap 0x00`` Reserved by hardware.
3438 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3439 ``queue_ptr`` ``debugtrap``
3440 ``VGPR0``: intrinsic (not
3441 ``arg`` implemented).
3442 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3443 ``queue_ptr`` terminated and its
3444 associated queue put
3445 into the error state.
3446 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3447 ``queue_ptr`` installed handled
3448 same as ``llvm.trap``.
3449 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3450 breakpoints.
3451 debugger ``s_trap 0x08`` Reserved for debugger.
3452 debugger ``s_trap 0xfe`` Reserved for debugger.
3453 debugger ``s_trap 0xff`` Reserved for debugger.
3454 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003455
Tony Tye46d35762017-08-15 20:47:41 +00003456Unspecified OS
3457--------------
3458
3459This section provides code conventions used when the target triple OS is
3460empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003461
3462Trap Handler ABI
3463~~~~~~~~~~~~~~~~
3464
3465For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3466not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3467instructions are handled as follows:
3468
3469 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3470 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3471
3472 =============== =============== ===========================================
3473 Usage Code Sequence Description
3474 =============== =============== ===========================================
3475 llvm.trap s_endpgm Causes wavefront to be terminated.
3476 llvm.debugtrap *none* Compiler warning given that there is no
3477 trap handler installed.
3478 =============== =============== ===========================================
3479
3480Source Languages
3481================
3482
3483.. _amdgpu-opencl:
3484
3485OpenCL
3486------
3487
3488When generating code for the OpenCL language the target triple environment
3489should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3490
3491When the language is OpenCL the following differences occur:
3492
34931. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
34942. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye46d35762017-08-15 20:47:41 +000034953. Additional metadata is generated
3496 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003497
3498.. TODO
3499 Specify what affect this has. Hidden arguments added. Additional metadata
3500 generated.
3501
3502.. _amdgpu-hcc:
3503
3504HCC
3505---
3506
3507When generating code for the OpenCL language the target triple environment
3508should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3509
3510When the language is OpenCL the following differences occur:
3511
35121. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3513
3514.. TODO
3515 Specify what affect this has.
Tom Stellard3ec09e62016-04-06 01:29:19 +00003516
Tom Stellard45bb48e2015-06-13 03:28:10 +00003517Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00003518---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00003519
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003520AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef16a45e2017-06-06 20:31:59 +00003521It supports AMDGCN GFX6-GFX8.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003522
Tony Tyef16a45e2017-06-06 20:31:59 +00003523This section describes general syntax for instructions and operands. For more
3524information about instructions, their semantics and supported combinations of
3525operands, refer to one of instruction set architecture manuals
Tony Tye46d35762017-08-15 20:47:41 +00003526[AMD-Souther-Islands]_, [AMD-Sea-Islands]_, [AMD-Volcanic-Islands]_ and
3527[AMD-Vega]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003528
Tony Tyef16a45e2017-06-06 20:31:59 +00003529An instruction has the following syntax (register operands are normally
3530comma-separated while extra operands are space-separated):
Tom Stellard45bb48e2015-06-13 03:28:10 +00003531
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003532*<opcode> <register_operand0>, ... <extra_operand0> ...*
Tom Stellard45bb48e2015-06-13 03:28:10 +00003533
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003534Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00003535~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00003536
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003537The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003538
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003539* SGPR registers: s0, ... or s[0], ...
3540* VGPR registers: v0, ... or v[0], ...
3541* TTMP registers: ttmp0, ... or ttmp[0], ...
3542* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3543* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3544* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3545* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3546* Register index expressions: v[2*2], s[1-1:2-1]
3547* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00003548
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003549The following extra operands are supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003550
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003551* offset, offset0, offset1
3552* idxen, offen bits
3553* glc, slc, tfe bits
3554* waitcnt: integer or combination of counter values
3555* VOP3 modifiers:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003556
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003557 - abs (\| \|), neg (\-)
Tom Stellard45bb48e2015-06-13 03:28:10 +00003558
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003559* DPP modifiers:
3560
3561 - row_shl, row_shr, row_ror, row_rol
3562 - row_mirror, row_half_mirror, row_bcast
3563 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3564 - row_mask, bank_mask, bound_ctrl
3565
3566* SDWA modifiers:
3567
3568 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3569 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3570 - abs, neg, sext
3571
Tony Tyef16a45e2017-06-06 20:31:59 +00003572Instruction Examples
3573~~~~~~~~~~~~~~~~~~~~
3574
3575DS
3576~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003577
3578.. code-block:: nasm
3579
3580 ds_add_u32 v2, v4 offset:16
3581 ds_write_src2_b64 v2 offset0:4 offset1:8
3582 ds_cmpst_f32 v2, v4, v6
3583 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3584
3585
3586For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3587
Tony Tyef16a45e2017-06-06 20:31:59 +00003588FLAT
3589++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003590
3591.. code-block:: nasm
3592
3593 flat_load_dword v1, v[3:4]
3594 flat_store_dwordx3 v[3:4], v[5:7]
3595 flat_atomic_swap v1, v[3:4], v5 glc
3596 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3597 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3598
3599For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3600
Tony Tyef16a45e2017-06-06 20:31:59 +00003601MUBUF
3602+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003603
3604.. code-block:: nasm
3605
3606 buffer_load_dword v1, off, s[4:7], s1
3607 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3608 buffer_store_format_xy v[1:2], off, s[4:7], s1
3609 buffer_wbinvl1
3610 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3611
3612For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3613
Tony Tyef16a45e2017-06-06 20:31:59 +00003614SMRD/SMEM
3615+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003616
3617.. code-block:: nasm
3618
3619 s_load_dword s1, s[2:3], 0xfc
3620 s_load_dwordx8 s[8:15], s[2:3], s4
3621 s_load_dwordx16 s[88:103], s[2:3], s4
3622 s_dcache_inv_vol
3623 s_memtime s[4:5]
3624
3625For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3626
Tony Tyef16a45e2017-06-06 20:31:59 +00003627SOP1
3628++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003629
3630.. code-block:: nasm
3631
3632 s_mov_b32 s1, s2
3633 s_mov_b64 s[0:1], 0x80000000
3634 s_cmov_b32 s1, 200
3635 s_wqm_b64 s[2:3], s[4:5]
3636 s_bcnt0_i32_b64 s1, s[2:3]
3637 s_swappc_b64 s[2:3], s[4:5]
3638 s_cbranch_join s[4:5]
3639
3640For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3641
Tony Tyef16a45e2017-06-06 20:31:59 +00003642SOP2
3643++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003644
3645.. code-block:: nasm
3646
3647 s_add_u32 s1, s2, s3
3648 s_and_b64 s[2:3], s[4:5], s[6:7]
3649 s_cselect_b32 s1, s2, s3
3650 s_andn2_b32 s2, s4, s6
3651 s_lshr_b64 s[2:3], s[4:5], s6
3652 s_ashr_i32 s2, s4, s6
3653 s_bfm_b64 s[2:3], s4, s6
3654 s_bfe_i64 s[2:3], s[4:5], s6
3655 s_cbranch_g_fork s[4:5], s[6:7]
3656
3657For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3658
Tony Tyef16a45e2017-06-06 20:31:59 +00003659SOPC
3660++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003661
3662.. code-block:: nasm
3663
3664 s_cmp_eq_i32 s1, s2
3665 s_bitcmp1_b32 s1, s2
3666 s_bitcmp0_b64 s[2:3], s4
3667 s_setvskip s3, s5
3668
3669For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3670
Tony Tyef16a45e2017-06-06 20:31:59 +00003671SOPP
3672++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003673
3674.. code-block:: nasm
3675
3676 s_barrier
3677 s_nop 2
3678 s_endpgm
3679 s_waitcnt 0 ; Wait for all counters to be 0
3680 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3681 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3682 s_sethalt 9
3683 s_sleep 10
3684 s_sendmsg 0x1
3685 s_sendmsg sendmsg(MSG_INTERRUPT)
3686 s_trap 1
3687
3688For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3689
3690Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00003691of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00003692range or acceptable values.
3693
Tony Tyef16a45e2017-06-06 20:31:59 +00003694VALU
3695++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00003696
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003697For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3698the assembler will automatically use optimal encoding based on its operands.
3699To force specific encoding, one can add a suffix to the opcode of the instruction:
3700
3701* _e32 for 32-bit VOP1/VOP2/VOPC
3702* _e64 for 64-bit VOP3
3703* _dpp for VOP_DPP
3704* _sdwa for VOP_SDWA
3705
3706VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003707
3708.. code-block:: nasm
3709
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003710 v_mov_b32 v1, v2
3711 v_mov_b32_e32 v1, v2
3712 v_nop
3713 v_cvt_f64_i32_e32 v[1:2], v2
3714 v_floor_f32_e32 v1, v2
3715 v_bfrev_b32_e32 v1, v2
3716 v_add_f32_e32 v1, v2, v3
3717 v_mul_i32_i24_e64 v1, v2, 3
3718 v_mul_i32_i24_e32 v1, -3, v3
3719 v_mul_i32_i24_e32 v1, -100, v3
3720 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3721 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00003722
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003723VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003724
3725.. code-block:: nasm
3726
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003727 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3728 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3729 v_mov_b32 v0, v0 wave_shl:1
3730 v_mov_b32 v0, v0 row_mirror
3731 v_mov_b32 v0, v0 row_bcast:31
3732 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3733 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3734 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00003735
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003736VOP_SDWA examples:
3737
3738.. code-block:: nasm
3739
3740 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
3741 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
3742 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
3743 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
3744 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
3745
3746For full list of supported instructions, refer to "Vector ALU instructions".
3747
3748HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00003749~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003750
3751AMDGPU ABI defines auxiliary data in output code object. In assembly source,
3752one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00003753
3754.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00003755+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00003756
3757*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003758object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00003759
3760.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00003761+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3762
Tom Stellard347ac792015-06-26 21:15:07 +00003763
3764*major*, *minor*, and *stepping* are all integers that describe the instruction
3765set architecture (ISA) version of the assembly program.
3766
3767*vendor* and *arch* are quoted strings. *vendor* should always be equal to
3768"AMD" and *arch* should always be equal to "AMDGPU".
3769
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003770By default, the assembler will derive the ISA version, *vendor*, and *arch*
3771from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00003772
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003773.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00003774+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003775
3776This directives specifies that the symbol with given name is a kernel entry point
3777(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00003778
3779.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00003780++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00003781
3782This directive marks the beginning of a list of key / value pairs that are used
3783to specify the amd_kernel_code_t object that will be emitted by the assembler.
3784The list must be terminated by the *.end_amd_kernel_code_t* directive. For
3785any amd_kernel_code_t values that are unspecified a default value will be
3786used. The default value for all keys is 0, with the following exceptions:
3787
3788- *kernel_code_version_major* defaults to 1.
3789- *machine_kind* defaults to 1.
3790- *machine_version_major*, *machine_version_minor*, and
3791 *machine_version_stepping* are derived from the value of the -mcpu option
3792 that is passed to the assembler.
3793- *kernel_code_entry_byte_offset* defaults to 256.
3794- *wavefront_size* defaults to 6.
3795- *kernarg_segment_alignment*, *group_segment_alignment*, and
3796 *private_segment_alignment* default to 4. Note that alignments are specified
3797 as a power of two, so a value of **n** means an alignment of 2^ **n**.
3798
3799The *.amd_kernel_code_t* directive must be placed immediately after the
3800function label and before any instructions.
3801
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003802For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
3803comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00003804
3805Here is an example of a minimal amd_kernel_code_t specification:
3806
Aaron Ballman887ad0e2016-07-19 17:46:55 +00003807.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00003808
3809 .hsa_code_object_version 1,0
3810 .hsa_code_object_isa
3811
Tom Stellardb8a91bb2016-02-22 18:36:00 +00003812 .hsatext
3813 .globl hello_world
3814 .p2align 8
3815 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00003816
3817 hello_world:
3818
3819 .amd_kernel_code_t
3820 enable_sgpr_kernarg_segment_ptr = 1
3821 is_ptr64 = 1
3822 compute_pgm_rsrc1_vgprs = 0
3823 compute_pgm_rsrc1_sgprs = 0
3824 compute_pgm_rsrc2_user_sgpr = 2
3825 kernarg_segment_byte_size = 8
3826 wavefront_sgpr_count = 2
3827 workitem_vgpr_count = 3
3828 .end_amd_kernel_code_t
3829
3830 s_load_dwordx2 s[0:1], s[0:1] 0x0
3831 v_mov_b32 v0, 3.14159
3832 s_waitcnt lgkmcnt(0)
3833 v_mov_b32 v1, s0
3834 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00003835 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00003836 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00003837 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00003838 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00003839
3840Additional Documentation
3841========================
3842
3843.. [AMD-R6xx] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
3844.. [AMD-R7xx] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
3845.. [AMD-Evergreen] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
3846.. [AMD-Cayman-Trinity] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
3847.. [AMD-Souther-Islands] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
3848.. [AMD-Sea-Islands] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
3849.. [AMD-Volcanic-Islands] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
Tony Tye46d35762017-08-15 20:47:41 +00003850.. [AMD-Vega] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00003851.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
3852.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
3853.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
3854.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
3855.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
3856.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
3857.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
3858.. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
3859.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
3860.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
3861.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__