blob: ac8c0bd884fb8122f352a1655a9185fb319c54ed [file] [log] [blame]
Tony Tyef16a45e2017-06-06 20:31:59 +00001=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
26 .. table:: AMDGPU Target Triples
27 :name: amdgpu-target-triples-table
28
29 ============ ======== ========= ===========
30 Architecture Vendor OS Environment
31 ============ ======== ========= ===========
32 r600 amd <empty> <empty>
33 amdgcn amd <empty> <empty>
34 amdgcn amd amdhsa <empty>
35 amdgcn amd amdhsa opencl
36 amdgcn amd amdhsa amdgizcl
37 amdgcn amd amdhsa amdgiz
38 amdgcn amd amdhsa hcc
39 ============ ======== ========= ===========
40
41``r600-amd--``
42 Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
43 the MESA runtime.
44
45``amdgcn-amd--``
Tony Tye46d35762017-08-15 20:47:41 +000046 Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
Tony Tyef16a45e2017-06-06 20:31:59 +000047 the MESA runtime.
48
49``amdgcn-amd-amdhsa-``
50 Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
51 compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
52
53``amdgcn-amd-amdhsa-opencl``
54 Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
55 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
56 :ref:`amdgpu-opencl`.
57
58``amdgcn-amd-amdhsa-amdgizcl``
59 Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
60 is used (see :ref:`amdgpu-address-spaces`).
61
62``amdgcn-amd-amdhsa-amdgiz``
63 Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
64 used (see :ref:`amdgpu-address-spaces`).
65
66``amdgcn-amd-amdhsa-hcc``
67 Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
68 executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
69 :ref:`amdgpu-hcc`.
70
71.. _amdgpu-processors:
72
73Processors
74----------
75
76Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
77names from both the *Processor* and *Alternative Processor* can be used.
78
79 .. table:: AMDGPU Processors
80 :name: amdgpu-processors-table
81
82 ========== =========== ============ ===== ======= ==================
83 Processor Alternative Target dGPU/ Runtime Example
84 Processor Triple APU Support Products
85 Architecture
86 ========== =========== ============ ===== ======= ==================
87 **R600** [AMD-R6xx]_
88 --------------------------------------------------------------------
89 r600 r600 dGPU
90 r630 r600 dGPU
91 rs880 r600 dGPU
92 rv670 r600 dGPU
93 **R700** [AMD-R7xx]_
94 --------------------------------------------------------------------
95 rv710 r600 dGPU
96 rv730 r600 dGPU
97 rv770 r600 dGPU
98 **Evergreen** [AMD-Evergreen]_
99 --------------------------------------------------------------------
100 cedar r600 dGPU
101 redwood r600 dGPU
102 sumo r600 dGPU
103 juniper r600 dGPU
104 cypress r600 dGPU
105 **Northern Islands** [AMD-Cayman-Trinity]_
106 --------------------------------------------------------------------
107 barts r600 dGPU
108 turks r600 dGPU
109 caicos r600 dGPU
110 cayman r600 dGPU
111 **GCN GFX6 (Southern Islands (SI))** [AMD-Souther-Islands]_
112 --------------------------------------------------------------------
Konstantin Zhuravlyov6cbcb272017-08-08 04:28:31 +0000113 gfx600 - tahiti amdgcn dGPU
Tony Tyef16a45e2017-06-06 20:31:59 +0000114 gfx601 - pitcairn amdgcn dGPU
115 - verde
116 - oland
117 - hainan
118 **GCN GFX7 (Sea Islands (CI))** [AMD-Sea-Islands]_
119 --------------------------------------------------------------------
120 gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
121 - Radeon HD 8770
122 - R7 260
123 - R7 260X
124 \ - kaveri amdgcn APU - A6-7000
125 - A6 Pro-7050B
126 - A8-7100
127 - A8 Pro-7150B
128 - A10-7300
129 - A10 Pro-7350B
130 - FX-7500
131 - A8-7200P
132 - A10-7400P
133 - FX-7600P
134 gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
135 - FirePro W9100
136 - FirePro S9150
137 - FirePro S9170
Tony Tye1fd77b22017-06-08 01:47:25 +0000138 gfx702 dGPU ROCm - Radeon R9 290
Tony Tyef16a45e2017-06-06 20:31:59 +0000139 - Radeon R9 290x
140 - Radeon R390
Tony Tye1fd77b22017-06-08 01:47:25 +0000141 - Radeon R390x
142 gfx703 - kabini amdgcn APU - E1-2100
Tony Tyef16a45e2017-06-06 20:31:59 +0000143 - mullins - E1-2200
144 - E1-2500
145 - E2-3000
146 - E2-3800
147 - A4-5000
148 - A4-5100
149 - A6-5200
150 - A4 Pro-3340B
151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-Volcanic-Islands]_
152 --------------------------------------------------------------------
153 gfx800 - iceland amdgcn dGPU - FirePro S7150
154 - FirePro S7100
155 - FirePro W7100
156 - Radeon R285
157 - Radeon R9 380
158 - Radeon R9 385
159 - Mobile FirePro
160 M7170
161 gfx801 - carrizo amdgcn APU - A6-8500P
162 - Pro A6-8500B
163 - A8-8600P
164 - Pro A8-8600B
165 - FX-8800P
166 - Pro A12-8800B
167 \ amdgcn APU ROCm - A10-8700P
168 - Pro A10-8700B
169 - A10-8780P
170 \ amdgcn APU - A10-9600P
171 - A10-9630P
172 - A12-9700P
173 - A12-9730P
174 - FX-9800P
175 - FX-9830P
176 \ amdgcn APU - E2-9010
177 - A6-9210
178 - A9-9410
179 gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
180 gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
181 - Radeon R9 Fury
182 - Radeon R9 FuryX
183 - Radeon Pro Duo
184 - FirePro S9300x2
Tony Tye46d35762017-08-15 20:47:41 +0000185 - Radeon Instinct MI8
Tony Tyef16a45e2017-06-06 20:31:59 +0000186 \ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
187 - Radeon RX 480
Tony Tye46d35762017-08-15 20:47:41 +0000188 - Radeon Instinct MI6
Tony Tyef16a45e2017-06-06 20:31:59 +0000189 \ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
190 gfx804 amdgcn dGPU Same as gfx803
191 gfx810 - stoney amdgcn APU
Tony Tye46d35762017-08-15 20:47:41 +0000192 **GCN GFX9** [AMD-Vega]_
Tony Tyef16a45e2017-06-06 20:31:59 +0000193 --------------------------------------------------------------------
Tony Tye46d35762017-08-15 20:47:41 +0000194 gfx900 amdgcn dGPU - Radeon Vega
195 Frontier Edition
196 - Radeon RX Vega 56
197 - Radeon RX Vega 64
198 - Radeon RX Vega 64
199 Liquid
200 - Radeon Instinct MI25
Tony Tyef16a45e2017-06-06 20:31:59 +0000201 gfx901 amdgcn dGPU ROCm Same as gfx900
202 except XNACK is
203 enabled
204 gfx902 amdgcn APU *TBA*
205
206 .. TODO
207 Add product
208 names.
209 gfx903 amdgcn APU Same as gfx902
210 except XNACK is
211 enabled
212 ========== =========== ============ ===== ======= ==================
213
214.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000215
216Address Spaces
217--------------
218
Tony Tyef16a45e2017-06-06 20:31:59 +0000219The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000220
Tony Tyef16a45e2017-06-06 20:31:59 +0000221The memory space names used in the table, aside from the region memory space, is
222from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000223
Tony Tyef16a45e2017-06-06 20:31:59 +0000224LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000225
Tony Tyef16a45e2017-06-06 20:31:59 +0000226 .. table:: Address Space Mapping
227 :name: amdgpu-address-space-mapping-table
228
229 ================== ================= ================= ================= =================
230 LLVM Address Space Memory Space
231 ------------------ -----------------------------------------------------------------------
232 \ Current Default amdgiz/amdgizcl hcc Future Default
233 ================== ================= ================= ================= =================
234 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
235 1 Global Global Global Global
236 2 Constant Constant Constant Region (GDS)
237 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
238 4 Generic (Flat) Region (GDS) Region (GDS) Constant
239 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
240 ================== ================= ================= ================= =================
241
242Current Default
243 This is the current default address space mapping used for all languages
244 except hcc. This will shortly be deprecated.
245
246amdgiz/amdgizcl
247 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
248 is specified as the target triple environment value.
249
250hcc
251 This is the current address space mapping used when ``hcc`` is specified as
252 the target triple environment value.This will shortly be deprecated.
253
254Future Default
255 This will shortly be the only address space mapping for all languages using
256 AMDGPU backend.
257
258.. _amdgpu-memory-scopes:
259
260Memory Scopes
261-------------
262
263This section provides LLVM memory synchronization scopes supported by the AMDGPU
264backend memory model when the target triple OS is ``amdhsa`` (see
265:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
266
267The memory model supported is based on the HSA memory model [HSA]_ which is
268based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
269relation is transitive over the synchonizes-with relation independent of scope,
270and synchonizes-with allows the memory scope instances to be inclusive (see
271table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
272
273This is different to the OpenCL [OpenCL]_ memory model which does not have scope
274inclusion and requires the memory scopes to exactly match. However, this
275is conservatively correct for OpenCL.
276
277 .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
278 :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
279
280 ================ ==========================================================
281 LLVM Sync Scope Description
282 ================ ==========================================================
283 *none* The default: ``system``.
284
285 Synchronizes with, and participates in modification and
286 seq_cst total orderings with, other operations (except
287 image operations) for all address spaces (except private,
288 or generic that accesses private) provided the other
289 operation's sync scope is:
290
291 - ``system``.
292 - ``agent`` and executed by a thread on the same agent.
293 - ``workgroup`` and executed by a thread in the same
294 workgroup.
295 - ``wavefront`` and executed by a thread in the same
296 wavefront.
297
298 ``agent`` Synchronizes with, and participates in modification and
299 seq_cst total orderings with, other operations (except
300 image operations) for all address spaces (except private,
301 or generic that accesses private) provided the other
302 operation's sync scope is:
303
304 - ``system`` or ``agent`` and executed by a thread on the
305 same agent.
306 - ``workgroup`` and executed by a thread in the same
307 workgroup.
308 - ``wavefront`` and executed by a thread in the same
309 wavefront.
310
311 ``workgroup`` Synchronizes with, and participates in modification and
312 seq_cst total orderings with, other operations (except
313 image operations) for all address spaces (except private,
314 or generic that accesses private) provided the other
315 operation's sync scope is:
316
317 - ``system``, ``agent`` or ``workgroup`` and executed by a
318 thread in the same workgroup.
319 - ``wavefront`` and executed by a thread in the same
320 wavefront.
321
322 ``wavefront`` Synchronizes with, and participates in modification and
323 seq_cst total orderings with, other operations (except
324 image operations) for all address spaces (except private,
325 or generic that accesses private) provided the other
326 operation's sync scope is:
327
328 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
329 and executed by a thread in the same wavefront.
330
331 ``singlethread`` Only synchronizes with, and participates in modification
332 and seq_cst total orderings with, other operations (except
333 image operations) running in the same thread for all
334 address spaces (for example, in signal handlers).
335 ================ ==========================================================
336
337AMDGPU Intrinsics
338-----------------
339
340The AMDGPU backend implements the following intrinsics.
341
342*This section is WIP.*
343
344.. TODO
345 List AMDGPU intrinsics
346
347Code Object
348===========
349
350The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
351can be linked by ``lld`` to produce a standard ELF shared code object which can
352be loaded and executed on an AMDGPU target.
353
354Header
355------
356
357The AMDGPU backend uses the following ELF header:
358
359 .. table:: AMDGPU ELF Header
360 :name: amdgpu-elf-header-table
361
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000362 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000363 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000364 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000365 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
366 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000367 ``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA`` or
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000368 ``ELFOSABI_AMDGPU_PAL`` or
369 ``ELFOSABI_AMDGPU_MESA3D``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000370 ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA`` or
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000371 ``ELFABIVERSION_AMDGPU_PAL`` or
372 ``ELFABIVERSION_AMDGPU_MESA3D``
Tony Tyef16a45e2017-06-06 20:31:59 +0000373 ``e_type`` ``ET_REL`` or ``ET_DYN``
374 ``e_machine`` ``EM_AMDGPU``
375 ``e_entry`` 0
376 ``e_flags`` 0
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000377 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000378
379..
380
381 .. table:: AMDGPU ELF Header Enumeration Values
382 :name: amdgpu-elf-header-enumeration-values-table
383
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000384 =============================== =====
385 Name Value
386 =============================== =====
387 ``EM_AMDGPU`` 224
388 ``ELFOSABI_AMDGPU_HSA`` 64
389 ``ELFOSABI_AMDGPU_PAL`` 65
390 ``ELFOSABI_AMDGPU_MESA3D`` 66
391 ``ELFABIVERSION_AMDGPU_HSA`` 1
392 ``ELFABIVERSION_AMDGPU_PAL`` 0
393 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
394 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000395
396``e_ident[EI_CLASS]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000397 The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64
398 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000399
400``e_ident[EI_DATA]``
401 All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
402
403``e_ident[EI_OSABI]``
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000404 One of the following AMD GPU architecture specific OS ABIs:
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000405
406 * ``ELFOSABI_AMDGPU_HSA`` is used to specify that the code object conforms to
407 the AMD HSA runtime ABI [HSA]_.
408
409 * ``ELFOSABI_AMDGPU_PAL`` is used to specify that the code object conforms to
410 the AMD PAL runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000411
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000412 * ``ELFOSABI_AMDGPU_MESA3D`` is used to specify that the code object conforms
413 to the AMD MESA runtime ABI.
414
Tony Tyef16a45e2017-06-06 20:31:59 +0000415``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000416 The ABI version of the AMD GPU architecture specific OS ABI to which the code
417 object conforms:
418
419 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
420 runtime ABI.
421
422 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
423 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000424
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000425 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
426 runtime ABI.
427
Tony Tyef16a45e2017-06-06 20:31:59 +0000428``e_type``
429 Can be one of the following values:
430
431
432 ``ET_REL``
433 The type produced by the AMD GPU backend compiler as it is relocatable code
434 object.
435
436 ``ET_DYN``
437 The type produced by the linker as it is a shared code object.
438
439 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
440
441``e_machine``
442 The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
443 architecture family. The specific member is specified in the
444 ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
445 :ref:`amdgpu-note-records`).
446
447``e_entry``
448 The entry point is 0 as the entry points for individual kernels must be
449 selected in order to invoke them through AQL packets.
450
451``e_flags``
452 The value is 0 as no flags are used.
453
454Sections
455--------
456
457An AMDGPU target ELF code object has the standard ELF sections which include:
458
459 .. table:: AMDGPU ELF Sections
460 :name: amdgpu-elf-sections-table
461
462 ================== ================ =================================
463 Name Type Attributes
464 ================== ================ =================================
465 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
466 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
467 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
468 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
469 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
470 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
471 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
472 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
473 ``.note`` ``SHT_NOTE`` *none*
474 ``.rela``\ *name* ``SHT_RELA`` *none*
475 ``.rela.dyn`` ``SHT_RELA`` *none*
476 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
477 ``.shstrtab`` ``SHT_STRTAB`` *none*
478 ``.strtab`` ``SHT_STRTAB`` *none*
479 ``.symtab`` ``SHT_SYMTAB`` *none*
480 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
481 ================== ================ =================================
482
483These sections have their standard meanings (see [ELF]_) and are only generated
484if needed.
485
486``.debug``\ *\**
487 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
488 DWARF produced by the AMDGPU backend.
489
Tony Tye46d35762017-08-15 20:47:41 +0000490``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000491 The standard sections used by a dynamic loader.
492
493``.note``
494 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
495 backend.
496
497``.rela``\ *name*, ``.rela.dyn``
498 For relocatable code objects, *name* is the name of the section that the
499 relocation records apply. For example, ``.rela.text`` is the section name for
500 relocation records associated with the ``.text`` section.
501
502 For linked shared code objects, ``.rela.dyn`` contains all the relocation
503 records from each of the relocatable code object's ``.rela``\ *name* sections.
504
505 See :ref:`amdgpu-relocation-records` for the relocation records supported by
506 the AMDGPU backend.
507
508``.text``
509 The executable machine code for the kernels and functions they call. Generated
510 as position independent code. See :ref:`amdgpu-code-conventions` for
511 information on conventions used in the isa generation.
512
513.. _amdgpu-note-records:
514
515Note Records
516------------
517
518As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
519the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
520minimal zero byte padding must be generated to ensure the ``desc`` field size is
521a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
522be at least 4 to indicate at least 8 byte alignment.
523
524The AMDGPU backend code object uses the following ELF note records in the
525``.note`` section. The *Description* column specifies the layout of the note
526record’s ``desc`` field. All fields are consecutive bytes. Note records with
527variable size strings have a corresponding ``*_size`` field that specifies the
528number of bytes, including the terminating null character, in the string. The
529string(s) come immediately after the preceding fields.
530
531Additional note records can be present.
532
533 .. table:: AMDGPU ELF Note Records
534 :name: amdgpu-elf-note-records-table
535
Tony Tye46d35762017-08-15 20:47:41 +0000536 ===== ============================== ======================================
537 Name Type Description
538 ===== ============================== ======================================
539 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
540 "AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
541 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000542
543..
544
545 .. table:: AMDGPU ELF Note Record Enumeration Values
546 :name: amdgpu-elf-note-record-enumeration-values-table
547
Tony Tye46d35762017-08-15 20:47:41 +0000548 ============================== =====
549 Name Value
550 ============================== =====
551 *reserved* 0-9
552 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
553 ``NT_AMD_AMDGPU_ISA`` 11
554 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000555
556``NT_AMD_AMDGPU_ISA``
557 Specifies the instruction set architecture used by the machine code contained
558 in the code object.
559
560 This note record is required for code objects containing machine code for
561 processors matching the ``amdgcn`` architecture in table
562 :ref:`amdgpu-processors`.
563
564 The null terminated string has the following syntax:
565
566 *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
567
568 where:
569
570 *architecture*
571 The architecture from table :ref:`amdgpu-target-triples-table`.
572
573 This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
574 :ref:`amdgpu-target-triples`).
575
576 *vendor*
577 The vendor from table :ref:`amdgpu-target-triples-table`.
578
579 For the AMDGPU backend this is always ``amd``.
580
581 *os*
582 The OS from table :ref:`amdgpu-target-triples-table`.
583
584 *environment*
585 An environment from table :ref:`amdgpu-target-triples-table`, or blank if
586 the environment has no affect on the execution of the code object.
587
588 For the AMDGPU backend this is currently always blank.
589 *processor*
590 The processor from table :ref:`amdgpu-processors-table`.
591
592 For example:
593
594 ``amdgcn-amd-amdhsa--gfx901``
595
Tony Tye46d35762017-08-15 20:47:41 +0000596``NT_AMD_AMDGPU_HSA_METADATA``
597 Specifies extensible metadata associated with the code objects executed on HSA
598 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
599 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
600 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
601 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000602
Tony Tye46d35762017-08-15 20:47:41 +0000603.. _amdgpu-symbols:
604
605Symbols
606-------
607
608Symbols include the following:
609
610 .. table:: AMDGPU ELF Symbols
611 :name: amdgpu-elf-symbols-table
612
613 ===================== ============== ============= ==================
614 Name Type Section Description
615 ===================== ============== ============= ==================
616 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
617 - ``.rodata``
618 - ``.bss``
619 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
620 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
621 ===================== ============== ============= ==================
622
623Global variable
624 Global variables both used and defined by the compilation unit.
625
626 If the symbol is defined in the compilation unit then it is allocated in the
627 appropriate section according to if it has initialized data or is readonly.
628
629 If the symbol is external then its section is ``STN_UNDEF`` and the loader
630 will resolve relocations using the definition provided by another code object
631 or explicitly defined by the runtime.
632
633 All global symbols, whether defined in the compilation unit or external, are
634 accessed by the machine code indirectly through a GOT table entry. This
635 allows them to be preemptable. The GOT table is only supported when the target
636 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000637
638 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000639 Add description of linked shared object symbols. Seems undefined symbols
640 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000641
Tony Tye46d35762017-08-15 20:47:41 +0000642Kernel descriptor
643 Every HSA kernel has an associated kernel descriptor. It is the address of the
644 kernel descriptor that is used in the AQL dispatch packet used to invoke the
645 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
646 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
647
648Kernel entry point
649 Every HSA kernel also has a symbol for its machine code entry point.
650
651.. _amdgpu-relocation-records:
652
653Relocation Records
654------------------
655
656AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
657relocatable fields are:
658
659``word32``
660 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
661 alignment. These values use the same byte order as other word values in the
662 AMD GPU architecture.
663
664``word64``
665 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
666 alignment. These values use the same byte order as other word values in the
667 AMD GPU architecture.
668
669Following notations are used for specifying relocation calculations:
670
671**A**
672 Represents the addend used to compute the value of the relocatable field.
673
674**G**
675 Represents the offset into the global offset table at which the relocation
676 entry’s symbol will reside during execution.
677
678**GOT**
679 Represents the address of the global offset table.
680
681**P**
682 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
683 of the storage unit being relocated (computed using ``r_offset``).
684
685**S**
686 Represents the value of the symbol whose index resides in the relocation
687 entry.
688
689The following relocation types are supported:
690
691 .. table:: AMDGPU ELF Relocation Records
692 :name: amdgpu-elf-relocation-records-table
693
694 ========================== ===== ========== ==============================
695 Relocation Type Value Field Calculation
696 ========================== ===== ========== ==============================
697 ``R_AMDGPU_NONE`` 0 *none* *none*
698 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
699 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
700 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
701 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
702 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
703 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
704 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
705 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
706 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
707 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
708 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
709 ========================== ===== ========== ==============================
710
711.. _amdgpu-dwarf:
712
713DWARF
714-----
715
716Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
717information that maps the code object executable code and data to the source
718language constructs. It can be used by tools such as debuggers and profilers.
719
720Address Space Mapping
721~~~~~~~~~~~~~~~~~~~~~
722
723The following address space mapping is used:
724
725 .. table:: AMDGPU DWARF Address Space Mapping
726 :name: amdgpu-dwarf-address-space-mapping-table
727
728 =================== =================
729 DWARF Address Space Memory Space
730 =================== =================
731 1 Private (Scratch)
732 2 Local (group/LDS)
733 *omitted* Global
734 *omitted* Constant
735 *omitted* Generic (Flat)
736 *not supported* Region (GDS)
737 =================== =================
738
739See :ref:`amdgpu-address-spaces` for information on the memory space terminology
740used in the table.
741
742An ``address_class`` attribute is generated on pointer type DIEs to specify the
743DWARF address space of the value of the pointer when it is in the *private* or
744*local* address space. Otherwise the attribute is omitted.
745
746An ``XDEREF`` operation is generated in location list expressions for variables
747that are allocated in the *private* and *local* address space. Otherwise no
748``XDREF`` is omitted.
749
750Register Mapping
751~~~~~~~~~~~~~~~~
752
753*This section is WIP.*
754
755.. TODO
756 Define DWARF register enumeration.
757
758 If want to present a wavefront state then should expose vector registers as
759 64 wide (rather than per work-item view that LLVM uses). Either as separate
760 registers, or a 64x4 byte single register. In either case use a new LANE op
761 (akin to XDREF) to select the current lane usage in a location
762 expression. This would also allow scalar register spilling to vector register
763 lanes to be expressed (currently no debug information is being generated for
764 spilling). If choose a wide single register approach then use LANE in
765 conjunction with PIECE operation to select the dword part of the register for
766 the current lane. If the separate register approach then use LANE to select
767 the register.
768
769Source Text
770~~~~~~~~~~~
771
772*This section is WIP.*
773
774.. TODO
775 DWARF extension to include runtime generated source text.
776
777.. _amdgpu-code-conventions:
778
779Code Conventions
780================
781
782This section provides code conventions used for each supported target triple OS
783(see :ref:`amdgpu-target-triples`).
784
785AMDHSA
786------
787
788This section provides code conventions used when the target triple OS is
789``amdhsa`` (see :ref:`amdgpu-target-triples`).
790
791.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000792
793Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000794~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000795
Tony Tye46d35762017-08-15 20:47:41 +0000796The code object metadata specifies extensible metadata associated with the code
797objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
798[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
799(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
800``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
801information necessary to support the ROCM kernel queries. For example, the
802segment sizes needed in a dispatch packet. In addition, a high level language
803runtime may require other information to be included. For example, the AMD
804OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000805
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000806The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000807:doc:`YamlIO`).
808
Tony Tye46d35762017-08-15 20:47:41 +0000809.. TODO
810 Is the string null terminated? It probably should not if YAML allows it to
811 contain null characters, otherwise it should be.
812
Tony Tyef16a45e2017-06-06 20:31:59 +0000813The metadata is represented as a single YAML document comprised of the mapping
814defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
815referenced tables.
816
817For boolean values, the string values of ``false`` and ``true`` are used for
818false and true respectively.
819
820Additional information can be added to the mappings. To avoid conflicts, any
821non-AMD key names should be prefixed by "*vendor-name*.".
822
823 .. table:: AMDHSA Code Object Metadata Mapping
824 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
825
826 ========== ============== ========= =======================================
827 String Key Value Type Required? Description
828 ========== ============== ========= =======================================
829 "Version" sequence of Required - The first integer is the major
830 2 integers version. Currently 1.
831 - The second integer is the minor
832 version. Currently 0.
833 "Printf" sequence of Each string is encoded information
834 strings about a printf function call. The
835 encoded information is organized as
836 fields separated by colon (':'):
837
838 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
839
840 where:
841
842 ``ID``
843 A 32 bit integer as a unique id for
844 each printf function call
845
846 ``N``
847 A 32 bit integer equal to the number
848 of arguments of printf function call
849 minus 1
850
851 ``S[i]`` (where i = 0, 1, ... , N-1)
852 32 bit integers for the size in bytes
853 of the i-th FormatString argument of
854 the printf function call
855
856 FormatString
857 The format string passed to the
858 printf function call.
859 "Kernels" sequence of Required Sequence of the mappings for each
860 mapping kernel in the code object. See
861 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
862 for the definition of the mapping.
863 ========== ============== ========= =======================================
864
865..
866
867 .. table:: AMDHSA Code Object Kernel Metadata Mapping
868 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
869
870 ================= ============== ========= ================================
871 String Key Value Type Required? Description
872 ================= ============== ========= ================================
873 "Name" string Required Source name of the kernel.
874 "SymbolName" string Required Name of the kernel
875 descriptor ELF symbol.
876 "Language" string Source language of the kernel.
877 Values include:
878
879 - "OpenCL C"
880 - "OpenCL C++"
881 - "HCC"
882 - "OpenMP"
883
884 "LanguageVersion" sequence of - The first integer is the major
885 2 integers version.
886 - The second integer is the
887 minor version.
888 "Attrs" mapping Mapping of kernel attributes.
889 See
890 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
891 for the mapping definition.
892 "Arguments" sequence of Sequence of mappings of the
893 mapping kernel arguments. See
894 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
895 for the definition of the mapping.
896 "CodeProps" mapping Mapping of properties related to
897 the kernel code. See
898 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
899 for the mapping definition.
900 "DebugProps" mapping Mapping of properties related to
901 the kernel debugging. See
902 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
903 for the mapping definition.
904 ================= ============== ========= ================================
905
906..
907
908 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
909 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
910
911 =================== ============== ========= ==============================
912 String Key Value Type Required? Description
913 =================== ============== ========= ==============================
914 "ReqdWorkGroupSize" sequence of The dispatch work-group size
915 3 integers X, Y, Z must correspond to the
916 specified values.
917
918 Corresponds to the OpenCL
919 ``reqd_work_group_size``
920 attribute.
921 "WorkGroupSizeHint" sequence of The dispatch work-group size
922 3 integers X, Y, Z is likely to be the
923 specified values.
924
925 Corresponds to the OpenCL
926 ``work_group_size_hint``
927 attribute.
928 "VecTypeHint" string The name of a scalar or vector
929 type.
930
931 Corresponds to the OpenCL
932 ``vec_type_hint`` attribute.
933 =================== ============== ========= ==============================
934
935..
936
937 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
938 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
939
940 ================= ============== ========= ================================
941 String Key Value Type Required? Description
942 ================= ============== ========= ================================
943 "Name" string Kernel argument name.
944 "TypeName" string Kernel argument type name.
945 "Size" integer Required Kernel argument size in bytes.
946 "Align" integer Required Kernel argument alignment in
947 bytes. Must be a power of two.
948 "ValueKind" string Required Kernel argument kind that
949 specifies how to set up the
950 corresponding argument.
951 Values include:
952
953 "ByValue"
954 The argument is copied
955 directly into the kernarg.
956
957 "GlobalBuffer"
958 A global address space pointer
959 to the buffer data is passed
960 in the kernarg.
961
962 "DynamicSharedPointer"
963 A group address space pointer
964 to dynamically allocated LDS
965 is passed in the kernarg.
966
967 "Sampler"
968 A global address space
969 pointer to a S# is passed in
970 the kernarg.
971
972 "Image"
973 A global address space
974 pointer to a T# is passed in
975 the kernarg.
976
977 "Pipe"
978 A global address space pointer
979 to an OpenCL pipe is passed in
980 the kernarg.
981
982 "Queue"
983 A global address space pointer
984 to an OpenCL device enqueue
985 queue is passed in the
986 kernarg.
987
988 "HiddenGlobalOffsetX"
989 The OpenCL grid dispatch
990 global offset for the X
991 dimension is passed in the
992 kernarg.
993
994 "HiddenGlobalOffsetY"
995 The OpenCL grid dispatch
996 global offset for the Y
997 dimension is passed in the
998 kernarg.
999
1000 "HiddenGlobalOffsetZ"
1001 The OpenCL grid dispatch
1002 global offset for the Z
1003 dimension is passed in the
1004 kernarg.
1005
1006 "HiddenNone"
1007 An argument that is not used
1008 by the kernel. Space needs to
1009 be left for it, but it does
1010 not need to be set up.
1011
1012 "HiddenPrintfBuffer"
1013 A global address space pointer
1014 to the runtime printf buffer
1015 is passed in kernarg.
1016
1017 "HiddenDefaultQueue"
1018 A global address space pointer
1019 to the OpenCL device enqueue
1020 queue that should be used by
1021 the kernel by default is
1022 passed in the kernarg.
1023
1024 "HiddenCompletionAction"
1025 *TBD*
1026
1027 .. TODO
1028 Add description.
1029
1030 "ValueType" string Required Kernel argument value type. Only
1031 present if "ValueKind" is
1032 "ByValue". For vector data
1033 types, the value is for the
1034 element type. Values include:
1035
1036 - "Struct"
1037 - "I8"
1038 - "U8"
1039 - "I16"
1040 - "U16"
1041 - "F16"
1042 - "I32"
1043 - "U32"
1044 - "F32"
1045 - "I64"
1046 - "U64"
1047 - "F64"
1048
1049 .. TODO
1050 How can it be determined if a
1051 vector type, and what size
1052 vector?
1053 "PointeeAlign" integer Alignment in bytes of pointee
1054 type for pointer type kernel
1055 argument. Must be a power
1056 of 2. Only present if
1057 "ValueKind" is
1058 "DynamicSharedPointer".
1059 "AddrSpaceQual" string Kernel argument address space
1060 qualifier. Only present if
1061 "ValueKind" is "GlobalBuffer" or
1062 "DynamicSharedPointer". Values
1063 are:
1064
1065 - "Private"
1066 - "Global"
1067 - "Constant"
1068 - "Local"
1069 - "Generic"
1070 - "Region"
1071
1072 .. TODO
1073 Is GlobalBuffer only Global
1074 or Constant? Is
1075 DynamicSharedPointer always
1076 Local? Can HCC allow Generic?
1077 How can Private or Region
1078 ever happen?
1079 "AccQual" string Kernel argument access
1080 qualifier. Only present if
1081 "ValueKind" is "Image" or
1082 "Pipe". Values
1083 are:
1084
1085 - "ReadOnly"
1086 - "WriteOnly"
1087 - "ReadWrite"
1088
1089 .. TODO
1090 Does this apply to
1091 GlobalBuffer?
1092 "ActualAcc" string The actual memory accesses
1093 performed by the kernel on the
1094 kernel argument. Only present if
1095 "ValueKind" is "GlobalBuffer",
1096 "Image", or "Pipe". This may be
1097 more restrictive than indicated
1098 by "AccQual" to reflect what the
1099 kernel actual does. If not
1100 present then the runtime must
1101 assume what is implied by
1102 "AccQual" and "IsConst". Values
1103 are:
1104
1105 - "ReadOnly"
1106 - "WriteOnly"
1107 - "ReadWrite"
1108
1109 "IsConst" boolean Indicates if the kernel argument
1110 is const qualified. Only present
1111 if "ValueKind" is
1112 "GlobalBuffer".
1113
1114 "IsRestrict" boolean Indicates if the kernel argument
1115 is restrict qualified. Only
1116 present if "ValueKind" is
1117 "GlobalBuffer".
1118
1119 "IsVolatile" boolean Indicates if the kernel argument
1120 is volatile qualified. Only
1121 present if "ValueKind" is
1122 "GlobalBuffer".
1123
1124 "IsPipe" boolean Indicates if the kernel argument
1125 is pipe qualified. Only present
1126 if "ValueKind" is "Pipe".
1127
1128 .. TODO
1129 Can GlobalBuffer be pipe
1130 qualified?
1131 ================= ============== ========= ================================
1132
1133..
1134
1135 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1136 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1137
1138 ============================ ============== ========= =====================
1139 String Key Value Type Required? Description
1140 ============================ ============== ========= =====================
1141 "KernargSegmentSize" integer Required The size in bytes of
1142 the kernarg segment
1143 that holds the values
1144 of the arguments to
1145 the kernel.
1146 "GroupSegmentFixedSize" integer Required The amount of group
1147 segment memory
1148 required by a
1149 work-group in
1150 bytes. This does not
1151 include any
1152 dynamically allocated
1153 group segment memory
1154 that may be added
1155 when the kernel is
1156 dispatched.
1157 "PrivateSegmentFixedSize" integer Required The amount of fixed
1158 private address space
1159 memory required for a
1160 work-item in
1161 bytes. If
1162 IsDynamicCallstack
1163 is 1 then additional
1164 space must be added
1165 to this value for the
1166 call stack.
1167 "KernargSegmentAlign" integer Required The maximum byte
1168 alignment of
1169 arguments in the
1170 kernarg segment. Must
1171 be a power of 2.
1172 "WavefrontSize" integer Required Wavefront size. Must
1173 be a power of 2.
1174 "NumSGPRs" integer Number of scalar
1175 registers used by a
1176 wavefront for
1177 GFX6-GFX9. This
1178 includes the special
1179 SGPRs for VCC, Flat
1180 Scratch (GFX7-GFX9)
1181 and XNACK (for
1182 GFX8-GFX9). It does
1183 not include the 16
1184 SGPR added if a trap
1185 handler is
1186 enabled. It is not
1187 rounded up to the
1188 allocation
1189 granularity.
1190 "NumVGPRs" integer Number of vector
1191 registers used by
1192 each work-item for
1193 GFX6-GFX9
1194 "MaxFlatWorkgroupSize" integer Maximum flat
1195 work-group size
1196 supported by the
1197 kernel in work-items.
1198 "IsDynamicCallStack" boolean Indicates if the
1199 generated machine
1200 code is using a
1201 dynamically sized
1202 call stack.
1203 "IsXNACKEnabled" boolean Indicates if the
1204 generated machine
1205 code is capable of
1206 supporting XNACK.
1207 ============================ ============== ========= =====================
1208
1209..
1210
1211 .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
1212 :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
1213
1214 =================================== ============== ========= ==============
1215 String Key Value Type Required? Description
1216 =================================== ============== ========= ==============
1217 "DebuggerABIVersion" string
1218 "ReservedNumVGPRs" integer
1219 "ReservedFirstVGPR" integer
1220 "PrivateSegmentBufferSGPR" integer
1221 "WavefrontPrivateSegmentOffsetSGPR" integer
1222 =================================== ============== ========= ==============
1223
1224.. TODO
1225 Plan to remove the debug properties metadata.
1226
Tony Tyef16a45e2017-06-06 20:31:59 +00001227Kernel Dispatch
1228~~~~~~~~~~~~~~~
1229
1230The HSA architected queuing language (AQL) defines a user space memory interface
1231that can be used to control the dispatch of kernels, in an agent independent
1232way. An agent can have zero or more AQL queues created for it using the ROCm
1233runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1234*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1235mechanics and packet layouts.
1236
1237The packet processor of a kernel agent is responsible for detecting and
1238dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1239packet processor is implemented by the hardware command processor (CP),
1240asynchronous dispatch controller (ADC) and shader processor input controller
1241(SPI).
1242
1243The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1244mode driver to initialize and register the AQL queue with CP.
1245
1246To dispatch a kernel the following actions are performed. This can occur in the
1247CPU host program, or from an HSA kernel executing on a GPU.
1248
12491. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1250 executed is obtained.
12512. A pointer to the kernel descriptor (see
1252 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1253 obtained. It must be for a kernel that is contained in a code object that that
1254 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1255 associated.
12563. Space is allocated for the kernel arguments using the ROCm runtime allocator
1257 for a memory region with the kernarg property for the kernel agent that will
1258 execute the kernel. It must be at least 16 byte aligned.
12594. Kernel argument values are assigned to the kernel argument memory
1260 allocation. The layout is defined in the *HSA Programmer’s Language Reference*
1261 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1262 memory in the same way constant memory is accessed. (Note that the HSA
1263 specification allows an implementation to copy the kernel argument contents to
1264 another location that is accessed by the kernel.)
12655. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1266 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1267 packet. The packet must be set up, and the final write must use an atomic
1268 store release to set the packet kind to ensure the packet contents are
1269 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1270 notify the kernel agent that the AQL queue has been updated. These rules, and
1271 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1272 System Architecture Specification* [HSA]_.
12736. A kernel dispatch packet includes information about the actual dispatch,
1274 such as grid and work-group size, together with information from the code
1275 object about the kernel, such as segment sizes. The ROCm runtime queries on
1276 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001277 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000012787. CP executes micro-code and is responsible for detecting and setting up the
1279 GPU to execute the wavefronts of a kernel dispatch.
12808. CP ensures that when the a wavefront starts executing the kernel machine
1281 code, the scalar general purpose registers (SGPR) and vector general purpose
1282 registers (VGPR) are set up as required by the machine code. The required
1283 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1284 register state is defined in
1285 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
12869. The prolog of the kernel machine code (see
1287 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1288 before continuing executing the machine code that corresponds to the kernel.
128910. When the kernel dispatch has completed execution, CP signals the completion
1290 signal specified in the kernel dispatch packet if not 0.
1291
1292.. _amdgpu-amdhsa-memory-spaces:
1293
1294Memory Spaces
1295~~~~~~~~~~~~~
1296
1297The memory space properties are:
1298
1299 .. table:: AMDHSA Memory Spaces
1300 :name: amdgpu-amdhsa-memory-spaces-table
1301
1302 ================= =========== ======== ======= ==================
1303 Memory Space Name HSA Segment Hardware Address NULL Value
1304 Name Name Size
1305 ================= =========== ======== ======= ==================
1306 Private private scratch 32 0x00000000
1307 Local group LDS 32 0xFFFFFFFF
1308 Global global global 64 0x0000000000000000
1309 Constant constant *same as 64 0x0000000000000000
1310 global*
1311 Generic flat flat 64 0x0000000000000000
1312 Region N/A GDS 32 *not implemented
1313 for AMDHSA*
1314 ================= =========== ======== ======= ==================
1315
1316The global and constant memory spaces both use global virtual addresses, which
1317are the same virtual address space used by the CPU. However, some virtual
1318addresses may only be accessible to the CPU, some only accessible by the GPU,
1319and some by both.
1320
1321Using the constant memory space indicates that the data will not change during
1322the execution of the kernel. This allows scalar read instructions to be
1323used. The vector and scalar L1 caches are invalidated of volatile data before
1324each kernel dispatch execution to allow constant memory to change values between
1325kernel dispatches.
1326
1327The local memory space uses the hardware Local Data Store (LDS) which is
1328automatically allocated when the hardware creates work-groups of wavefronts, and
1329freed when all the wavefronts of a work-group have terminated. The data store
1330(DS) instructions can be used to access it.
1331
1332The private memory space uses the hardware scratch memory support. If the kernel
1333uses scratch, then the hardware allocates memory that is accessed using
1334wavefront lane dword (4 byte) interleaving. The mapping used from private
1335address to physical address is:
1336
1337 ``wavefront-scratch-base +
1338 (private-address * wavefront-size * 4) +
1339 (wavefront-lane-id * 4)``
1340
1341There are different ways that the wavefront scratch base address is determined
1342by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1343memory can be accessed in an interleaved manner using buffer instruction with
1344the scratch buffer descriptor and per wave scratch offset, by the scratch
1345instructions, or by flat instructions. If each lane of a wavefront accesses the
1346same private address, the interleaving results in adjacent dwords being accessed
1347and hence requires fewer cache lines to be fetched. Multi-dword access is not
1348supported except by flat and scratch instructions in GFX9.
1349
1350The generic address space uses the hardware flat address support available in
1351GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1352local appertures), that are outside the range of addressible global memory, to
1353map from a flat address to a private or local address.
1354
1355FLAT instructions can take a flat address and access global, private (scratch)
1356and group (LDS) memory depending in if the address is within one of the
1357apperture ranges. Flat access to scratch requires hardware aperture setup and
1358setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1359access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1360(see :ref:`amdgpu-amdhsa-m0`).
1361
1362To convert between a segment address and a flat address the base address of the
1363appertures address can be used. For GFX7-GFX8 these are available in the
1364:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1365Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1366GFX9 the appature base addresses are directly available as inline constant
1367registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1368address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1369which makes it easier to convert from flat to segment or segment to flat.
1370
Tony Tye46d35762017-08-15 20:47:41 +00001371Image and Samplers
1372~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001373
1374Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1375hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1376HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1377enumeration values for the queries that are not trivially deducible from the S#
1378representation.
1379
1380HSA Signals
1381~~~~~~~~~~~
1382
Tony Tye46d35762017-08-15 20:47:41 +00001383HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1384structure allocated in memory accessible from both the CPU and GPU. The
1385structure is defined by the ROCm runtime and subject to change between releases
1386(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001387
1388.. _amdgpu-amdhsa-hsa-aql-queue:
1389
1390HSA AQL Queue
1391~~~~~~~~~~~~~
1392
Tony Tye46d35762017-08-15 20:47:41 +00001393The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001394between releases (see [AMD-ROCm-github]_). For some processors it contains
1395fields needed to implement certain language features such as the flat address
1396aperture bases. It also contains fields used by CP such as managing the
1397allocation of scratch memory.
1398
1399.. _amdgpu-amdhsa-kernel-descriptor:
1400
1401Kernel Descriptor
1402~~~~~~~~~~~~~~~~~
1403
1404A kernel descriptor consists of the information needed by CP to initiate the
1405execution of a kernel, including the entry point address of the machine code
1406that implements the kernel.
1407
1408Kernel Descriptor for GFX6-GFX9
1409+++++++++++++++++++++++++++++++
1410
1411CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1412
1413 .. table:: Kernel Descriptor for GFX6-GFX9
1414 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1415
1416 ======= ======= =============================== ===========================
1417 Bits Size Field Name Description
1418 ======= ======= =============================== ===========================
1419 31:0 4 bytes group_segment_fixed_size The amount of fixed local
1420 address space memory
1421 required for a work-group
1422 in bytes. This does not
1423 include any dynamically
1424 allocated local address
1425 space memory that may be
1426 added when the kernel is
1427 dispatched.
1428 63:32 4 bytes private_segment_fixed_size The amount of fixed
1429 private address space
1430 memory required for a
1431 work-item in bytes. If
1432 is_dynamic_callstack is 1
1433 then additional space must
1434 be added to this value for
1435 the call stack.
1436 95:64 4 bytes max_flat_workgroup_size Maximum flat work-group
1437 size supported by the
1438 kernel in work-items.
1439 96 1 bit is_dynamic_call_stack Indicates if the generated
1440 machine code is using a
1441 dynamically sized call
1442 stack.
1443 97 1 bit is_xnack_enabled Indicates if the generated
1444 machine code is capable of
1445 suppoting XNACK.
1446 127:98 30 bits Reserved. Must be 0.
1447 191:128 8 bytes kernel_code_entry_byte_offset Byte offset (possibly
1448 negative) from base
1449 address of kernel
1450 descriptor to kernel's
1451 entry point instruction
1452 which must be 256 byte
1453 aligned.
1454 383:192 24 Reserved. Must be 0.
1455 bytes
1456 415:384 4 bytes compute_pgm_rsrc1 Compute Shader (CS)
1457 program settings used by
1458 CP to set up
1459 ``COMPUTE_PGM_RSRC1``
1460 configuration
1461 register. See
1462 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
1463 447:416 4 bytes compute_pgm_rsrc2 Compute Shader (CS)
1464 program settings used by
1465 CP to set up
1466 ``COMPUTE_PGM_RSRC2``
1467 configuration
1468 register. See
1469 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1470 448 1 bit enable_sgpr_private_segment Enable the setup of the
1471 _buffer SGPR user data registers
1472 (see
1473 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1474
1475 The total number of SGPR
1476 user data registers
1477 requested must not exceed
1478 16 and match value in
1479 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1480 Any requests beyond 16
1481 will be ignored.
1482 449 1 bit enable_sgpr_dispatch_ptr *see above*
1483 450 1 bit enable_sgpr_queue_ptr *see above*
1484 451 1 bit enable_sgpr_kernarg_segment_ptr *see above*
1485 452 1 bit enable_sgpr_dispatch_id *see above*
1486 453 1 bit enable_sgpr_flat_scratch_init *see above*
1487 454 1 bit enable_sgpr_private_segment *see above*
1488 _size
1489 455 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1490 _count_X should always be 0.
1491 456 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1492 _count_Y should always be 0.
1493 457 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1494 _count_Z should always be 0.
1495 463:458 6 bits Reserved. Must be 0.
1496 511:464 4 Reserved. Must be 0.
1497 bytes
1498 512 **Total size 64 bytes.**
1499 ======= ===================================================================
1500
1501..
1502
1503 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1504 :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
1505
Tony Tye3b340612017-06-07 00:46:08 +00001506 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001507 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001508 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001509 5:0 6 bits granulated_workitem_vgpr_count Number of vector registers
1510 used by each work-item,
1511 granularity is device
1512 specific:
1513
1514 GFX6-9
1515 roundup((max-vgpg + 1)
1516 / 4) - 1
1517
1518 Used by CP to set up
1519 ``COMPUTE_PGM_RSRC1.VGPRS``.
1520 9:6 4 bits granulated_wavefront_sgpr_count Number of scalar registers
1521 used by a wavefront,
1522 granularity is device
1523 specific:
1524
1525 GFX6-8
1526 roundup((max-sgpg + 1)
1527 / 8) - 1
1528 GFX9
1529 roundup((max-sgpg + 1)
1530 / 16) - 1
1531
1532 Includes the special SGPRs
1533 for VCC, Flat Scratch (for
1534 GFX7 onwards) and XNACK
1535 (for GFX8 onwards). It does
1536 not include the 16 SGPR
1537 added if a trap handler is
1538 enabled.
1539
1540 Used by CP to set up
1541 ``COMPUTE_PGM_RSRC1.SGPRS``.
1542 11:10 2 bits priority Must be 0.
1543
1544 Start executing wavefront
1545 at the specified priority.
1546
1547 CP is responsible for
1548 filling in
1549 ``COMPUTE_PGM_RSRC1.PRIORITY``.
1550 13:12 2 bits float_mode_round_32 Wavefront starts execution
1551 with specified rounding
1552 mode for single (32
1553 bit) floating point
1554 precision floating point
1555 operations.
1556
1557 Floating point rounding
1558 mode values are defined in
1559 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1560
1561 Used by CP to set up
1562 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1563 15:14 2 bits float_mode_round_16_64 Wavefront starts execution
1564 with specified rounding
1565 denorm mode for half/double (16
1566 and 64 bit) floating point
1567 precision floating point
1568 operations.
1569
1570 Floating point rounding
1571 mode values are defined in
1572 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1573
1574 Used by CP to set up
1575 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1576 17:16 2 bits float_mode_denorm_32 Wavefront starts execution
1577 with specified denorm mode
1578 for single (32
1579 bit) floating point
1580 precision floating point
1581 operations.
1582
1583 Floating point denorm mode
1584 values are defined in
1585 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1586
1587 Used by CP to set up
1588 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1589 19:18 2 bits float_mode_denorm_16_64 Wavefront starts execution
1590 with specified denorm mode
1591 for half/double (16
1592 and 64 bit) floating point
1593 precision floating point
1594 operations.
1595
1596 Floating point denorm mode
1597 values are defined in
1598 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1599
1600 Used by CP to set up
1601 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1602 20 1 bit priv Must be 0.
1603
1604 Start executing wavefront
1605 in privilege trap handler
1606 mode.
1607
1608 CP is responsible for
1609 filling in
1610 ``COMPUTE_PGM_RSRC1.PRIV``.
1611 21 1 bit enable_dx10_clamp Wavefront starts execution
1612 with DX10 clamp mode
1613 enabled. Used by the vector
1614 ALU to force DX-10 style
1615 treatment of NaN's (when
1616 set, clamp NaN to zero,
1617 otherwise pass NaN
1618 through).
1619
1620 Used by CP to set up
1621 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1622 22 1 bit debug_mode Must be 0.
1623
1624 Start executing wavefront
1625 in single step mode.
1626
1627 CP is responsible for
1628 filling in
1629 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1630 23 1 bit enable_ieee_mode Wavefront starts execution
1631 with IEEE mode
1632 enabled. Floating point
1633 opcodes that support
1634 exception flag gathering
1635 will quiet and propagate
1636 signaling-NaN inputs per
1637 IEEE 754-2008. Min_dx10 and
1638 max_dx10 become IEEE
1639 754-2008 compliant due to
1640 signaling-NaN propagation
1641 and quieting.
1642
1643 Used by CP to set up
1644 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1645 24 1 bit bulky Must be 0.
1646
1647 Only one work-group allowed
1648 to execute on a compute
1649 unit.
1650
1651 CP is responsible for
1652 filling in
1653 ``COMPUTE_PGM_RSRC1.BULKY``.
1654 25 1 bit cdbg_user Must be 0.
1655
1656 Flag that can be used to
1657 control debugging code.
1658
1659 CP is responsible for
1660 filling in
1661 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1662 31:26 6 bits Reserved. Must be 0.
1663 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001664 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001665
1666..
1667
1668 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1669 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1670
Tony Tye3b340612017-06-07 00:46:08 +00001671 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001672 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001673 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001674 0 1 bit enable_sgpr_private_segment Enable the setup of the
1675 _wave_offset SGPR wave scratch offset
1676 system register (see
1677 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1678
1679 Used by CP to set up
1680 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1681 5:1 5 bits user_sgpr_count The total number of SGPR
1682 user data registers
1683 requested. This number must
1684 match the number of user
1685 data registers enabled.
1686
1687 Used by CP to set up
1688 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1689 6 1 bit enable_trap_handler Set to 1 if code contains a
1690 TRAP instruction which
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00001691 requires a trap handler to
Tony Tyef16a45e2017-06-06 20:31:59 +00001692 be enabled.
1693
1694 CP sets
1695 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1696 if the runtime has
1697 installed a trap handler
1698 regardless of the setting
1699 of this field.
1700 7 1 bit enable_sgpr_workgroup_id_x Enable the setup of the
1701 system SGPR register for
1702 the work-group id in the X
1703 dimension (see
1704 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1705
1706 Used by CP to set up
1707 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1708 8 1 bit enable_sgpr_workgroup_id_y Enable the setup of the
1709 system SGPR register for
1710 the work-group id in the Y
1711 dimension (see
1712 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1713
1714 Used by CP to set up
1715 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1716 9 1 bit enable_sgpr_workgroup_id_z Enable the setup of the
1717 system SGPR register for
1718 the work-group id in the Z
1719 dimension (see
1720 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1721
1722 Used by CP to set up
1723 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1724 10 1 bit enable_sgpr_workgroup_info Enable the setup of the
1725 system SGPR register for
1726 work-group information (see
1727 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1728
1729 Used by CP to set up
1730 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1731 12:11 2 bits enable_vgpr_workitem_id Enable the setup of the
1732 VGPR system registers used
1733 for the work-item ID.
1734 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1735 defines the values.
1736
1737 Used by CP to set up
1738 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1739 13 1 bit enable_exception_address_watch Must be 0.
1740
1741 Wavefront starts execution
1742 with address watch
1743 exceptions enabled which
1744 are generated when L1 has
1745 witnessed a thread access
1746 an *address of
1747 interest*.
1748
1749 CP is responsible for
1750 filling in the address
1751 watch bit in
1752 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1753 according to what the
1754 runtime requests.
1755 14 1 bit enable_exception_memory Must be 0.
1756
1757 Wavefront starts execution
1758 with memory violation
1759 exceptions exceptions
1760 enabled which are generated
1761 when a memory violation has
1762 occurred for this wave from
1763 L1 or LDS
1764 (write-to-read-only-memory,
1765 mis-aligned atomic, LDS
1766 address out of range,
1767 illegal address, etc.).
1768
1769 CP sets the memory
1770 violation bit in
1771 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1772 according to what the
1773 runtime requests.
1774 23:15 9 bits granulated_lds_size Must be 0.
1775
1776 CP uses the rounded value
1777 from the dispatch packet,
1778 not this value, as the
1779 dispatch may contain
1780 dynamically allocated group
1781 segment memory. CP writes
1782 directly to
1783 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1784
1785 Amount of group segment
1786 (LDS) to allocate for each
1787 work-group. Granularity is
1788 device specific:
1789
1790 GFX6:
1791 roundup(lds-size / (64 * 4))
1792 GFX7-GFX9:
1793 roundup(lds-size / (128 * 4))
1794
1795 24 1 bit enable_exception_ieee_754_fp Wavefront starts execution
1796 _invalid_operation with specified exceptions
1797 enabled.
1798
1799 Used by CP to set up
1800 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1801 (set from bits 0..6).
1802
1803 IEEE 754 FP Invalid
1804 Operation
1805 25 1 bit enable_exception_fp_denormal FP Denormal one or more
1806 _source input operands is a
1807 denormal number
1808 26 1 bit enable_exception_ieee_754_fp IEEE 754 FP Division by
1809 _division_by_zero Zero
1810 27 1 bit enable_exception_ieee_754_fp IEEE 754 FP FP Overflow
1811 _overflow
1812 28 1 bit enable_exception_ieee_754_fp IEEE 754 FP Underflow
1813 _underflow
1814 29 1 bit enable_exception_ieee_754_fp IEEE 754 FP Inexact
1815 _inexact
1816 30 1 bit enable_exception_int_divide_by Integer Division by Zero
1817 _zero (rcp_iflag_f32 instruction
1818 only)
1819 31 1 bit Reserved. Must be 0.
1820 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001821 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001822
1823..
1824
1825 .. table:: Floating Point Rounding Mode Enumeration Values
1826 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1827
1828 ===================================== ===== ===============================
1829 Enumeration Name Value Description
1830 ===================================== ===== ===============================
1831 AMD_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1832 AMD_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1833 AMD_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1834 AMD_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1835 ===================================== ===== ===============================
1836
1837..
1838
1839 .. table:: Floating Point Denorm Mode Enumeration Values
1840 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1841
1842 ===================================== ===== ===============================
1843 Enumeration Name Value Description
1844 ===================================== ===== ===============================
1845 AMD_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1846 Denorms
1847 AMD_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1848 AMD_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1849 AMD_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1850 ===================================== ===== ===============================
1851
1852..
1853
1854 .. table:: System VGPR Work-Item ID Enumeration Values
1855 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1856
1857 ===================================== ===== ===============================
1858 Enumeration Name Value Description
1859 ===================================== ===== ===============================
1860 AMD_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension ID.
1861 AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1862 dimensions ID.
1863 AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1864 dimensions ID.
1865 AMD_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1866 ===================================== ===== ===============================
1867
1868.. _amdgpu-amdhsa-initial-kernel-execution-state:
1869
1870Initial Kernel Execution State
1871~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1872
1873This section defines the register state that will be set up by the packet
1874processor prior to the start of execution of every wavefront. This is limited by
1875the constraints of the hardware controllers of CP/ADC/SPI.
1876
1877The order of the SGPR registers is defined, but the compiler can specify which
1878ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1879fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1880for enabled registers are dense starting at SGPR0: the first enabled register is
1881SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1882an SGPR number.
1883
1884The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1885all waves of the grid. It is possible to specify more than 16 User SGPRs using
1886the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1887initialized. These are then immediately followed by the System SGPRs that are
1888set up by ADC/SPI and can have different values for each wave of the grid
1889dispatch.
1890
1891SGPR register initial state is defined in
1892:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1893
1894 .. table:: SGPR Register Set Up Order
1895 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1896
1897 ========== ========================== ====== ==============================
1898 SGPR Order Name Number Description
1899 (kernel descriptor enable of
1900 field) SGPRs
1901 ========== ========================== ====== ==============================
1902 First Private Segment Buffer 4 V# that can be used, together
1903 (enable_sgpr_private with Scratch Wave Offset as an
1904 _segment_buffer) offset, to access the private
1905 memory space using a segment
1906 address.
1907
1908 CP uses the value provided by
1909 the runtime.
1910 then Dispatch Ptr 2 64 bit address of AQL dispatch
1911 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
1912 actually executing.
1913 then Queue Ptr 2 64 bit address of amd_queue_t
1914 (enable_sgpr_queue_ptr) object for AQL queue on which
1915 the dispatch packet was
1916 queued.
1917 then Kernarg Segment Ptr 2 64 bit address of Kernarg
1918 (enable_sgpr_kernarg segment. This is directly
1919 _segment_ptr) copied from the
1920 kernarg_address in the kernel
1921 dispatch packet.
1922
1923 Having CP load it once avoids
1924 loading it at the beginning of
1925 every wavefront.
1926 then Dispatch Id 2 64 bit Dispatch ID of the
1927 (enable_sgpr_dispatch_id) dispatch packet being
1928 executed.
1929 then Flat Scratch Init 2 This is 2 SGPRs:
1930 (enable_sgpr_flat_scratch
1931 _init) GFX6
1932 Not supported.
1933 GFX7-GFX8
1934 The first SGPR is a 32 bit
1935 byte offset from
1936 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1937 to per SPI base of memory
1938 for scratch for the queue
1939 executing the kernel
1940 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00001941 from the runtime. (The
1942 Scratch Segment Buffer base
1943 address is
1944 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1945 plus this offset.) The value
1946 of Scratch Wave Offset must
1947 be added to this offset by
1948 the kernel machine code,
1949 right shifted by 8, and
1950 moved to the FLAT_SCRATCH_HI
1951 SGPR register.
1952 FLAT_SCRATCH_HI corresponds
1953 to SGPRn-4 on GFX7, and
1954 SGPRn-6 on GFX8 (where SGPRn
1955 is the highest numbered SGPR
1956 allocated to the wave).
1957 FLAT_SCRATCH_HI is
1958 multiplied by 256 (as it is
1959 in units of 256 bytes) and
1960 added to
1961 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1962 to calculate the per wave
1963 FLAT SCRATCH BASE in flat
1964 memory instructions that
1965 access the scratch
1966 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00001967
1968 The second SGPR is 32 bit
1969 byte size of a single
1970 work-item’s scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00001971 usage. CP obtains this from
1972 the runtime, and it is
1973 always a multiple of DWORD.
1974 CP checks that the value in
1975 the kernel dispatch packet
1976 Private Segment Byte Size is
1977 not larger, and requests the
1978 runtime to increase the
1979 queue's scratch size if
1980 necessary. The kernel code
1981 must move it to
1982 FLAT_SCRATCH_LO which is
1983 SGPRn-3 on GFX7 and SGPRn-5
1984 on GFX8. FLAT_SCRATCH_LO is
1985 used as the FLAT SCRATCH
1986 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00001987 instructions. Having CP load
1988 it once avoids loading it at
1989 the beginning of every
Tony Tye46d35762017-08-15 20:47:41 +00001990 wavefront. GFX9 This is the
1991 64 bit base address of the
1992 per SPI scratch backing
1993 memory managed by SPI for
1994 the queue executing the
1995 kernel dispatch. CP obtains
1996 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00001997 divides it if there are
1998 multiple Shader Arrays each
1999 with its own SPI). The value
2000 of Scratch Wave Offset must
2001 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002002 machine code and the result
2003 moved to the FLAT_SCRATCH
2004 SGPR which is SGPRn-6 and
2005 SGPRn-5. It is used as the
2006 FLAT SCRATCH BASE in flat
2007 memory instructions. then
2008 Private Segment Size 1 The
2009 32 bit byte size of a
2010 (enable_sgpr_private single
2011 work-item's
2012 scratch_segment_size) memory
2013 allocation. This is the
2014 value from the kernel
2015 dispatch packet Private
2016 Segment Byte Size rounded up
2017 by CP to a multiple of
2018 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002019
2020 Having CP load it once avoids
2021 loading it at the beginning of
2022 every wavefront.
2023
2024 This is not used for
2025 GFX7-GFX8 since it is the same
2026 value as the second SGPR of
2027 Flat Scratch Init. However, it
2028 may be needed for GFX9 which
2029 changes the meaning of the
2030 Flat Scratch Init value.
2031 then Grid Work-Group Count X 1 32 bit count of the number of
2032 (enable_sgpr_grid work-groups in the X dimension
2033 _workgroup_count_X) for the grid being
2034 executed. Computed from the
2035 fields in the kernel dispatch
2036 packet as ((grid_size.x +
2037 workgroup_size.x - 1) /
2038 workgroup_size.x).
2039 then Grid Work-Group Count Y 1 32 bit count of the number of
2040 (enable_sgpr_grid work-groups in the Y dimension
2041 _workgroup_count_Y && for the grid being
2042 less than 16 previous executed. Computed from the
2043 SGPRs) fields in the kernel dispatch
2044 packet as ((grid_size.y +
2045 workgroup_size.y - 1) /
2046 workgroupSize.y).
2047
2048 Only initialized if <16
2049 previous SGPRs initialized.
2050 then Grid Work-Group Count Z 1 32 bit count of the number of
2051 (enable_sgpr_grid work-groups in the Z dimension
2052 _workgroup_count_Z && for the grid being
2053 less than 16 previous executed. Computed from the
2054 SGPRs) fields in the kernel dispatch
2055 packet as ((grid_size.z +
2056 workgroup_size.z - 1) /
2057 workgroupSize.z).
2058
2059 Only initialized if <16
2060 previous SGPRs initialized.
2061 then Work-Group Id X 1 32 bit work-group id in X
2062 (enable_sgpr_workgroup_id dimension of grid for
2063 _X) wavefront.
2064 then Work-Group Id Y 1 32 bit work-group id in Y
2065 (enable_sgpr_workgroup_id dimension of grid for
2066 _Y) wavefront.
2067 then Work-Group Id Z 1 32 bit work-group id in Z
2068 (enable_sgpr_workgroup_id dimension of grid for
2069 _Z) wavefront.
2070 then Work-Group Info 1 {first_wave, 14’b0000,
2071 (enable_sgpr_workgroup ordered_append_term[10:0],
2072 _info) threadgroup_size_in_waves[5:0]}
2073 then Scratch Wave Offset 1 32 bit byte offset from base
2074 (enable_sgpr_private of scratch base of queue
2075 _segment_wave_offset) executing the kernel
2076 dispatch. Must be used as an
2077 offset with Private
2078 segment address when using
2079 Scratch Segment Buffer. It
2080 must be used to set up FLAT
2081 SCRATCH for flat addressing
2082 (see
2083 :ref:`amdgpu-amdhsa-flat-scratch`).
2084 ========== ========================== ====== ==============================
2085
2086The order of the VGPR registers is defined, but the compiler can specify which
2087ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2088fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2089for enabled registers are dense starting at VGPR0: the first enabled register is
2090VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2091VGPR number.
2092
2093VGPR register initial state is defined in
2094:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2095
2096 .. table:: VGPR Register Set Up Order
2097 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2098
2099 ========== ========================== ====== ==============================
2100 VGPR Order Name Number Description
2101 (kernel descriptor enable of
2102 field) VGPRs
2103 ========== ========================== ====== ==============================
2104 First Work-Item Id X 1 32 bit work item id in X
2105 (Always initialized) dimension of work-group for
2106 wavefront lane.
2107 then Work-Item Id Y 1 32 bit work item id in Y
2108 (enable_vgpr_workitem_id dimension of work-group for
2109 > 0) wavefront lane.
2110 then Work-Item Id Z 1 32 bit work item id in Z
2111 (enable_vgpr_workitem_id dimension of work-group for
2112 > 1) wavefront lane.
2113 ========== ========================== ====== ==============================
2114
2115The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2116
21171. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2118 registers.
21192. Work-group Id registers X, Y, Z are set by ADC which supports any
2120 combination including none.
21213. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2122 cannot included with the flat scratch init value which is per queue.
21234. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2124 or (X, Y, Z).
2125
2126Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2127value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2128
2129The global segment can be accessed either using buffer instructions (GFX6 which
2130has V# 64 bit address support), flat instructions (GFX7-9), or global
2131instructions (GFX9).
2132
2133If buffer operations are used then the compiler can generate a V# with the
2134following properties:
2135
2136* base address of 0
2137* no swizzle
2138* ATC: 1 if IOMMU present (such as APU)
2139* ptr64: 1
2140* MTYPE set to support memory coherence that matches the runtime (such as CC for
2141 APU and NC for dGPU).
2142
2143.. _amdgpu-amdhsa-kernel-prolog:
2144
2145Kernel Prolog
2146~~~~~~~~~~~~~
2147
2148.. _amdgpu-amdhsa-m0:
2149
2150M0
2151++
2152
2153GFX6-GFX8
2154 The M0 register must be initialized with a value at least the total LDS size
2155 if the kernel may access LDS via DS or flat operations. Total LDS size is
2156 available in dispatch packet. For M0, it is also possible to use maximum
2157 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2158 GFX7-GFX8).
2159GFX9
2160 The M0 register is not used for range checking LDS accesses and so does not
2161 need to be initialized in the prolog.
2162
2163.. _amdgpu-amdhsa-flat-scratch:
2164
2165Flat Scratch
2166++++++++++++
2167
2168If the kernel may use flat operations to access scratch memory, the prolog code
2169must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2170are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2171Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2172
2173GFX6
2174 Flat scratch is not supported.
2175
2176GFX7-8
2177 1. The low word of Flat Scratch Init is 32 bit byte offset from
2178 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2179 being managed by SPI for the queue executing the kernel dispatch. This is
2180 the same value used in the Scratch Segment Buffer V# base address. The
2181 prolog must add the value of Scratch Wave Offset to get the wave's byte
2182 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2183 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2184 by 8 before moving into FLAT_SCRATCH_LO.
2185 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2186 work-items scratch memory usage. This is directly loaded from the kernel
2187 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2188 DWORD. Having CP load it once avoids loading it at the beginning of every
2189 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2190 SIZE.
2191GFX9
2192 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2193 memory being managed by SPI for the queue executing the kernel dispatch. The
2194 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2195 pair for use as the flat scratch base in flat memory instructions.
2196
2197.. _amdgpu-amdhsa-memory-model:
2198
2199Memory Model
2200~~~~~~~~~~~~
2201
2202This section describes the mapping of LLVM memory model onto AMDGPU machine code
2203(see :ref:`memmodel`). *The implementation is WIP.*
2204
2205.. TODO
2206 Update when implementation complete.
2207
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002208 Support more relaxed OpenCL memory model to be controlled by environment
Tony Tyef16a45e2017-06-06 20:31:59 +00002209 component of target triple.
2210
2211The AMDGPU backend supports the memory synchronization scopes specified in
2212:ref:`amdgpu-memory-scopes`.
2213
2214The code sequences used to implement the memory model are defined in table
2215:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2216
2217The sequences specify the order of instructions that a single thread must
2218execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2219to other memory instructions executed by the same thread. This allows them to be
2220moved earlier or later which can allow them to be combined with other instances
2221of the same instruction, or hoisted/sunk out of loops to improve
2222performance. Only the instructions related to the memory model are given;
2223additional ``s_waitcnt`` instructions are required to ensure registers are
2224defined before being used. These may be able to be combined with the memory
2225model ``s_waitcnt`` instructions as described above.
2226
2227The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
2228OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
2229relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
2230memory model which has separate happens-before relations for the global and
2231local address spaces, and only a fence specifying both global and local address
2232space joins the relationships. Since the LLVM ``memfence`` instruction does not
2233allow an address space to be specified the OpenCL fence has to convervatively
2234assume both local and global address space was specified. However, optimizations
2235can often be done to eliminate the additional ``s_waitcnt``instructions when
2236there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
2237instructions. The code sequences in the table indicate what can be omitted for
2238the OpenCL memory. The target triple environment is used to determine if the
2239source language is OpenCL (see :ref:`amdgpu-opencl`).
2240
2241``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2242operations.
2243
2244``buffer/global/flat_load/store/atomic`` instructions to global memory are
2245termed vector memory operations.
2246
2247For GFX6-GFX9:
2248
2249* Each agent has multiple compute units (CU).
2250* Each CU has multiple SIMDs that execute wavefronts.
2251* The wavefronts for a single work-group are executed in the same CU but may be
2252 executed by different SIMDs.
2253* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2254 executing on it.
2255* All LDS operations of a CU are performed as wavefront wide operations in a
2256 global order and involve no caching. Completion is reported to a wavefront in
2257 execution order.
2258* The LDS memory has multiple request queues shared by the SIMDs of a
2259 CU. Therefore, the LDS operations performed by different waves of a work-group
2260 can be reordered relative to each other, which can result in reordering the
2261 visibility of vector memory operations with respect to LDS operations of other
2262 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002263 ensure synchronization between LDS operations and vector memory operations
Tony Tyef16a45e2017-06-06 20:31:59 +00002264 between waves of a work-group, but not between operations performed by the
2265 same wavefront.
2266* The vector memory operations are performed as wavefront wide operations and
2267 completion is reported to a wavefront in execution order. The exception is
2268 that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2269 vector memory order if they access LDS memory, and out of LDS operation order
2270 if they access global memory.
2271* The vector memory operations access a vector L1 cache shared by all wavefronts
2272 on a CU. Therefore, no special action is required for coherence between
2273 wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
2274 coherence between waves executing in different work-groups as they may be
2275 executing on different CUs.
2276* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2277 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2278 scalar operations are used in a restricted way so do not impact the memory
2279 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2280* The vector and scalar memory operations use an L2 cache shared by all CUs on
2281 the same agent.
2282* The L2 cache has independent channels to service disjoint ranges of virtual
2283 addresses.
2284* Each CU has a separate request queue per channel. Therefore, the vector and
2285 scalar memory operations performed by waves executing in different work-groups
2286 (which may be executing on different CUs) of an agent can be reordered
2287 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002288 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002289 previous vector memory operation has completed before executing a subsequent
2290 vector memory or LDS operation and so can be used to meet the requirements of
2291 acquire and release.
2292* The L2 cache can be kept coherent with other agents on some targets, or ranges
2293 of virtual addresses can be set up to bypass it to ensure system coherence.
2294
2295Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2296or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2297memory, atomic memory orderings are not meaningful and all accesses are treated
2298as non-atomic.
2299
2300Constant address space uses ``buffer/global_load`` instructions (or equivalent
2301scalar memory instructions). Since the constant address space contents do not
2302change during the execution of a kernel dispatch it is not legal to perform
2303stores, and atomic memory orderings are not meaningful and all access are
2304treated as non-atomic.
2305
2306A memory synchronization scope wider than work-group is not meaningful for the
2307group (LDS) address space and is treated as work-group.
2308
2309The memory model does not support the region address space which is treated as
2310non-atomic.
2311
2312Acquire memory ordering is not meaningful on store atomic instructions and is
2313treated as non-atomic.
2314
2315Release memory ordering is not meaningful on load atomic instructions and is
2316treated a non-atomic.
2317
2318Acquire-release memory ordering is not meaningful on load or store atomic
2319instructions and is treated as acquire and release respectively.
2320
2321AMDGPU backend only uses scalar memory operations to access memory that is
2322proven to not change during the execution of the kernel dispatch. This includes
2323constant address space and global address space for program scope const
2324variables. Therefore the kernel machine code does not have to maintain the
2325scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2326and vector L1 caches are invalidated between kernel dispatches by CP since
2327constant address space data may change between kernel dispatch executions. See
2328:ref:`amdgpu-amdhsa-memory-spaces`.
2329
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002330The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002331case the AMDGPU backend ensures the memory location used to spill is never
2332accessed by vector memory operations at the same time. If scalar writes are used
2333then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2334return since the locations may be used for vector memory instructions by a
2335future wave that uses the same scratch area, or a function call that creates a
2336frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2337as all scalar writes are write-before-read in the same thread.
2338
2339Scratch backing memory (which is used for the private address space) is accessed
2340with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
2341is only accessed by a single thread, and is always write-before-read,
2342there is never a need to invalidate these entries from the L1 cache. Hence all
2343cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
2344lines.
2345
2346On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2347to invalidate the L2 cache. This also causes it to be treated as non-volatile
2348and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
2349coherent) and so the L2 cache will coherent with the CPU and other agents.
2350
2351 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2352 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2353
2354 ============ ============ ============== ========== =======================
2355 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2356 Ordering Sync Scope Address
2357 Space
2358 ============ ============ ============== ========== =======================
2359 **Non-Atomic**
2360 ---------------------------------------------------------------------------
2361 load *none* *none* - global non-volatile
2362 - generic 1. buffer/global/flat_load
2363 volatile
2364 1. buffer/global/flat_load
2365 glc=1
2366 load *none* *none* - local 1. ds_load
2367 store *none* *none* - global 1. buffer/global/flat_store
2368 - generic
2369 store *none* *none* - local 1. ds_store
2370 **Unordered Atomic**
2371 ---------------------------------------------------------------------------
2372 load atomic unordered *any* *any* *Same as non-atomic*.
2373 store atomic unordered *any* *any* *Same as non-atomic*.
2374 atomicrmw unordered *any* *any* *Same as monotonic
2375 atomic*.
2376 **Monotonic Atomic**
2377 ---------------------------------------------------------------------------
2378 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2379 - wavefront - generic
2380 - workgroup
2381 load atomic monotonic - singlethread - local 1. ds_load
2382 - wavefront
2383 - workgroup
2384 load atomic monotonic - agent - global 1. buffer/global/flat_load
2385 - system - generic glc=1
2386 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2387 - wavefront - generic
2388 - workgroup
2389 - agent
2390 - system
2391 store atomic monotonic - singlethread - local 1. ds_store
2392 - wavefront
2393 - workgroup
2394 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2395 - wavefront - generic
2396 - workgroup
2397 - agent
2398 - system
2399 atomicrmw monotonic - singlethread - local 1. ds_atomic
2400 - wavefront
2401 - workgroup
2402 **Acquire Atomic**
2403 ---------------------------------------------------------------------------
2404 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2405 - wavefront - local
2406 - generic
2407 load atomic acquire - workgroup - global 1. buffer/global_load
2408 load atomic acquire - workgroup - local 1. ds/flat_load
2409 - generic 2. s_waitcnt lgkmcnt(0)
2410
2411 - If OpenCL, omit
2412 waitcnt.
2413 - Must happen before
2414 any following
2415 global/generic
2416 load/load
2417 atomic/store/store
2418 atomic/atomicrmw.
2419 - Ensures any
2420 following global
2421 data read is no
2422 older than the load
2423 atomic value being
2424 acquired.
2425
2426 load atomic acquire - agent - global 1. buffer/global_load
2427 - system glc=1
2428 2. s_waitcnt vmcnt(0)
2429
2430 - Must happen before
2431 following
2432 buffer_wbinvl1_vol.
2433 - Ensures the load
2434 has completed
2435 before invalidating
2436 the cache.
2437
2438 3. buffer_wbinvl1_vol
2439
2440 - Must happen before
2441 any following
2442 global/generic
2443 load/load
2444 atomic/atomicrmw.
2445 - Ensures that
2446 following
2447 loads will not see
2448 stale global data.
2449
2450 load atomic acquire - agent - generic 1. flat_load glc=1
2451 - system 2. s_waitcnt vmcnt(0) &
2452 lgkmcnt(0)
2453
2454 - If OpenCL omit
2455 lgkmcnt(0).
2456 - Must happen before
2457 following
2458 buffer_wbinvl1_vol.
2459 - Ensures the flat_load
2460 has completed
2461 before invalidating
2462 the cache.
2463
2464 3. buffer_wbinvl1_vol
2465
2466 - Must happen before
2467 any following
2468 global/generic
2469 load/load
2470 atomic/atomicrmw.
2471 - Ensures that
2472 following loads
2473 will not see stale
2474 global data.
2475
2476 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2477 - wavefront - local
2478 - generic
2479 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
2480 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
2481 - generic 2. waitcnt lgkmcnt(0)
2482
2483 - If OpenCL, omit
2484 waitcnt.
2485 - Must happen before
2486 any following
2487 global/generic
2488 load/load
2489 atomic/store/store
2490 atomic/atomicrmw.
2491 - Ensures any
2492 following global
2493 data read is no
2494 older than the
2495 atomicrmw value
2496 being acquired.
2497
2498 atomicrmw acquire - agent - global 1. buffer/global_atomic
2499 - system 2. s_waitcnt vmcnt(0)
2500
2501 - Must happen before
2502 following
2503 buffer_wbinvl1_vol.
2504 - Ensures the
2505 atomicrmw has
2506 completed before
2507 invalidating the
2508 cache.
2509
2510 3. buffer_wbinvl1_vol
2511
2512 - Must happen before
2513 any following
2514 global/generic
2515 load/load
2516 atomic/atomicrmw.
2517 - Ensures that
2518 following loads
2519 will not see stale
2520 global data.
2521
2522 atomicrmw acquire - agent - generic 1. flat_atomic
2523 - system 2. s_waitcnt vmcnt(0) &
2524 lgkmcnt(0)
2525
2526 - If OpenCL, omit
2527 lgkmcnt(0).
2528 - Must happen before
2529 following
2530 buffer_wbinvl1_vol.
2531 - Ensures the
2532 atomicrmw has
2533 completed before
2534 invalidating the
2535 cache.
2536
2537 3. buffer_wbinvl1_vol
2538
2539 - Must happen before
2540 any following
2541 global/generic
2542 load/load
2543 atomic/atomicrmw.
2544 - Ensures that
2545 following loads
2546 will not see stale
2547 global data.
2548
2549 fence acquire - singlethread *none* *none*
2550 - wavefront
2551 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2552
2553 - If OpenCL and
2554 address space is
2555 not generic, omit
2556 waitcnt. However,
2557 since LLVM
2558 currently has no
2559 address space on
2560 the fence need to
2561 conservatively
2562 always generate. If
2563 fence had an
2564 address space then
2565 set to address
2566 space of OpenCL
2567 fence flag, or to
2568 generic if both
2569 local and global
2570 flags are
2571 specified.
2572 - Must happen after
2573 any preceding
2574 local/generic load
2575 atomic/atomicrmw
2576 with an equal or
2577 wider sync scope
2578 and memory ordering
2579 stronger than
2580 unordered (this is
2581 termed the
2582 fence-paired-atomic).
2583 - Must happen before
2584 any following
2585 global/generic
2586 load/load
2587 atomic/store/store
2588 atomic/atomicrmw.
2589 - Ensures any
2590 following global
2591 data read is no
2592 older than the
2593 value read by the
2594 fence-paired-atomic.
2595
2596 fence acquire - agent *none* 1. s_waitcnt vmcnt(0) &
2597 - system lgkmcnt(0)
2598
2599 - If OpenCL and
2600 address space is
2601 not generic, omit
2602 lgkmcnt(0).
2603 However, since LLVM
2604 currently has no
2605 address space on
2606 the fence need to
2607 conservatively
2608 always generate
2609 (see comment for
2610 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002611 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002612 separate s_waitcnt
2613 vmcnt(0) and
2614 s_waitcnt
2615 lgkmcnt(0) to allow
2616 them to be
2617 independently moved
2618 according to the
2619 following rules.
2620 - s_waitcnt vmcnt(0)
2621 must happen after
2622 any preceding
2623 global/generic load
2624 atomic/atomicrmw
2625 with an equal or
2626 wider sync scope
2627 and memory ordering
2628 stronger than
2629 unordered (this is
2630 termed the
2631 fence-paired-atomic).
2632 - s_waitcnt lgkmcnt(0)
2633 must happen after
2634 any preceding
2635 group/generic load
2636 atomic/atomicrmw
2637 with an equal or
2638 wider sync scope
2639 and memory ordering
2640 stronger than
2641 unordered (this is
2642 termed the
2643 fence-paired-atomic).
2644 - Must happen before
2645 the following
2646 buffer_wbinvl1_vol.
2647 - Ensures that the
2648 fence-paired atomic
2649 has completed
2650 before invalidating
2651 the
2652 cache. Therefore
2653 any following
2654 locations read must
2655 be no older than
2656 the value read by
2657 the
2658 fence-paired-atomic.
2659
2660 2. buffer_wbinvl1_vol
2661
2662 - Must happen before
2663 any following global/generic
2664 load/load
2665 atomic/store/store
2666 atomic/atomicrmw.
2667 - Ensures that
2668 following loads
2669 will not see stale
2670 global data.
2671
2672 **Release Atomic**
2673 ---------------------------------------------------------------------------
2674 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2675 - wavefront - local
2676 - generic
2677 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2678 - generic
2679 - If OpenCL, omit
2680 waitcnt.
2681 - Must happen after
2682 any preceding
2683 local/generic
2684 load/store/load
2685 atomic/store
2686 atomic/atomicrmw.
2687 - Must happen before
2688 the following
2689 store.
2690 - Ensures that all
2691 memory operations
2692 to local have
2693 completed before
2694 performing the
2695 store that is being
2696 released.
2697
2698 2. buffer/global/flat_store
2699 store atomic release - workgroup - local 1. ds_store
2700 store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
2701 - system - generic lgkmcnt(0)
2702
2703 - If OpenCL, omit
2704 lgkmcnt(0).
2705 - Could be split into
2706 separate s_waitcnt
2707 vmcnt(0) and
2708 s_waitcnt
2709 lgkmcnt(0) to allow
2710 them to be
2711 independently moved
2712 according to the
2713 following rules.
2714 - s_waitcnt vmcnt(0)
2715 must happen after
2716 any preceding
2717 global/generic
2718 load/store/load
2719 atomic/store
2720 atomic/atomicrmw.
2721 - s_waitcnt lgkmcnt(0)
2722 must happen after
2723 any preceding
2724 local/generic
2725 load/store/load
2726 atomic/store
2727 atomic/atomicrmw.
2728 - Must happen before
2729 the following
2730 store.
2731 - Ensures that all
2732 memory operations
2733 to global have
2734 completed before
2735 performing the
2736 store that is being
2737 released.
2738
2739 2. buffer/global/ds/flat_store
2740 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2741 - wavefront - local
2742 - generic
2743 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2744 - generic
2745 - If OpenCL, omit
2746 waitcnt.
2747 - Must happen after
2748 any preceding
2749 local/generic
2750 load/store/load
2751 atomic/store
2752 atomic/atomicrmw.
2753 - Must happen before
2754 the following
2755 atomicrmw.
2756 - Ensures that all
2757 memory operations
2758 to local have
2759 completed before
2760 performing the
2761 atomicrmw that is
2762 being released.
2763
2764 2. buffer/global/flat_atomic
2765 atomicrmw release - workgroup - local 1. ds_atomic
2766 atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
2767 - system - generic lgkmcnt(0)
2768
2769 - If OpenCL, omit
2770 lgkmcnt(0).
2771 - Could be split into
2772 separate s_waitcnt
2773 vmcnt(0) and
2774 s_waitcnt
2775 lgkmcnt(0) to allow
2776 them to be
2777 independently moved
2778 according to the
2779 following rules.
2780 - s_waitcnt vmcnt(0)
2781 must happen after
2782 any preceding
2783 global/generic
2784 load/store/load
2785 atomic/store
2786 atomic/atomicrmw.
2787 - s_waitcnt lgkmcnt(0)
2788 must happen after
2789 any preceding
2790 local/generic
2791 load/store/load
2792 atomic/store
2793 atomic/atomicrmw.
2794 - Must happen before
2795 the following
2796 atomicrmw.
2797 - Ensures that all
2798 memory operations
2799 to global and local
2800 have completed
2801 before performing
2802 the atomicrmw that
2803 is being released.
2804
2805 2. buffer/global/ds/flat_atomic*
2806 fence release - singlethread *none* *none*
2807 - wavefront
2808 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2809
2810 - If OpenCL and
2811 address space is
2812 not generic, omit
2813 waitcnt. However,
2814 since LLVM
2815 currently has no
2816 address space on
2817 the fence need to
2818 conservatively
2819 always generate
2820 (see comment for
2821 previous fence).
2822 - Must happen after
2823 any preceding
2824 local/generic
2825 load/load
2826 atomic/store/store
2827 atomic/atomicrmw.
2828 - Must happen before
2829 any following store
2830 atomic/atomicrmw
2831 with an equal or
2832 wider sync scope
2833 and memory ordering
2834 stronger than
2835 unordered (this is
2836 termed the
2837 fence-paired-atomic).
2838 - Ensures that all
2839 memory operations
2840 to local have
2841 completed before
2842 performing the
2843 following
2844 fence-paired-atomic.
2845
2846 fence release - agent *none* 1. s_waitcnt vmcnt(0) &
2847 - system lgkmcnt(0)
2848
2849 - If OpenCL and
2850 address space is
2851 not generic, omit
2852 lgkmcnt(0).
2853 However, since LLVM
2854 currently has no
2855 address space on
2856 the fence need to
2857 conservatively
2858 always generate
2859 (see comment for
2860 previous fence).
2861 - Could be split into
2862 separate s_waitcnt
2863 vmcnt(0) and
2864 s_waitcnt
2865 lgkmcnt(0) to allow
2866 them to be
2867 independently moved
2868 according to the
2869 following rules.
2870 - s_waitcnt vmcnt(0)
2871 must happen after
2872 any preceding
2873 global/generic
2874 load/store/load
2875 atomic/store
2876 atomic/atomicrmw.
2877 - s_waitcnt lgkmcnt(0)
2878 must happen after
2879 any preceding
2880 local/generic
2881 load/store/load
2882 atomic/store
2883 atomic/atomicrmw.
2884 - Must happen before
2885 any following store
2886 atomic/atomicrmw
2887 with an equal or
2888 wider sync scope
2889 and memory ordering
2890 stronger than
2891 unordered (this is
2892 termed the
2893 fence-paired-atomic).
2894 - Ensures that all
2895 memory operations
2896 to global have
2897 completed before
2898 performing the
2899 following
2900 fence-paired-atomic.
2901
2902 **Acquire-Release Atomic**
2903 ---------------------------------------------------------------------------
2904 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
2905 - wavefront - local
2906 - generic
2907 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
2908
2909 - If OpenCL, omit
2910 waitcnt.
2911 - Must happen after
2912 any preceding
2913 local/generic
2914 load/store/load
2915 atomic/store
2916 atomic/atomicrmw.
2917 - Must happen before
2918 the following
2919 atomicrmw.
2920 - Ensures that all
2921 memory operations
2922 to local have
2923 completed before
2924 performing the
2925 atomicrmw that is
2926 being released.
2927
2928 2. buffer/global_atomic
2929 atomicrmw acq_rel - workgroup - local 1. ds_atomic
2930 2. s_waitcnt lgkmcnt(0)
2931
2932 - If OpenCL, omit
2933 waitcnt.
2934 - Must happen before
2935 any following
2936 global/generic
2937 load/load
2938 atomic/store/store
2939 atomic/atomicrmw.
2940 - Ensures any
2941 following global
2942 data read is no
2943 older than the load
2944 atomic value being
2945 acquired.
2946
2947 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2948
2949 - If OpenCL, omit
2950 waitcnt.
2951 - Must happen after
2952 any preceding
2953 local/generic
2954 load/store/load
2955 atomic/store
2956 atomic/atomicrmw.
2957 - Must happen before
2958 the following
2959 atomicrmw.
2960 - Ensures that all
2961 memory operations
2962 to local have
2963 completed before
2964 performing the
2965 atomicrmw that is
2966 being released.
2967
2968 2. flat_atomic
2969 3. s_waitcnt lgkmcnt(0)
2970
2971 - If OpenCL, omit
2972 waitcnt.
2973 - Must happen before
2974 any following
2975 global/generic
2976 load/load
2977 atomic/store/store
2978 atomic/atomicrmw.
2979 - Ensures any
2980 following global
2981 data read is no
2982 older than the load
2983 atomic value being
2984 acquired.
2985 atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
2986 - system lgkmcnt(0)
2987
2988 - If OpenCL, omit
2989 lgkmcnt(0).
2990 - Could be split into
2991 separate s_waitcnt
2992 vmcnt(0) and
2993 s_waitcnt
2994 lgkmcnt(0) to allow
2995 them to be
2996 independently moved
2997 according to the
2998 following rules.
2999 - s_waitcnt vmcnt(0)
3000 must happen after
3001 any preceding
3002 global/generic
3003 load/store/load
3004 atomic/store
3005 atomic/atomicrmw.
3006 - s_waitcnt lgkmcnt(0)
3007 must happen after
3008 any preceding
3009 local/generic
3010 load/store/load
3011 atomic/store
3012 atomic/atomicrmw.
3013 - Must happen before
3014 the following
3015 atomicrmw.
3016 - Ensures that all
3017 memory operations
3018 to global have
3019 completed before
3020 performing the
3021 atomicrmw that is
3022 being released.
3023
3024 2. buffer/global_atomic
3025 3. s_waitcnt vmcnt(0)
3026
3027 - Must happen before
3028 following
3029 buffer_wbinvl1_vol.
3030 - Ensures the
3031 atomicrmw has
3032 completed before
3033 invalidating the
3034 cache.
3035
3036 4. buffer_wbinvl1_vol
3037
3038 - Must happen before
3039 any following
3040 global/generic
3041 load/load
3042 atomic/atomicrmw.
3043 - Ensures that
3044 following loads
3045 will not see stale
3046 global data.
3047
3048 atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
3049 - system lgkmcnt(0)
3050
3051 - If OpenCL, omit
3052 lgkmcnt(0).
3053 - Could be split into
3054 separate s_waitcnt
3055 vmcnt(0) and
3056 s_waitcnt
3057 lgkmcnt(0) to allow
3058 them to be
3059 independently moved
3060 according to the
3061 following rules.
3062 - s_waitcnt vmcnt(0)
3063 must happen after
3064 any preceding
3065 global/generic
3066 load/store/load
3067 atomic/store
3068 atomic/atomicrmw.
3069 - s_waitcnt lgkmcnt(0)
3070 must happen after
3071 any preceding
3072 local/generic
3073 load/store/load
3074 atomic/store
3075 atomic/atomicrmw.
3076 - Must happen before
3077 the following
3078 atomicrmw.
3079 - Ensures that all
3080 memory operations
3081 to global have
3082 completed before
3083 performing the
3084 atomicrmw that is
3085 being released.
3086
3087 2. flat_atomic
3088 3. s_waitcnt vmcnt(0) &
3089 lgkmcnt(0)
3090
3091 - If OpenCL, omit
3092 lgkmcnt(0).
3093 - Must happen before
3094 following
3095 buffer_wbinvl1_vol.
3096 - Ensures the
3097 atomicrmw has
3098 completed before
3099 invalidating the
3100 cache.
3101
3102 4. buffer_wbinvl1_vol
3103
3104 - Must happen before
3105 any following
3106 global/generic
3107 load/load
3108 atomic/atomicrmw.
3109 - Ensures that
3110 following loads
3111 will not see stale
3112 global data.
3113
3114 fence acq_rel - singlethread *none* *none*
3115 - wavefront
3116 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3117
3118 - If OpenCL and
3119 address space is
3120 not generic, omit
3121 waitcnt. However,
3122 since LLVM
3123 currently has no
3124 address space on
3125 the fence need to
3126 conservatively
3127 always generate
3128 (see comment for
3129 previous fence).
3130 - Must happen after
3131 any preceding
3132 local/generic
3133 load/load
3134 atomic/store/store
3135 atomic/atomicrmw.
3136 - Must happen before
3137 any following
3138 global/generic
3139 load/load
3140 atomic/store/store
3141 atomic/atomicrmw.
3142 - Ensures that all
3143 memory operations
3144 to local have
3145 completed before
3146 performing any
3147 following global
3148 memory operations.
3149 - Ensures that the
3150 preceding
3151 local/generic load
3152 atomic/atomicrmw
3153 with an equal or
3154 wider sync scope
3155 and memory ordering
3156 stronger than
3157 unordered (this is
3158 termed the
3159 fence-paired-atomic)
3160 has completed
3161 before following
3162 global memory
3163 operations. This
3164 satisfies the
3165 requirements of
3166 acquire.
3167 - Ensures that all
3168 previous memory
3169 operations have
3170 completed before a
3171 following
3172 local/generic store
3173 atomic/atomicrmw
3174 with an equal or
3175 wider sync scope
3176 and memory ordering
3177 stronger than
3178 unordered (this is
3179 termed the
3180 fence-paired-atomic).
3181 This satisfies the
3182 requirements of
3183 release.
3184
3185 fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) &
3186 - system lgkmcnt(0)
3187
3188 - If OpenCL and
3189 address space is
3190 not generic, omit
3191 lgkmcnt(0).
3192 However, since LLVM
3193 currently has no
3194 address space on
3195 the fence need to
3196 conservatively
3197 always generate
3198 (see comment for
3199 previous fence).
3200 - Could be split into
3201 separate s_waitcnt
3202 vmcnt(0) and
3203 s_waitcnt
3204 lgkmcnt(0) to allow
3205 them to be
3206 independently moved
3207 according to the
3208 following rules.
3209 - s_waitcnt vmcnt(0)
3210 must happen after
3211 any preceding
3212 global/generic
3213 load/store/load
3214 atomic/store
3215 atomic/atomicrmw.
3216 - s_waitcnt lgkmcnt(0)
3217 must happen after
3218 any preceding
3219 local/generic
3220 load/store/load
3221 atomic/store
3222 atomic/atomicrmw.
3223 - Must happen before
3224 the following
3225 buffer_wbinvl1_vol.
3226 - Ensures that the
3227 preceding
3228 global/local/generic
3229 load
3230 atomic/atomicrmw
3231 with an equal or
3232 wider sync scope
3233 and memory ordering
3234 stronger than
3235 unordered (this is
3236 termed the
3237 fence-paired-atomic)
3238 has completed
3239 before invalidating
3240 the cache. This
3241 satisfies the
3242 requirements of
3243 acquire.
3244 - Ensures that all
3245 previous memory
3246 operations have
3247 completed before a
3248 following
3249 global/local/generic
3250 store
3251 atomic/atomicrmw
3252 with an equal or
3253 wider sync scope
3254 and memory ordering
3255 stronger than
3256 unordered (this is
3257 termed the
3258 fence-paired-atomic).
3259 This satisfies the
3260 requirements of
3261 release.
3262
3263 2. buffer_wbinvl1_vol
3264
3265 - Must happen before
3266 any following
3267 global/generic
3268 load/load
3269 atomic/store/store
3270 atomic/atomicrmw.
3271 - Ensures that
3272 following loads
3273 will not see stale
3274 global data. This
3275 satisfies the
3276 requirements of
3277 acquire.
3278
3279 **Sequential Consistent Atomic**
3280 ---------------------------------------------------------------------------
3281 load atomic seq_cst - singlethread - global *Same as corresponding
3282 - wavefront - local load atomic acquire*.
3283 - workgroup - generic
3284 load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
3285 - system - local
3286 - generic - Must happen after
3287 preceding
3288 global/generic load
3289 atomic/store
3290 atomic/atomicrmw
3291 with memory
3292 ordering of seq_cst
3293 and with equal or
3294 wider sync scope.
3295 (Note that seq_cst
3296 fences have their
3297 own s_waitcnt
3298 vmcnt(0) and so do
3299 not need to be
3300 considered.)
3301 - Ensures any
3302 preceding
3303 sequential
3304 consistent global
3305 memory instructions
3306 have completed
3307 before executing
3308 this sequentially
3309 consistent
3310 instruction. This
3311 prevents reordering
3312 a seq_cst store
3313 followed by a
3314 seq_cst load (Note
3315 that seq_cst is
3316 stronger than
3317 acquire/release as
3318 the reordering of
3319 load acquire
3320 followed by a store
3321 release is
3322 prevented by the
3323 waitcnt vmcnt(0) of
3324 the release, but
3325 there is nothing
3326 preventing a store
3327 release followed by
3328 load acquire from
3329 competing out of
3330 order.)
3331
3332 2. *Following
3333 instructions same as
3334 corresponding load
3335 atomic acquire*.
3336
3337 store atomic seq_cst - singlethread - global *Same as corresponding
3338 - wavefront - local store atomic release*.
3339 - workgroup - generic
3340 store atomic seq_cst - agent - global *Same as corresponding
3341 - system - generic store atomic release*.
3342 atomicrmw seq_cst - singlethread - global *Same as corresponding
3343 - wavefront - local atomicrmw acq_rel*.
3344 - workgroup - generic
3345 atomicrmw seq_cst - agent - global *Same as corresponding
3346 - system - generic atomicrmw acq_rel*.
3347 fence seq_cst - singlethread *none* *Same as corresponding
3348 - wavefront fence acq_rel*.
3349 - workgroup
3350 - agent
3351 - system
3352 ============ ============ ============== ========== =======================
3353
3354The memory order also adds the single thread optimization constrains defined in
3355table
3356:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3357
3358 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3359 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3360
3361 ============ ==============================================================
3362 LLVM Memory Optimization Constraints
3363 Ordering
3364 ============ ==============================================================
3365 unordered *none*
3366 monotonic *none*
3367 acquire - If a load atomic/atomicrmw then no following load/load
3368 atomic/store/ store atomic/atomicrmw/fence instruction can
3369 be moved before the acquire.
3370 - If a fence then same as load atomic, plus no preceding
3371 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003372 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003373 atomic/store/ store atomic/atomicrmw/fence instruction can
3374 be moved after the release.
3375 - If a fence then same as store atomic, plus no following
3376 associated fence-paired-atomic can be moved before the
3377 fence.
3378 acq_rel Same constraints as both acquire and release.
3379 seq_cst - If a load atomic then same constraints as acquire, plus no
3380 preceding sequentially consistent load atomic/store
3381 atomic/atomicrmw/fence instruction can be moved after the
3382 seq_cst.
3383 - If a store atomic then the same constraints as release, plus
3384 no following sequentially consistent load atomic/store
3385 atomic/atomicrmw/fence instruction can be moved before the
3386 seq_cst.
3387 - If an atomicrmw/fence then same constraints as acq_rel.
3388 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003389
Wei Ding16289cf2017-02-21 18:48:01 +00003390Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003391~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003392
Tony Tyef16a45e2017-06-06 20:31:59 +00003393For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3394(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3395the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003396
Tony Tyef16a45e2017-06-06 20:31:59 +00003397 .. table:: AMDGPU Trap Handler for AMDHSA OS
3398 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003399
Tony Tyef16a45e2017-06-06 20:31:59 +00003400 =================== =============== =============== =======================
3401 Usage Code Sequence Trap Handler Description
3402 Inputs
3403 =================== =============== =============== =======================
3404 reserved ``s_trap 0x00`` Reserved by hardware.
3405 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3406 ``queue_ptr`` ``debugtrap``
3407 ``VGPR0``: intrinsic (not
3408 ``arg`` implemented).
3409 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3410 ``queue_ptr`` terminated and its
3411 associated queue put
3412 into the error state.
3413 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3414 ``queue_ptr`` installed handled
3415 same as ``llvm.trap``.
3416 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3417 breakpoints.
3418 debugger ``s_trap 0x08`` Reserved for debugger.
3419 debugger ``s_trap 0xfe`` Reserved for debugger.
3420 debugger ``s_trap 0xff`` Reserved for debugger.
3421 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003422
Tony Tye46d35762017-08-15 20:47:41 +00003423Unspecified OS
3424--------------
3425
3426This section provides code conventions used when the target triple OS is
3427empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003428
3429Trap Handler ABI
3430~~~~~~~~~~~~~~~~
3431
3432For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3433not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3434instructions are handled as follows:
3435
3436 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3437 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3438
3439 =============== =============== ===========================================
3440 Usage Code Sequence Description
3441 =============== =============== ===========================================
3442 llvm.trap s_endpgm Causes wavefront to be terminated.
3443 llvm.debugtrap *none* Compiler warning given that there is no
3444 trap handler installed.
3445 =============== =============== ===========================================
3446
3447Source Languages
3448================
3449
3450.. _amdgpu-opencl:
3451
3452OpenCL
3453------
3454
3455When generating code for the OpenCL language the target triple environment
3456should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3457
3458When the language is OpenCL the following differences occur:
3459
34601. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
34612. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye46d35762017-08-15 20:47:41 +000034623. Additional metadata is generated
3463 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003464
3465.. TODO
3466 Specify what affect this has. Hidden arguments added. Additional metadata
3467 generated.
3468
3469.. _amdgpu-hcc:
3470
3471HCC
3472---
3473
3474When generating code for the OpenCL language the target triple environment
3475should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3476
3477When the language is OpenCL the following differences occur:
3478
34791. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3480
3481.. TODO
3482 Specify what affect this has.
Tom Stellard3ec09e62016-04-06 01:29:19 +00003483
Tom Stellard45bb48e2015-06-13 03:28:10 +00003484Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00003485---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00003486
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003487AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef16a45e2017-06-06 20:31:59 +00003488It supports AMDGCN GFX6-GFX8.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003489
Tony Tyef16a45e2017-06-06 20:31:59 +00003490This section describes general syntax for instructions and operands. For more
3491information about instructions, their semantics and supported combinations of
3492operands, refer to one of instruction set architecture manuals
Tony Tye46d35762017-08-15 20:47:41 +00003493[AMD-Souther-Islands]_, [AMD-Sea-Islands]_, [AMD-Volcanic-Islands]_ and
3494[AMD-Vega]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003495
Tony Tyef16a45e2017-06-06 20:31:59 +00003496An instruction has the following syntax (register operands are normally
3497comma-separated while extra operands are space-separated):
Tom Stellard45bb48e2015-06-13 03:28:10 +00003498
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003499*<opcode> <register_operand0>, ... <extra_operand0> ...*
Tom Stellard45bb48e2015-06-13 03:28:10 +00003500
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003501Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00003502~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00003503
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003504The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003505
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003506* SGPR registers: s0, ... or s[0], ...
3507* VGPR registers: v0, ... or v[0], ...
3508* TTMP registers: ttmp0, ... or ttmp[0], ...
3509* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3510* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3511* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3512* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3513* Register index expressions: v[2*2], s[1-1:2-1]
3514* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00003515
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003516The following extra operands are supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003517
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003518* offset, offset0, offset1
3519* idxen, offen bits
3520* glc, slc, tfe bits
3521* waitcnt: integer or combination of counter values
3522* VOP3 modifiers:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003523
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003524 - abs (\| \|), neg (\-)
Tom Stellard45bb48e2015-06-13 03:28:10 +00003525
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003526* DPP modifiers:
3527
3528 - row_shl, row_shr, row_ror, row_rol
3529 - row_mirror, row_half_mirror, row_bcast
3530 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3531 - row_mask, bank_mask, bound_ctrl
3532
3533* SDWA modifiers:
3534
3535 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3536 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3537 - abs, neg, sext
3538
Tony Tyef16a45e2017-06-06 20:31:59 +00003539Instruction Examples
3540~~~~~~~~~~~~~~~~~~~~
3541
3542DS
3543~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003544
3545.. code-block:: nasm
3546
3547 ds_add_u32 v2, v4 offset:16
3548 ds_write_src2_b64 v2 offset0:4 offset1:8
3549 ds_cmpst_f32 v2, v4, v6
3550 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3551
3552
3553For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3554
Tony Tyef16a45e2017-06-06 20:31:59 +00003555FLAT
3556++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003557
3558.. code-block:: nasm
3559
3560 flat_load_dword v1, v[3:4]
3561 flat_store_dwordx3 v[3:4], v[5:7]
3562 flat_atomic_swap v1, v[3:4], v5 glc
3563 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3564 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3565
3566For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3567
Tony Tyef16a45e2017-06-06 20:31:59 +00003568MUBUF
3569+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003570
3571.. code-block:: nasm
3572
3573 buffer_load_dword v1, off, s[4:7], s1
3574 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3575 buffer_store_format_xy v[1:2], off, s[4:7], s1
3576 buffer_wbinvl1
3577 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3578
3579For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3580
Tony Tyef16a45e2017-06-06 20:31:59 +00003581SMRD/SMEM
3582+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003583
3584.. code-block:: nasm
3585
3586 s_load_dword s1, s[2:3], 0xfc
3587 s_load_dwordx8 s[8:15], s[2:3], s4
3588 s_load_dwordx16 s[88:103], s[2:3], s4
3589 s_dcache_inv_vol
3590 s_memtime s[4:5]
3591
3592For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3593
Tony Tyef16a45e2017-06-06 20:31:59 +00003594SOP1
3595++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003596
3597.. code-block:: nasm
3598
3599 s_mov_b32 s1, s2
3600 s_mov_b64 s[0:1], 0x80000000
3601 s_cmov_b32 s1, 200
3602 s_wqm_b64 s[2:3], s[4:5]
3603 s_bcnt0_i32_b64 s1, s[2:3]
3604 s_swappc_b64 s[2:3], s[4:5]
3605 s_cbranch_join s[4:5]
3606
3607For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3608
Tony Tyef16a45e2017-06-06 20:31:59 +00003609SOP2
3610++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003611
3612.. code-block:: nasm
3613
3614 s_add_u32 s1, s2, s3
3615 s_and_b64 s[2:3], s[4:5], s[6:7]
3616 s_cselect_b32 s1, s2, s3
3617 s_andn2_b32 s2, s4, s6
3618 s_lshr_b64 s[2:3], s[4:5], s6
3619 s_ashr_i32 s2, s4, s6
3620 s_bfm_b64 s[2:3], s4, s6
3621 s_bfe_i64 s[2:3], s[4:5], s6
3622 s_cbranch_g_fork s[4:5], s[6:7]
3623
3624For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3625
Tony Tyef16a45e2017-06-06 20:31:59 +00003626SOPC
3627++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003628
3629.. code-block:: nasm
3630
3631 s_cmp_eq_i32 s1, s2
3632 s_bitcmp1_b32 s1, s2
3633 s_bitcmp0_b64 s[2:3], s4
3634 s_setvskip s3, s5
3635
3636For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3637
Tony Tyef16a45e2017-06-06 20:31:59 +00003638SOPP
3639++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003640
3641.. code-block:: nasm
3642
3643 s_barrier
3644 s_nop 2
3645 s_endpgm
3646 s_waitcnt 0 ; Wait for all counters to be 0
3647 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3648 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3649 s_sethalt 9
3650 s_sleep 10
3651 s_sendmsg 0x1
3652 s_sendmsg sendmsg(MSG_INTERRUPT)
3653 s_trap 1
3654
3655For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3656
3657Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00003658of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00003659range or acceptable values.
3660
Tony Tyef16a45e2017-06-06 20:31:59 +00003661VALU
3662++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00003663
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003664For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3665the assembler will automatically use optimal encoding based on its operands.
3666To force specific encoding, one can add a suffix to the opcode of the instruction:
3667
3668* _e32 for 32-bit VOP1/VOP2/VOPC
3669* _e64 for 64-bit VOP3
3670* _dpp for VOP_DPP
3671* _sdwa for VOP_SDWA
3672
3673VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003674
3675.. code-block:: nasm
3676
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003677 v_mov_b32 v1, v2
3678 v_mov_b32_e32 v1, v2
3679 v_nop
3680 v_cvt_f64_i32_e32 v[1:2], v2
3681 v_floor_f32_e32 v1, v2
3682 v_bfrev_b32_e32 v1, v2
3683 v_add_f32_e32 v1, v2, v3
3684 v_mul_i32_i24_e64 v1, v2, 3
3685 v_mul_i32_i24_e32 v1, -3, v3
3686 v_mul_i32_i24_e32 v1, -100, v3
3687 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3688 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00003689
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003690VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003691
3692.. code-block:: nasm
3693
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003694 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3695 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3696 v_mov_b32 v0, v0 wave_shl:1
3697 v_mov_b32 v0, v0 row_mirror
3698 v_mov_b32 v0, v0 row_bcast:31
3699 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3700 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3701 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00003702
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003703VOP_SDWA examples:
3704
3705.. code-block:: nasm
3706
3707 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
3708 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
3709 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
3710 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
3711 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
3712
3713For full list of supported instructions, refer to "Vector ALU instructions".
3714
3715HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00003716~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003717
3718AMDGPU ABI defines auxiliary data in output code object. In assembly source,
3719one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00003720
3721.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00003722+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00003723
3724*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003725object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00003726
3727.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00003728+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3729
Tom Stellard347ac792015-06-26 21:15:07 +00003730
3731*major*, *minor*, and *stepping* are all integers that describe the instruction
3732set architecture (ISA) version of the assembly program.
3733
3734*vendor* and *arch* are quoted strings. *vendor* should always be equal to
3735"AMD" and *arch* should always be equal to "AMDGPU".
3736
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003737By default, the assembler will derive the ISA version, *vendor*, and *arch*
3738from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00003739
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003740.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00003741+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003742
3743This directives specifies that the symbol with given name is a kernel entry point
3744(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00003745
3746.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00003747++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00003748
3749This directive marks the beginning of a list of key / value pairs that are used
3750to specify the amd_kernel_code_t object that will be emitted by the assembler.
3751The list must be terminated by the *.end_amd_kernel_code_t* directive. For
3752any amd_kernel_code_t values that are unspecified a default value will be
3753used. The default value for all keys is 0, with the following exceptions:
3754
3755- *kernel_code_version_major* defaults to 1.
3756- *machine_kind* defaults to 1.
3757- *machine_version_major*, *machine_version_minor*, and
3758 *machine_version_stepping* are derived from the value of the -mcpu option
3759 that is passed to the assembler.
3760- *kernel_code_entry_byte_offset* defaults to 256.
3761- *wavefront_size* defaults to 6.
3762- *kernarg_segment_alignment*, *group_segment_alignment*, and
3763 *private_segment_alignment* default to 4. Note that alignments are specified
3764 as a power of two, so a value of **n** means an alignment of 2^ **n**.
3765
3766The *.amd_kernel_code_t* directive must be placed immediately after the
3767function label and before any instructions.
3768
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003769For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
3770comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00003771
3772Here is an example of a minimal amd_kernel_code_t specification:
3773
Aaron Ballman887ad0e2016-07-19 17:46:55 +00003774.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00003775
3776 .hsa_code_object_version 1,0
3777 .hsa_code_object_isa
3778
Tom Stellardb8a91bb2016-02-22 18:36:00 +00003779 .hsatext
3780 .globl hello_world
3781 .p2align 8
3782 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00003783
3784 hello_world:
3785
3786 .amd_kernel_code_t
3787 enable_sgpr_kernarg_segment_ptr = 1
3788 is_ptr64 = 1
3789 compute_pgm_rsrc1_vgprs = 0
3790 compute_pgm_rsrc1_sgprs = 0
3791 compute_pgm_rsrc2_user_sgpr = 2
3792 kernarg_segment_byte_size = 8
3793 wavefront_sgpr_count = 2
3794 workitem_vgpr_count = 3
3795 .end_amd_kernel_code_t
3796
3797 s_load_dwordx2 s[0:1], s[0:1] 0x0
3798 v_mov_b32 v0, 3.14159
3799 s_waitcnt lgkmcnt(0)
3800 v_mov_b32 v1, s0
3801 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00003802 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00003803 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00003804 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00003805 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00003806
3807Additional Documentation
3808========================
3809
3810.. [AMD-R6xx] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
3811.. [AMD-R7xx] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
3812.. [AMD-Evergreen] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
3813.. [AMD-Cayman-Trinity] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
3814.. [AMD-Souther-Islands] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
3815.. [AMD-Sea-Islands] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
3816.. [AMD-Volcanic-Islands] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
Tony Tye46d35762017-08-15 20:47:41 +00003817.. [AMD-Vega] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00003818.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
3819.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
3820.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
3821.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
3822.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
3823.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
3824.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
3825.. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
3826.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
3827.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
3828.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__