blob: f6dcfc6b579ff08e52c0911b9244f7474bba9ed0 [file] [log] [blame]
Tony Tyef16a45e2017-06-06 20:31:59 +00001=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
26 .. table:: AMDGPU Target Triples
27 :name: amdgpu-target-triples-table
28
29 ============ ======== ========= ===========
30 Architecture Vendor OS Environment
31 ============ ======== ========= ===========
32 r600 amd <empty> <empty>
33 amdgcn amd <empty> <empty>
34 amdgcn amd amdhsa <empty>
35 amdgcn amd amdhsa opencl
36 amdgcn amd amdhsa amdgizcl
37 amdgcn amd amdhsa amdgiz
38 amdgcn amd amdhsa hcc
39 ============ ======== ========= ===========
40
41``r600-amd--``
42 Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
43 the MESA runtime.
44
45``amdgcn-amd--``
46 Supports AMD GPUs GCN 6 onwards for graphics and compute shaders executed on
47 the MESA runtime.
48
49``amdgcn-amd-amdhsa-``
50 Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
51 compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
52
53``amdgcn-amd-amdhsa-opencl``
54 Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
55 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
56 :ref:`amdgpu-opencl`.
57
58``amdgcn-amd-amdhsa-amdgizcl``
59 Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
60 is used (see :ref:`amdgpu-address-spaces`).
61
62``amdgcn-amd-amdhsa-amdgiz``
63 Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
64 used (see :ref:`amdgpu-address-spaces`).
65
66``amdgcn-amd-amdhsa-hcc``
67 Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
68 executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
69 :ref:`amdgpu-hcc`.
70
71.. _amdgpu-processors:
72
73Processors
74----------
75
76Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
77names from both the *Processor* and *Alternative Processor* can be used.
78
79 .. table:: AMDGPU Processors
80 :name: amdgpu-processors-table
81
82 ========== =========== ============ ===== ======= ==================
83 Processor Alternative Target dGPU/ Runtime Example
84 Processor Triple APU Support Products
85 Architecture
86 ========== =========== ============ ===== ======= ==================
87 **R600** [AMD-R6xx]_
88 --------------------------------------------------------------------
89 r600 r600 dGPU
90 r630 r600 dGPU
91 rs880 r600 dGPU
92 rv670 r600 dGPU
93 **R700** [AMD-R7xx]_
94 --------------------------------------------------------------------
95 rv710 r600 dGPU
96 rv730 r600 dGPU
97 rv770 r600 dGPU
98 **Evergreen** [AMD-Evergreen]_
99 --------------------------------------------------------------------
100 cedar r600 dGPU
101 redwood r600 dGPU
102 sumo r600 dGPU
103 juniper r600 dGPU
104 cypress r600 dGPU
105 **Northern Islands** [AMD-Cayman-Trinity]_
106 --------------------------------------------------------------------
107 barts r600 dGPU
108 turks r600 dGPU
109 caicos r600 dGPU
110 cayman r600 dGPU
111 **GCN GFX6 (Southern Islands (SI))** [AMD-Souther-Islands]_
112 --------------------------------------------------------------------
113 gfx600 - SI amdgcn dGPU
114 - tahiti
115 gfx601 - pitcairn amdgcn dGPU
116 - verde
117 - oland
118 - hainan
119 **GCN GFX7 (Sea Islands (CI))** [AMD-Sea-Islands]_
120 --------------------------------------------------------------------
121 gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
122 - Radeon HD 8770
123 - R7 260
124 - R7 260X
125 \ - kaveri amdgcn APU - A6-7000
126 - A6 Pro-7050B
127 - A8-7100
128 - A8 Pro-7150B
129 - A10-7300
130 - A10 Pro-7350B
131 - FX-7500
132 - A8-7200P
133 - A10-7400P
134 - FX-7600P
135 gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
136 - FirePro W9100
137 - FirePro S9150
138 - FirePro S9170
139 \ dGPU ROCm - Radeon R9 290
140 - Radeon R9 290x
141 - Radeon R390
142 - Radeonb R390x
143 gfx702 - kabini amdgcn APU - E1-2100
144 - mullins - E1-2200
145 - E1-2500
146 - E2-3000
147 - E2-3800
148 - A4-5000
149 - A4-5100
150 - A6-5200
151 - A4 Pro-3340B
152 **GCN GFX8 (Volcanic Islands (VI))** [AMD-Volcanic-Islands]_
153 --------------------------------------------------------------------
154 gfx800 - iceland amdgcn dGPU - FirePro S7150
155 - FirePro S7100
156 - FirePro W7100
157 - Radeon R285
158 - Radeon R9 380
159 - Radeon R9 385
160 - Mobile FirePro
161 M7170
162 gfx801 - carrizo amdgcn APU - A6-8500P
163 - Pro A6-8500B
164 - A8-8600P
165 - Pro A8-8600B
166 - FX-8800P
167 - Pro A12-8800B
168 \ amdgcn APU ROCm - A10-8700P
169 - Pro A10-8700B
170 - A10-8780P
171 \ amdgcn APU - A10-9600P
172 - A10-9630P
173 - A12-9700P
174 - A12-9730P
175 - FX-9800P
176 - FX-9830P
177 \ amdgcn APU - E2-9010
178 - A6-9210
179 - A9-9410
180 gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
181 gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
182 - Radeon R9 Fury
183 - Radeon R9 FuryX
184 - Radeon Pro Duo
185 - FirePro S9300x2
186 \ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
187 - Radeon RX 480
188 \ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
189 gfx804 amdgcn dGPU Same as gfx803
190 gfx810 - stoney amdgcn APU
191 **GCN GFX9**
192 --------------------------------------------------------------------
193 gfx900 amdgcn dGPU - FirePro W9500
194 - FirePro S9500
195 - FirePro S9500x2
196 gfx901 amdgcn dGPU ROCm Same as gfx900
197 except XNACK is
198 enabled
199 gfx902 amdgcn APU *TBA*
200
201 .. TODO
202 Add product
203 names.
204 gfx903 amdgcn APU Same as gfx902
205 except XNACK is
206 enabled
207 ========== =========== ============ ===== ======= ==================
208
209.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000210
211Address Spaces
212--------------
213
Tony Tyef16a45e2017-06-06 20:31:59 +0000214The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000215
Tony Tyef16a45e2017-06-06 20:31:59 +0000216The memory space names used in the table, aside from the region memory space, is
217from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000218
Tony Tyef16a45e2017-06-06 20:31:59 +0000219LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000220
Tony Tyef16a45e2017-06-06 20:31:59 +0000221 .. table:: Address Space Mapping
222 :name: amdgpu-address-space-mapping-table
223
224 ================== ================= ================= ================= =================
225 LLVM Address Space Memory Space
226 ------------------ -----------------------------------------------------------------------
227 \ Current Default amdgiz/amdgizcl hcc Future Default
228 ================== ================= ================= ================= =================
229 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
230 1 Global Global Global Global
231 2 Constant Constant Constant Region (GDS)
232 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
233 4 Generic (Flat) Region (GDS) Region (GDS) Constant
234 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
235 ================== ================= ================= ================= =================
236
237Current Default
238 This is the current default address space mapping used for all languages
239 except hcc. This will shortly be deprecated.
240
241amdgiz/amdgizcl
242 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
243 is specified as the target triple environment value.
244
245hcc
246 This is the current address space mapping used when ``hcc`` is specified as
247 the target triple environment value.This will shortly be deprecated.
248
249Future Default
250 This will shortly be the only address space mapping for all languages using
251 AMDGPU backend.
252
253.. _amdgpu-memory-scopes:
254
255Memory Scopes
256-------------
257
258This section provides LLVM memory synchronization scopes supported by the AMDGPU
259backend memory model when the target triple OS is ``amdhsa`` (see
260:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
261
262The memory model supported is based on the HSA memory model [HSA]_ which is
263based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
264relation is transitive over the synchonizes-with relation independent of scope,
265and synchonizes-with allows the memory scope instances to be inclusive (see
266table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
267
268This is different to the OpenCL [OpenCL]_ memory model which does not have scope
269inclusion and requires the memory scopes to exactly match. However, this
270is conservatively correct for OpenCL.
271
272 .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
273 :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
274
275 ================ ==========================================================
276 LLVM Sync Scope Description
277 ================ ==========================================================
278 *none* The default: ``system``.
279
280 Synchronizes with, and participates in modification and
281 seq_cst total orderings with, other operations (except
282 image operations) for all address spaces (except private,
283 or generic that accesses private) provided the other
284 operation's sync scope is:
285
286 - ``system``.
287 - ``agent`` and executed by a thread on the same agent.
288 - ``workgroup`` and executed by a thread in the same
289 workgroup.
290 - ``wavefront`` and executed by a thread in the same
291 wavefront.
292
293 ``agent`` Synchronizes with, and participates in modification and
294 seq_cst total orderings with, other operations (except
295 image operations) for all address spaces (except private,
296 or generic that accesses private) provided the other
297 operation's sync scope is:
298
299 - ``system`` or ``agent`` and executed by a thread on the
300 same agent.
301 - ``workgroup`` and executed by a thread in the same
302 workgroup.
303 - ``wavefront`` and executed by a thread in the same
304 wavefront.
305
306 ``workgroup`` Synchronizes with, and participates in modification and
307 seq_cst total orderings with, other operations (except
308 image operations) for all address spaces (except private,
309 or generic that accesses private) provided the other
310 operation's sync scope is:
311
312 - ``system``, ``agent`` or ``workgroup`` and executed by a
313 thread in the same workgroup.
314 - ``wavefront`` and executed by a thread in the same
315 wavefront.
316
317 ``wavefront`` Synchronizes with, and participates in modification and
318 seq_cst total orderings with, other operations (except
319 image operations) for all address spaces (except private,
320 or generic that accesses private) provided the other
321 operation's sync scope is:
322
323 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
324 and executed by a thread in the same wavefront.
325
326 ``singlethread`` Only synchronizes with, and participates in modification
327 and seq_cst total orderings with, other operations (except
328 image operations) running in the same thread for all
329 address spaces (for example, in signal handlers).
330 ================ ==========================================================
331
332AMDGPU Intrinsics
333-----------------
334
335The AMDGPU backend implements the following intrinsics.
336
337*This section is WIP.*
338
339.. TODO
340 List AMDGPU intrinsics
341
342Code Object
343===========
344
345The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
346can be linked by ``lld`` to produce a standard ELF shared code object which can
347be loaded and executed on an AMDGPU target.
348
349Header
350------
351
352The AMDGPU backend uses the following ELF header:
353
354 .. table:: AMDGPU ELF Header
355 :name: amdgpu-elf-header-table
356
357 ========================== =========================
358 Field Value
359 ========================== =========================
360 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
361 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
362 ``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``
363 ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``
364 ``e_type`` ``ET_REL`` or ``ET_DYN``
365 ``e_machine`` ``EM_AMDGPU``
366 ``e_entry`` 0
367 ``e_flags`` 0
368 ========================== =========================
369
370..
371
372 .. table:: AMDGPU ELF Header Enumeration Values
373 :name: amdgpu-elf-header-enumeration-values-table
374
375 ============================ =====
376 Name Value
377 ============================ =====
378 ``EM_AMDGPU`` 224
379 ``ELFOSABI_AMDGPU_HSA`` 64
380 ``ELFABIVERSION_AMDGPU_HSA`` 1
381 ============================ =====
382
383``e_ident[EI_CLASS]``
384 The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64 bit
385 applications.
386
387``e_ident[EI_DATA]``
388 All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
389
390``e_ident[EI_OSABI]``
391 The AMD GPU architecture specific OS ABI of ``ELFOSABI_AMDGPU_HSA`` is used to
392 specify that the code object conforms to the AMD HSA runtime ABI [HSA]_.
393
394``e_ident[EI_ABIVERSION]``
395 The AMD GPU architecture specific OS ABI version of
396 ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA runtime
397 ABI to which the code object conforms.
398
399``e_type``
400 Can be one of the following values:
401
402
403 ``ET_REL``
404 The type produced by the AMD GPU backend compiler as it is relocatable code
405 object.
406
407 ``ET_DYN``
408 The type produced by the linker as it is a shared code object.
409
410 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
411
412``e_machine``
413 The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
414 architecture family. The specific member is specified in the
415 ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
416 :ref:`amdgpu-note-records`).
417
418``e_entry``
419 The entry point is 0 as the entry points for individual kernels must be
420 selected in order to invoke them through AQL packets.
421
422``e_flags``
423 The value is 0 as no flags are used.
424
425Sections
426--------
427
428An AMDGPU target ELF code object has the standard ELF sections which include:
429
430 .. table:: AMDGPU ELF Sections
431 :name: amdgpu-elf-sections-table
432
433 ================== ================ =================================
434 Name Type Attributes
435 ================== ================ =================================
436 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
437 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
438 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
439 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
440 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
441 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
442 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
443 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
444 ``.note`` ``SHT_NOTE`` *none*
445 ``.rela``\ *name* ``SHT_RELA`` *none*
446 ``.rela.dyn`` ``SHT_RELA`` *none*
447 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
448 ``.shstrtab`` ``SHT_STRTAB`` *none*
449 ``.strtab`` ``SHT_STRTAB`` *none*
450 ``.symtab`` ``SHT_SYMTAB`` *none*
451 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
452 ================== ================ =================================
453
454These sections have their standard meanings (see [ELF]_) and are only generated
455if needed.
456
457``.debug``\ *\**
458 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
459 DWARF produced by the AMDGPU backend.
460
461``.dynamic``, ``.dynstr``, ``.dynstr``, ``.hash``
462 The standard sections used by a dynamic loader.
463
464``.note``
465 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
466 backend.
467
468``.rela``\ *name*, ``.rela.dyn``
469 For relocatable code objects, *name* is the name of the section that the
470 relocation records apply. For example, ``.rela.text`` is the section name for
471 relocation records associated with the ``.text`` section.
472
473 For linked shared code objects, ``.rela.dyn`` contains all the relocation
474 records from each of the relocatable code object's ``.rela``\ *name* sections.
475
476 See :ref:`amdgpu-relocation-records` for the relocation records supported by
477 the AMDGPU backend.
478
479``.text``
480 The executable machine code for the kernels and functions they call. Generated
481 as position independent code. See :ref:`amdgpu-code-conventions` for
482 information on conventions used in the isa generation.
483
484.. _amdgpu-note-records:
485
486Note Records
487------------
488
489As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
490the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
491minimal zero byte padding must be generated to ensure the ``desc`` field size is
492a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
493be at least 4 to indicate at least 8 byte alignment.
494
495The AMDGPU backend code object uses the following ELF note records in the
496``.note`` section. The *Description* column specifies the layout of the note
497record’s ``desc`` field. All fields are consecutive bytes. Note records with
498variable size strings have a corresponding ``*_size`` field that specifies the
499number of bytes, including the terminating null character, in the string. The
500string(s) come immediately after the preceding fields.
501
502Additional note records can be present.
503
504 .. table:: AMDGPU ELF Note Records
505 :name: amdgpu-elf-note-records-table
506
507 ===== ========================== ==========================================
508 Name Type Description
509 ===== ========================== ==========================================
510 "AMD" ``NT_AMD_AMDGPU_METADATA`` <metadata null terminated string>
511 "AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
512 ===== ========================== ==========================================
513
514..
515
516 .. table:: AMDGPU ELF Note Record Enumeration Values
517 :name: amdgpu-elf-note-record-enumeration-values-table
518
519 ============================= =====
520 Name Value
521 ============================= =====
522 *reserved* 0-9
523 ``NT_AMD_AMDGPU_METADATA`` 10
524 ``NT_AMD_AMDGPU_ISA`` 11
525 ============================= =====
526
527``NT_AMD_AMDGPU_ISA``
528 Specifies the instruction set architecture used by the machine code contained
529 in the code object.
530
531 This note record is required for code objects containing machine code for
532 processors matching the ``amdgcn`` architecture in table
533 :ref:`amdgpu-processors`.
534
535 The null terminated string has the following syntax:
536
537 *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
538
539 where:
540
541 *architecture*
542 The architecture from table :ref:`amdgpu-target-triples-table`.
543
544 This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
545 :ref:`amdgpu-target-triples`).
546
547 *vendor*
548 The vendor from table :ref:`amdgpu-target-triples-table`.
549
550 For the AMDGPU backend this is always ``amd``.
551
552 *os*
553 The OS from table :ref:`amdgpu-target-triples-table`.
554
555 *environment*
556 An environment from table :ref:`amdgpu-target-triples-table`, or blank if
557 the environment has no affect on the execution of the code object.
558
559 For the AMDGPU backend this is currently always blank.
560 *processor*
561 The processor from table :ref:`amdgpu-processors-table`.
562
563 For example:
564
565 ``amdgcn-amd-amdhsa--gfx901``
566
567``NT_AMD_AMDGPU_METADATA``
568 Specifies extensible metadata associated with the code object. See
569 :ref:`amdgpu-code-object-metadata` for the syntax of the code object metadata
570 string.
571
572 This note record is required and must contain the minimum information
573 necessary to support the ROCM kernel queries. For example, the segment sizes
574 needed in a dispatch packet. In addition, a high level language runtime may
575 require other information to be included. For example, the AMD OpenCL runtime
576 records kernel argument information.
577
578 .. TODO
579 Is the string null terminated? It probably should not if YAML allows it to
580 contain null characters, otherwise it should be.
581
582.. _amdgpu-code-object-metadata:
583
584Code Object Metadata
585--------------------
586
587The code object metadata is specified by the ``NT_AMD_AMDHSA_METADATA`` note
588record (see :ref:`amdgpu-note-records`).
589
590The metadata is specified as a YAML formated string (see [YAML]_ and
591:doc:`YamlIO`).
592
593The metadata is represented as a single YAML document comprised of the mapping
594defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
595referenced tables.
596
597For boolean values, the string values of ``false`` and ``true`` are used for
598false and true respectively.
599
600Additional information can be added to the mappings. To avoid conflicts, any
601non-AMD key names should be prefixed by "*vendor-name*.".
602
603 .. table:: AMDHSA Code Object Metadata Mapping
604 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
605
606 ========== ============== ========= =======================================
607 String Key Value Type Required? Description
608 ========== ============== ========= =======================================
609 "Version" sequence of Required - The first integer is the major
610 2 integers version. Currently 1.
611 - The second integer is the minor
612 version. Currently 0.
613 "Printf" sequence of Each string is encoded information
614 strings about a printf function call. The
615 encoded information is organized as
616 fields separated by colon (':'):
617
618 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
619
620 where:
621
622 ``ID``
623 A 32 bit integer as a unique id for
624 each printf function call
625
626 ``N``
627 A 32 bit integer equal to the number
628 of arguments of printf function call
629 minus 1
630
631 ``S[i]`` (where i = 0, 1, ... , N-1)
632 32 bit integers for the size in bytes
633 of the i-th FormatString argument of
634 the printf function call
635
636 FormatString
637 The format string passed to the
638 printf function call.
639 "Kernels" sequence of Required Sequence of the mappings for each
640 mapping kernel in the code object. See
641 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
642 for the definition of the mapping.
643 ========== ============== ========= =======================================
644
645..
646
647 .. table:: AMDHSA Code Object Kernel Metadata Mapping
648 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
649
650 ================= ============== ========= ================================
651 String Key Value Type Required? Description
652 ================= ============== ========= ================================
653 "Name" string Required Source name of the kernel.
654 "SymbolName" string Required Name of the kernel
655 descriptor ELF symbol.
656 "Language" string Source language of the kernel.
657 Values include:
658
659 - "OpenCL C"
660 - "OpenCL C++"
661 - "HCC"
662 - "OpenMP"
663
664 "LanguageVersion" sequence of - The first integer is the major
665 2 integers version.
666 - The second integer is the
667 minor version.
668 "Attrs" mapping Mapping of kernel attributes.
669 See
670 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
671 for the mapping definition.
672 "Arguments" sequence of Sequence of mappings of the
673 mapping kernel arguments. See
674 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
675 for the definition of the mapping.
676 "CodeProps" mapping Mapping of properties related to
677 the kernel code. See
678 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
679 for the mapping definition.
680 "DebugProps" mapping Mapping of properties related to
681 the kernel debugging. See
682 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
683 for the mapping definition.
684 ================= ============== ========= ================================
685
686..
687
688 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
689 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
690
691 =================== ============== ========= ==============================
692 String Key Value Type Required? Description
693 =================== ============== ========= ==============================
694 "ReqdWorkGroupSize" sequence of The dispatch work-group size
695 3 integers X, Y, Z must correspond to the
696 specified values.
697
698 Corresponds to the OpenCL
699 ``reqd_work_group_size``
700 attribute.
701 "WorkGroupSizeHint" sequence of The dispatch work-group size
702 3 integers X, Y, Z is likely to be the
703 specified values.
704
705 Corresponds to the OpenCL
706 ``work_group_size_hint``
707 attribute.
708 "VecTypeHint" string The name of a scalar or vector
709 type.
710
711 Corresponds to the OpenCL
712 ``vec_type_hint`` attribute.
713 =================== ============== ========= ==============================
714
715..
716
717 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
718 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
719
720 ================= ============== ========= ================================
721 String Key Value Type Required? Description
722 ================= ============== ========= ================================
723 "Name" string Kernel argument name.
724 "TypeName" string Kernel argument type name.
725 "Size" integer Required Kernel argument size in bytes.
726 "Align" integer Required Kernel argument alignment in
727 bytes. Must be a power of two.
728 "ValueKind" string Required Kernel argument kind that
729 specifies how to set up the
730 corresponding argument.
731 Values include:
732
733 "ByValue"
734 The argument is copied
735 directly into the kernarg.
736
737 "GlobalBuffer"
738 A global address space pointer
739 to the buffer data is passed
740 in the kernarg.
741
742 "DynamicSharedPointer"
743 A group address space pointer
744 to dynamically allocated LDS
745 is passed in the kernarg.
746
747 "Sampler"
748 A global address space
749 pointer to a S# is passed in
750 the kernarg.
751
752 "Image"
753 A global address space
754 pointer to a T# is passed in
755 the kernarg.
756
757 "Pipe"
758 A global address space pointer
759 to an OpenCL pipe is passed in
760 the kernarg.
761
762 "Queue"
763 A global address space pointer
764 to an OpenCL device enqueue
765 queue is passed in the
766 kernarg.
767
768 "HiddenGlobalOffsetX"
769 The OpenCL grid dispatch
770 global offset for the X
771 dimension is passed in the
772 kernarg.
773
774 "HiddenGlobalOffsetY"
775 The OpenCL grid dispatch
776 global offset for the Y
777 dimension is passed in the
778 kernarg.
779
780 "HiddenGlobalOffsetZ"
781 The OpenCL grid dispatch
782 global offset for the Z
783 dimension is passed in the
784 kernarg.
785
786 "HiddenNone"
787 An argument that is not used
788 by the kernel. Space needs to
789 be left for it, but it does
790 not need to be set up.
791
792 "HiddenPrintfBuffer"
793 A global address space pointer
794 to the runtime printf buffer
795 is passed in kernarg.
796
797 "HiddenDefaultQueue"
798 A global address space pointer
799 to the OpenCL device enqueue
800 queue that should be used by
801 the kernel by default is
802 passed in the kernarg.
803
804 "HiddenCompletionAction"
805 *TBD*
806
807 .. TODO
808 Add description.
809
810 "ValueType" string Required Kernel argument value type. Only
811 present if "ValueKind" is
812 "ByValue". For vector data
813 types, the value is for the
814 element type. Values include:
815
816 - "Struct"
817 - "I8"
818 - "U8"
819 - "I16"
820 - "U16"
821 - "F16"
822 - "I32"
823 - "U32"
824 - "F32"
825 - "I64"
826 - "U64"
827 - "F64"
828
829 .. TODO
830 How can it be determined if a
831 vector type, and what size
832 vector?
833 "PointeeAlign" integer Alignment in bytes of pointee
834 type for pointer type kernel
835 argument. Must be a power
836 of 2. Only present if
837 "ValueKind" is
838 "DynamicSharedPointer".
839 "AddrSpaceQual" string Kernel argument address space
840 qualifier. Only present if
841 "ValueKind" is "GlobalBuffer" or
842 "DynamicSharedPointer". Values
843 are:
844
845 - "Private"
846 - "Global"
847 - "Constant"
848 - "Local"
849 - "Generic"
850 - "Region"
851
852 .. TODO
853 Is GlobalBuffer only Global
854 or Constant? Is
855 DynamicSharedPointer always
856 Local? Can HCC allow Generic?
857 How can Private or Region
858 ever happen?
859 "AccQual" string Kernel argument access
860 qualifier. Only present if
861 "ValueKind" is "Image" or
862 "Pipe". Values
863 are:
864
865 - "ReadOnly"
866 - "WriteOnly"
867 - "ReadWrite"
868
869 .. TODO
870 Does this apply to
871 GlobalBuffer?
872 "ActualAcc" string The actual memory accesses
873 performed by the kernel on the
874 kernel argument. Only present if
875 "ValueKind" is "GlobalBuffer",
876 "Image", or "Pipe". This may be
877 more restrictive than indicated
878 by "AccQual" to reflect what the
879 kernel actual does. If not
880 present then the runtime must
881 assume what is implied by
882 "AccQual" and "IsConst". Values
883 are:
884
885 - "ReadOnly"
886 - "WriteOnly"
887 - "ReadWrite"
888
889 "IsConst" boolean Indicates if the kernel argument
890 is const qualified. Only present
891 if "ValueKind" is
892 "GlobalBuffer".
893
894 "IsRestrict" boolean Indicates if the kernel argument
895 is restrict qualified. Only
896 present if "ValueKind" is
897 "GlobalBuffer".
898
899 "IsVolatile" boolean Indicates if the kernel argument
900 is volatile qualified. Only
901 present if "ValueKind" is
902 "GlobalBuffer".
903
904 "IsPipe" boolean Indicates if the kernel argument
905 is pipe qualified. Only present
906 if "ValueKind" is "Pipe".
907
908 .. TODO
909 Can GlobalBuffer be pipe
910 qualified?
911 ================= ============== ========= ================================
912
913..
914
915 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
916 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
917
918 ============================ ============== ========= =====================
919 String Key Value Type Required? Description
920 ============================ ============== ========= =====================
921 "KernargSegmentSize" integer Required The size in bytes of
922 the kernarg segment
923 that holds the values
924 of the arguments to
925 the kernel.
926 "GroupSegmentFixedSize" integer Required The amount of group
927 segment memory
928 required by a
929 work-group in
930 bytes. This does not
931 include any
932 dynamically allocated
933 group segment memory
934 that may be added
935 when the kernel is
936 dispatched.
937 "PrivateSegmentFixedSize" integer Required The amount of fixed
938 private address space
939 memory required for a
940 work-item in
941 bytes. If
942 IsDynamicCallstack
943 is 1 then additional
944 space must be added
945 to this value for the
946 call stack.
947 "KernargSegmentAlign" integer Required The maximum byte
948 alignment of
949 arguments in the
950 kernarg segment. Must
951 be a power of 2.
952 "WavefrontSize" integer Required Wavefront size. Must
953 be a power of 2.
954 "NumSGPRs" integer Number of scalar
955 registers used by a
956 wavefront for
957 GFX6-GFX9. This
958 includes the special
959 SGPRs for VCC, Flat
960 Scratch (GFX7-GFX9)
961 and XNACK (for
962 GFX8-GFX9). It does
963 not include the 16
964 SGPR added if a trap
965 handler is
966 enabled. It is not
967 rounded up to the
968 allocation
969 granularity.
970 "NumVGPRs" integer Number of vector
971 registers used by
972 each work-item for
973 GFX6-GFX9
974 "MaxFlatWorkgroupSize" integer Maximum flat
975 work-group size
976 supported by the
977 kernel in work-items.
978 "IsDynamicCallStack" boolean Indicates if the
979 generated machine
980 code is using a
981 dynamically sized
982 call stack.
983 "IsXNACKEnabled" boolean Indicates if the
984 generated machine
985 code is capable of
986 supporting XNACK.
987 ============================ ============== ========= =====================
988
989..
990
991 .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
992 :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
993
994 =================================== ============== ========= ==============
995 String Key Value Type Required? Description
996 =================================== ============== ========= ==============
997 "DebuggerABIVersion" string
998 "ReservedNumVGPRs" integer
999 "ReservedFirstVGPR" integer
1000 "PrivateSegmentBufferSGPR" integer
1001 "WavefrontPrivateSegmentOffsetSGPR" integer
1002 =================================== ============== ========= ==============
1003
1004.. TODO
1005 Plan to remove the debug properties metadata.
1006
1007.. _amdgpu-symbols:
1008
1009Symbols
1010-------
1011
1012Symbols include the following:
1013
1014 .. table:: AMDGPU ELF Symbols
1015 :name: amdgpu-elf-symbols-table
1016
1017 ===================== ============== ============= ==================
1018 Name Type Section Description
1019 ===================== ============== ============= ==================
1020 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
1021 - ``.rodata``
1022 - ``.bss``
1023 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
1024 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
1025 ===================== ============== ============= ==================
1026
1027Global variable
1028 Global variables both used and defined by the compilation unit.
1029
1030 If the symbol is defined in the compilation unit then it is allocated in the
1031 appropriate section according to if it has initialized data or is readonly.
1032
1033 If the symbol is external then its section is ``STN_UNDEF`` and the loader
1034 will resolve relocations using the defintion provided by another code object
1035 or explicitly defined by the runtime.
1036
1037 All global symbols, whether defined in the compilation unit or external, are
1038 accessed by the machine code indirectly throught a GOT table entry. This
1039 allows them to be preemptable. The GOT table is only supported when the target
1040 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
1041
1042 .. TODO
1043 Add description of linked shared object symbols. Seems undefined symbols
1044 are marked as STT_NOTYPE.
1045
1046Kernel descriptor
1047 Every HSA kernel has an associated kernel descriptor. It is the address of the
1048 kernel descriptor that is used in the AQL dispatch packet used to invoke the
1049 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1050 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1051
1052Kernel entry point
1053 Every HSA kernel also has a symbol for its machine code entry point.
1054
1055.. _amdgpu-relocation-records:
1056
1057Relocation Records
1058------------------
1059
1060AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1061relocatable fields are:
1062
1063``word32``
1064 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1065 alignment. These values use the same byte order as other word values in the
1066 AMD GPU architecture.
1067
1068``word64``
1069 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1070 alignment. These values use the same byte order as other word values in the
1071 AMD GPU architecture.
1072
1073Following notations are used for specifying relocation calculations:
1074
1075**A**
1076 Represents the addend used to compute the value of the relocatable field.
1077
1078**G**
1079 Represents the offset into the global offset table at which the relocation
1080 entry’s symbol will reside during execution.
1081
1082**GOT**
1083 Represents the address of the global offset table.
1084
1085**P**
1086 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
1087 of the storage unit being relocated (computed using ``r_offset``).
1088
1089**S**
1090 Represents the value of the symbol whose index resides in the relocation
1091 entry.
1092
1093The following relocation types are supported:
1094
1095 .. table:: AMDGPU ELF Relocation Records
1096 :name: amdgpu-elf-relocation-records-table
1097
1098 ========================== ===== ========== ==============================
1099 Relocation Type Value Field Calculation
1100 ========================== ===== ========== ==============================
1101 ``R_AMDGPU_NONE`` 0 *none* *none*
1102 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
1103 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
1104 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
1105 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
1106 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
1107 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
1108 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
1109 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
1110 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
1111 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
1112 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
1113 ========================== ===== ========== ==============================
1114
1115.. _amdgpu-dwarf:
1116
1117DWARF
1118-----
1119
1120Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
1121information that maps the code object executable code and data to the source
1122language constructs. It can be used by tools such as debuggers and profilers.
1123
1124Address Space Mapping
1125~~~~~~~~~~~~~~~~~~~~~
1126
1127The following address space mapping is used:
1128
1129 .. table:: AMDGPU DWARF Address Space Mapping
1130 :name: amdgpu-dwarf-address-space-mapping-table
1131
1132 =================== =================
1133 DWARF Address Space Memory Space
1134 =================== =================
1135 1 Private (Scratch)
1136 2 Local (group/LDS)
1137 *omitted* Global
1138 *omitted* Constant
1139 *omitted* Generic (Flat)
1140 *not supported* Region (GDS)
1141 =================== =================
1142
1143See :ref:`amdgpu-address-spaces` for infomration on the memory space terminology
1144used in the table.
1145
1146An ``address_class`` attribute is generated on pointer type DIEs to specify the
1147DWARF address space of the value of the pointer when it is in the *private* or
1148*local* address space. Otherwise the attribute is omitted.
1149
1150An ``XDEREF`` operation is generated in location list expressions for variables
1151that are allocated in the *private* and *local* address space. Otherwise no
1152``XDREF`` is omitted.
1153
1154Register Mapping
1155~~~~~~~~~~~~~~~~
1156
1157*This section is WIP.*
1158
1159.. TODO
1160 Define DWARF register enumeration.
1161
1162 If want to present a wavefront state then should expose vector registers as
1163 64 wide (rather than per work-item view that LLVM uses). Either as seperate
1164 registers, or a 64x4 byte single register. In either case use a new LANE op
1165 (akin to XDREF) to select the current lane usage in a location
1166 expression. This would also allow scalar register spilling to vector register
1167 lanes to be expressed (currently no debug information is being generated for
1168 spilling). If choose a wide single register approach then use LANE in
1169 conjunction with PIECE operation to select the dword part of the register for
1170 the current lane. If the separate register approach then use LANE to select
1171 the register.
1172
1173Source Text
1174~~~~~~~~~~~
1175
1176*This section is WIP.*
1177
1178.. TODO
1179 DWARF extension to include runtime generated source text.
1180
1181.. _amdgpu-code-conventions:
1182
1183Code Conventions
1184================
1185
1186AMDHSA
1187------
1188
1189This section provides code conventions used when the target triple OS is
1190``amdhsa`` (see :ref:`amdgpu-target-triples`).
1191
1192Kernel Dispatch
1193~~~~~~~~~~~~~~~
1194
1195The HSA architected queuing language (AQL) defines a user space memory interface
1196that can be used to control the dispatch of kernels, in an agent independent
1197way. An agent can have zero or more AQL queues created for it using the ROCm
1198runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1199*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1200mechanics and packet layouts.
1201
1202The packet processor of a kernel agent is responsible for detecting and
1203dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1204packet processor is implemented by the hardware command processor (CP),
1205asynchronous dispatch controller (ADC) and shader processor input controller
1206(SPI).
1207
1208The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1209mode driver to initialize and register the AQL queue with CP.
1210
1211To dispatch a kernel the following actions are performed. This can occur in the
1212CPU host program, or from an HSA kernel executing on a GPU.
1213
12141. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1215 executed is obtained.
12162. A pointer to the kernel descriptor (see
1217 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1218 obtained. It must be for a kernel that is contained in a code object that that
1219 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1220 associated.
12213. Space is allocated for the kernel arguments using the ROCm runtime allocator
1222 for a memory region with the kernarg property for the kernel agent that will
1223 execute the kernel. It must be at least 16 byte aligned.
12244. Kernel argument values are assigned to the kernel argument memory
1225 allocation. The layout is defined in the *HSA Programmer’s Language Reference*
1226 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1227 memory in the same way constant memory is accessed. (Note that the HSA
1228 specification allows an implementation to copy the kernel argument contents to
1229 another location that is accessed by the kernel.)
12305. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1231 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1232 packet. The packet must be set up, and the final write must use an atomic
1233 store release to set the packet kind to ensure the packet contents are
1234 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1235 notify the kernel agent that the AQL queue has been updated. These rules, and
1236 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1237 System Architecture Specification* [HSA]_.
12386. A kernel dispatch packet includes information about the actual dispatch,
1239 such as grid and work-group size, together with information from the code
1240 object about the kernel, such as segment sizes. The ROCm runtime queries on
1241 the kernel symbol can be used to obtain the code object values which are
1242 recorded in the :ref:`amdgpu-code-object-metadata`.
12437. CP executes micro-code and is responsible for detecting and setting up the
1244 GPU to execute the wavefronts of a kernel dispatch.
12458. CP ensures that when the a wavefront starts executing the kernel machine
1246 code, the scalar general purpose registers (SGPR) and vector general purpose
1247 registers (VGPR) are set up as required by the machine code. The required
1248 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1249 register state is defined in
1250 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
12519. The prolog of the kernel machine code (see
1252 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1253 before continuing executing the machine code that corresponds to the kernel.
125410. When the kernel dispatch has completed execution, CP signals the completion
1255 signal specified in the kernel dispatch packet if not 0.
1256
1257.. _amdgpu-amdhsa-memory-spaces:
1258
1259Memory Spaces
1260~~~~~~~~~~~~~
1261
1262The memory space properties are:
1263
1264 .. table:: AMDHSA Memory Spaces
1265 :name: amdgpu-amdhsa-memory-spaces-table
1266
1267 ================= =========== ======== ======= ==================
1268 Memory Space Name HSA Segment Hardware Address NULL Value
1269 Name Name Size
1270 ================= =========== ======== ======= ==================
1271 Private private scratch 32 0x00000000
1272 Local group LDS 32 0xFFFFFFFF
1273 Global global global 64 0x0000000000000000
1274 Constant constant *same as 64 0x0000000000000000
1275 global*
1276 Generic flat flat 64 0x0000000000000000
1277 Region N/A GDS 32 *not implemented
1278 for AMDHSA*
1279 ================= =========== ======== ======= ==================
1280
1281The global and constant memory spaces both use global virtual addresses, which
1282are the same virtual address space used by the CPU. However, some virtual
1283addresses may only be accessible to the CPU, some only accessible by the GPU,
1284and some by both.
1285
1286Using the constant memory space indicates that the data will not change during
1287the execution of the kernel. This allows scalar read instructions to be
1288used. The vector and scalar L1 caches are invalidated of volatile data before
1289each kernel dispatch execution to allow constant memory to change values between
1290kernel dispatches.
1291
1292The local memory space uses the hardware Local Data Store (LDS) which is
1293automatically allocated when the hardware creates work-groups of wavefronts, and
1294freed when all the wavefronts of a work-group have terminated. The data store
1295(DS) instructions can be used to access it.
1296
1297The private memory space uses the hardware scratch memory support. If the kernel
1298uses scratch, then the hardware allocates memory that is accessed using
1299wavefront lane dword (4 byte) interleaving. The mapping used from private
1300address to physical address is:
1301
1302 ``wavefront-scratch-base +
1303 (private-address * wavefront-size * 4) +
1304 (wavefront-lane-id * 4)``
1305
1306There are different ways that the wavefront scratch base address is determined
1307by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1308memory can be accessed in an interleaved manner using buffer instruction with
1309the scratch buffer descriptor and per wave scratch offset, by the scratch
1310instructions, or by flat instructions. If each lane of a wavefront accesses the
1311same private address, the interleaving results in adjacent dwords being accessed
1312and hence requires fewer cache lines to be fetched. Multi-dword access is not
1313supported except by flat and scratch instructions in GFX9.
1314
1315The generic address space uses the hardware flat address support available in
1316GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1317local appertures), that are outside the range of addressible global memory, to
1318map from a flat address to a private or local address.
1319
1320FLAT instructions can take a flat address and access global, private (scratch)
1321and group (LDS) memory depending in if the address is within one of the
1322apperture ranges. Flat access to scratch requires hardware aperture setup and
1323setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1324access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1325(see :ref:`amdgpu-amdhsa-m0`).
1326
1327To convert between a segment address and a flat address the base address of the
1328appertures address can be used. For GFX7-GFX8 these are available in the
1329:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1330Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1331GFX9 the appature base addresses are directly available as inline constant
1332registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1333address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1334which makes it easier to convert from flat to segment or segment to flat.
1335
1336HSA Image and Samplers
1337~~~~~~~~~~~~~~~~~~~~~~
1338
1339Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1340hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1341HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1342enumeration values for the queries that are not trivially deducible from the S#
1343representation.
1344
1345HSA Signals
1346~~~~~~~~~~~
1347
1348Signal handles created by the ROCm runtime are 64 bit addresses of a structure
1349allocated in memory accessible from both the CPU and GPU. The structure is
1350defined by the ROCm runtime and subject to change between releases (see
1351[AMD-ROCm-github]_).
1352
1353.. _amdgpu-amdhsa-hsa-aql-queue:
1354
1355HSA AQL Queue
1356~~~~~~~~~~~~~
1357
1358The AQL queue structure is defined by the ROCm runtime and subject to change
1359between releases (see [AMD-ROCm-github]_). For some processors it contains
1360fields needed to implement certain language features such as the flat address
1361aperture bases. It also contains fields used by CP such as managing the
1362allocation of scratch memory.
1363
1364.. _amdgpu-amdhsa-kernel-descriptor:
1365
1366Kernel Descriptor
1367~~~~~~~~~~~~~~~~~
1368
1369A kernel descriptor consists of the information needed by CP to initiate the
1370execution of a kernel, including the entry point address of the machine code
1371that implements the kernel.
1372
1373Kernel Descriptor for GFX6-GFX9
1374+++++++++++++++++++++++++++++++
1375
1376CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1377
1378 .. table:: Kernel Descriptor for GFX6-GFX9
1379 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1380
1381 ======= ======= =============================== ===========================
1382 Bits Size Field Name Description
1383 ======= ======= =============================== ===========================
1384 31:0 4 bytes group_segment_fixed_size The amount of fixed local
1385 address space memory
1386 required for a work-group
1387 in bytes. This does not
1388 include any dynamically
1389 allocated local address
1390 space memory that may be
1391 added when the kernel is
1392 dispatched.
1393 63:32 4 bytes private_segment_fixed_size The amount of fixed
1394 private address space
1395 memory required for a
1396 work-item in bytes. If
1397 is_dynamic_callstack is 1
1398 then additional space must
1399 be added to this value for
1400 the call stack.
1401 95:64 4 bytes max_flat_workgroup_size Maximum flat work-group
1402 size supported by the
1403 kernel in work-items.
1404 96 1 bit is_dynamic_call_stack Indicates if the generated
1405 machine code is using a
1406 dynamically sized call
1407 stack.
1408 97 1 bit is_xnack_enabled Indicates if the generated
1409 machine code is capable of
1410 suppoting XNACK.
1411 127:98 30 bits Reserved. Must be 0.
1412 191:128 8 bytes kernel_code_entry_byte_offset Byte offset (possibly
1413 negative) from base
1414 address of kernel
1415 descriptor to kernel's
1416 entry point instruction
1417 which must be 256 byte
1418 aligned.
1419 383:192 24 Reserved. Must be 0.
1420 bytes
1421 415:384 4 bytes compute_pgm_rsrc1 Compute Shader (CS)
1422 program settings used by
1423 CP to set up
1424 ``COMPUTE_PGM_RSRC1``
1425 configuration
1426 register. See
1427 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
1428 447:416 4 bytes compute_pgm_rsrc2 Compute Shader (CS)
1429 program settings used by
1430 CP to set up
1431 ``COMPUTE_PGM_RSRC2``
1432 configuration
1433 register. See
1434 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1435 448 1 bit enable_sgpr_private_segment Enable the setup of the
1436 _buffer SGPR user data registers
1437 (see
1438 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1439
1440 The total number of SGPR
1441 user data registers
1442 requested must not exceed
1443 16 and match value in
1444 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1445 Any requests beyond 16
1446 will be ignored.
1447 449 1 bit enable_sgpr_dispatch_ptr *see above*
1448 450 1 bit enable_sgpr_queue_ptr *see above*
1449 451 1 bit enable_sgpr_kernarg_segment_ptr *see above*
1450 452 1 bit enable_sgpr_dispatch_id *see above*
1451 453 1 bit enable_sgpr_flat_scratch_init *see above*
1452 454 1 bit enable_sgpr_private_segment *see above*
1453 _size
1454 455 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1455 _count_X should always be 0.
1456 456 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1457 _count_Y should always be 0.
1458 457 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1459 _count_Z should always be 0.
1460 463:458 6 bits Reserved. Must be 0.
1461 511:464 4 Reserved. Must be 0.
1462 bytes
1463 512 **Total size 64 bytes.**
1464 ======= ===================================================================
1465
1466..
1467
1468 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1469 :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
1470
1471 ======= ======= =============================== ===========================
1472 Bits Size Field Name Description
1473 ======= ======= =============================== ===========================
1474 5:0 6 bits granulated_workitem_vgpr_count Number of vector registers
1475 used by each work-item,
1476 granularity is device
1477 specific:
1478
1479 GFX6-9
1480 roundup((max-vgpg + 1)
1481 / 4) - 1
1482
1483 Used by CP to set up
1484 ``COMPUTE_PGM_RSRC1.VGPRS``.
1485 9:6 4 bits granulated_wavefront_sgpr_count Number of scalar registers
1486 used by a wavefront,
1487 granularity is device
1488 specific:
1489
1490 GFX6-8
1491 roundup((max-sgpg + 1)
1492 / 8) - 1
1493 GFX9
1494 roundup((max-sgpg + 1)
1495 / 16) - 1
1496
1497 Includes the special SGPRs
1498 for VCC, Flat Scratch (for
1499 GFX7 onwards) and XNACK
1500 (for GFX8 onwards). It does
1501 not include the 16 SGPR
1502 added if a trap handler is
1503 enabled.
1504
1505 Used by CP to set up
1506 ``COMPUTE_PGM_RSRC1.SGPRS``.
1507 11:10 2 bits priority Must be 0.
1508
1509 Start executing wavefront
1510 at the specified priority.
1511
1512 CP is responsible for
1513 filling in
1514 ``COMPUTE_PGM_RSRC1.PRIORITY``.
1515 13:12 2 bits float_mode_round_32 Wavefront starts execution
1516 with specified rounding
1517 mode for single (32
1518 bit) floating point
1519 precision floating point
1520 operations.
1521
1522 Floating point rounding
1523 mode values are defined in
1524 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1525
1526 Used by CP to set up
1527 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1528 15:14 2 bits float_mode_round_16_64 Wavefront starts execution
1529 with specified rounding
1530 denorm mode for half/double (16
1531 and 64 bit) floating point
1532 precision floating point
1533 operations.
1534
1535 Floating point rounding
1536 mode values are defined in
1537 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1538
1539 Used by CP to set up
1540 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1541 17:16 2 bits float_mode_denorm_32 Wavefront starts execution
1542 with specified denorm mode
1543 for single (32
1544 bit) floating point
1545 precision floating point
1546 operations.
1547
1548 Floating point denorm mode
1549 values are defined in
1550 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1551
1552 Used by CP to set up
1553 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1554 19:18 2 bits float_mode_denorm_16_64 Wavefront starts execution
1555 with specified denorm mode
1556 for half/double (16
1557 and 64 bit) floating point
1558 precision floating point
1559 operations.
1560
1561 Floating point denorm mode
1562 values are defined in
1563 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1564
1565 Used by CP to set up
1566 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1567 20 1 bit priv Must be 0.
1568
1569 Start executing wavefront
1570 in privilege trap handler
1571 mode.
1572
1573 CP is responsible for
1574 filling in
1575 ``COMPUTE_PGM_RSRC1.PRIV``.
1576 21 1 bit enable_dx10_clamp Wavefront starts execution
1577 with DX10 clamp mode
1578 enabled. Used by the vector
1579 ALU to force DX-10 style
1580 treatment of NaN's (when
1581 set, clamp NaN to zero,
1582 otherwise pass NaN
1583 through).
1584
1585 Used by CP to set up
1586 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1587 22 1 bit debug_mode Must be 0.
1588
1589 Start executing wavefront
1590 in single step mode.
1591
1592 CP is responsible for
1593 filling in
1594 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1595 23 1 bit enable_ieee_mode Wavefront starts execution
1596 with IEEE mode
1597 enabled. Floating point
1598 opcodes that support
1599 exception flag gathering
1600 will quiet and propagate
1601 signaling-NaN inputs per
1602 IEEE 754-2008. Min_dx10 and
1603 max_dx10 become IEEE
1604 754-2008 compliant due to
1605 signaling-NaN propagation
1606 and quieting.
1607
1608 Used by CP to set up
1609 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1610 24 1 bit bulky Must be 0.
1611
1612 Only one work-group allowed
1613 to execute on a compute
1614 unit.
1615
1616 CP is responsible for
1617 filling in
1618 ``COMPUTE_PGM_RSRC1.BULKY``.
1619 25 1 bit cdbg_user Must be 0.
1620
1621 Flag that can be used to
1622 control debugging code.
1623
1624 CP is responsible for
1625 filling in
1626 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1627 31:26 6 bits Reserved. Must be 0.
1628 32 **Total size 4 bytes**
1629 ======= ===================================================================
1630
1631..
1632
1633 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1634 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1635
1636 ======= ======= =============================== ===========================
1637 Bits Size Field Name Description
1638 ======= ======= =============================== ===========================
1639 0 1 bit enable_sgpr_private_segment Enable the setup of the
1640 _wave_offset SGPR wave scratch offset
1641 system register (see
1642 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1643
1644 Used by CP to set up
1645 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1646 5:1 5 bits user_sgpr_count The total number of SGPR
1647 user data registers
1648 requested. This number must
1649 match the number of user
1650 data registers enabled.
1651
1652 Used by CP to set up
1653 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1654 6 1 bit enable_trap_handler Set to 1 if code contains a
1655 TRAP instruction which
1656 requires a trap hander to
1657 be enabled.
1658
1659 CP sets
1660 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1661 if the runtime has
1662 installed a trap handler
1663 regardless of the setting
1664 of this field.
1665 7 1 bit enable_sgpr_workgroup_id_x Enable the setup of the
1666 system SGPR register for
1667 the work-group id in the X
1668 dimension (see
1669 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1670
1671 Used by CP to set up
1672 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1673 8 1 bit enable_sgpr_workgroup_id_y Enable the setup of the
1674 system SGPR register for
1675 the work-group id in the Y
1676 dimension (see
1677 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1678
1679 Used by CP to set up
1680 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1681 9 1 bit enable_sgpr_workgroup_id_z Enable the setup of the
1682 system SGPR register for
1683 the work-group id in the Z
1684 dimension (see
1685 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1686
1687 Used by CP to set up
1688 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1689 10 1 bit enable_sgpr_workgroup_info Enable the setup of the
1690 system SGPR register for
1691 work-group information (see
1692 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1693
1694 Used by CP to set up
1695 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1696 12:11 2 bits enable_vgpr_workitem_id Enable the setup of the
1697 VGPR system registers used
1698 for the work-item ID.
1699 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1700 defines the values.
1701
1702 Used by CP to set up
1703 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1704 13 1 bit enable_exception_address_watch Must be 0.
1705
1706 Wavefront starts execution
1707 with address watch
1708 exceptions enabled which
1709 are generated when L1 has
1710 witnessed a thread access
1711 an *address of
1712 interest*.
1713
1714 CP is responsible for
1715 filling in the address
1716 watch bit in
1717 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1718 according to what the
1719 runtime requests.
1720 14 1 bit enable_exception_memory Must be 0.
1721
1722 Wavefront starts execution
1723 with memory violation
1724 exceptions exceptions
1725 enabled which are generated
1726 when a memory violation has
1727 occurred for this wave from
1728 L1 or LDS
1729 (write-to-read-only-memory,
1730 mis-aligned atomic, LDS
1731 address out of range,
1732 illegal address, etc.).
1733
1734 CP sets the memory
1735 violation bit in
1736 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1737 according to what the
1738 runtime requests.
1739 23:15 9 bits granulated_lds_size Must be 0.
1740
1741 CP uses the rounded value
1742 from the dispatch packet,
1743 not this value, as the
1744 dispatch may contain
1745 dynamically allocated group
1746 segment memory. CP writes
1747 directly to
1748 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1749
1750 Amount of group segment
1751 (LDS) to allocate for each
1752 work-group. Granularity is
1753 device specific:
1754
1755 GFX6:
1756 roundup(lds-size / (64 * 4))
1757 GFX7-GFX9:
1758 roundup(lds-size / (128 * 4))
1759
1760 24 1 bit enable_exception_ieee_754_fp Wavefront starts execution
1761 _invalid_operation with specified exceptions
1762 enabled.
1763
1764 Used by CP to set up
1765 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1766 (set from bits 0..6).
1767
1768 IEEE 754 FP Invalid
1769 Operation
1770 25 1 bit enable_exception_fp_denormal FP Denormal one or more
1771 _source input operands is a
1772 denormal number
1773 26 1 bit enable_exception_ieee_754_fp IEEE 754 FP Division by
1774 _division_by_zero Zero
1775 27 1 bit enable_exception_ieee_754_fp IEEE 754 FP FP Overflow
1776 _overflow
1777 28 1 bit enable_exception_ieee_754_fp IEEE 754 FP Underflow
1778 _underflow
1779 29 1 bit enable_exception_ieee_754_fp IEEE 754 FP Inexact
1780 _inexact
1781 30 1 bit enable_exception_int_divide_by Integer Division by Zero
1782 _zero (rcp_iflag_f32 instruction
1783 only)
1784 31 1 bit Reserved. Must be 0.
1785 32 **Total size 4 bytes.**
1786 ======= ===================================================================
1787
1788..
1789
1790 .. table:: Floating Point Rounding Mode Enumeration Values
1791 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1792
1793 ===================================== ===== ===============================
1794 Enumeration Name Value Description
1795 ===================================== ===== ===============================
1796 AMD_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1797 AMD_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1798 AMD_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1799 AMD_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1800 ===================================== ===== ===============================
1801
1802..
1803
1804 .. table:: Floating Point Denorm Mode Enumeration Values
1805 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1806
1807 ===================================== ===== ===============================
1808 Enumeration Name Value Description
1809 ===================================== ===== ===============================
1810 AMD_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1811 Denorms
1812 AMD_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1813 AMD_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1814 AMD_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1815 ===================================== ===== ===============================
1816
1817..
1818
1819 .. table:: System VGPR Work-Item ID Enumeration Values
1820 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1821
1822 ===================================== ===== ===============================
1823 Enumeration Name Value Description
1824 ===================================== ===== ===============================
1825 AMD_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension ID.
1826 AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1827 dimensions ID.
1828 AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1829 dimensions ID.
1830 AMD_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1831 ===================================== ===== ===============================
1832
1833.. _amdgpu-amdhsa-initial-kernel-execution-state:
1834
1835Initial Kernel Execution State
1836~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1837
1838This section defines the register state that will be set up by the packet
1839processor prior to the start of execution of every wavefront. This is limited by
1840the constraints of the hardware controllers of CP/ADC/SPI.
1841
1842The order of the SGPR registers is defined, but the compiler can specify which
1843ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1844fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1845for enabled registers are dense starting at SGPR0: the first enabled register is
1846SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1847an SGPR number.
1848
1849The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1850all waves of the grid. It is possible to specify more than 16 User SGPRs using
1851the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1852initialized. These are then immediately followed by the System SGPRs that are
1853set up by ADC/SPI and can have different values for each wave of the grid
1854dispatch.
1855
1856SGPR register initial state is defined in
1857:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1858
1859 .. table:: SGPR Register Set Up Order
1860 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1861
1862 ========== ========================== ====== ==============================
1863 SGPR Order Name Number Description
1864 (kernel descriptor enable of
1865 field) SGPRs
1866 ========== ========================== ====== ==============================
1867 First Private Segment Buffer 4 V# that can be used, together
1868 (enable_sgpr_private with Scratch Wave Offset as an
1869 _segment_buffer) offset, to access the private
1870 memory space using a segment
1871 address.
1872
1873 CP uses the value provided by
1874 the runtime.
1875 then Dispatch Ptr 2 64 bit address of AQL dispatch
1876 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
1877 actually executing.
1878 then Queue Ptr 2 64 bit address of amd_queue_t
1879 (enable_sgpr_queue_ptr) object for AQL queue on which
1880 the dispatch packet was
1881 queued.
1882 then Kernarg Segment Ptr 2 64 bit address of Kernarg
1883 (enable_sgpr_kernarg segment. This is directly
1884 _segment_ptr) copied from the
1885 kernarg_address in the kernel
1886 dispatch packet.
1887
1888 Having CP load it once avoids
1889 loading it at the beginning of
1890 every wavefront.
1891 then Dispatch Id 2 64 bit Dispatch ID of the
1892 (enable_sgpr_dispatch_id) dispatch packet being
1893 executed.
1894 then Flat Scratch Init 2 This is 2 SGPRs:
1895 (enable_sgpr_flat_scratch
1896 _init) GFX6
1897 Not supported.
1898 GFX7-GFX8
1899 The first SGPR is a 32 bit
1900 byte offset from
1901 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1902 to per SPI base of memory
1903 for scratch for the queue
1904 executing the kernel
1905 dispatch. CP obtains this
1906 from the runtime.
1907
1908 This is the same offset used
1909 in computing the Scratch
1910 Segment Buffer base
1911 address. The value of
1912 Scratch Wave Offset must be
1913 added by the kernel machine
1914 code and moved to SGPRn-4
1915 for use as the FLAT SCRATCH
1916 BASE in flat memory
1917 instructions.
1918
1919 The second SGPR is 32 bit
1920 byte size of a single
1921 work-item’s scratch memory
1922 usage. This is directly
1923 loaded from the kernel
1924 dispatch packet Private
1925 Segment Byte Size and
1926 rounded up to a multiple of
1927 DWORD.
1928
1929 The kernel code must move to
1930 SGPRn-3 for use as the FLAT
1931 SCRATCH SIZE in flat memory
1932 instructions. Having CP load
1933 it once avoids loading it at
1934 the beginning of every
1935 wavefront.
1936 GFX9
1937 This is the 64 bit base
1938 address of the per SPI
1939 scratch backing memory
1940 managed by SPI for the queue
1941 executing the kernel
1942 dispatch. CP obtains this
1943 from the runtime (and
1944 divides it if there are
1945 multiple Shader Arrays each
1946 with its own SPI). The value
1947 of Scratch Wave Offset must
1948 be added by the kernel
1949 machine code and moved to
1950 SGPRn-4 and SGPRn-3 for use
1951 as the FLAT SCRATCH BASE in
1952 flat memory instructions.
1953 then Private Segment Size 1 The 32 bit byte size of a
1954 (enable_sgpr_private single work-item’s scratch
1955 _segment_size) memory allocation. This is the
1956 value from the kernel dispatch
1957 packet Private Segment Byte
1958 Size rounded up by CP to a
1959 multiple of DWORD.
1960
1961 Having CP load it once avoids
1962 loading it at the beginning of
1963 every wavefront.
1964
1965 This is not used for
1966 GFX7-GFX8 since it is the same
1967 value as the second SGPR of
1968 Flat Scratch Init. However, it
1969 may be needed for GFX9 which
1970 changes the meaning of the
1971 Flat Scratch Init value.
1972 then Grid Work-Group Count X 1 32 bit count of the number of
1973 (enable_sgpr_grid work-groups in the X dimension
1974 _workgroup_count_X) for the grid being
1975 executed. Computed from the
1976 fields in the kernel dispatch
1977 packet as ((grid_size.x +
1978 workgroup_size.x - 1) /
1979 workgroup_size.x).
1980 then Grid Work-Group Count Y 1 32 bit count of the number of
1981 (enable_sgpr_grid work-groups in the Y dimension
1982 _workgroup_count_Y && for the grid being
1983 less than 16 previous executed. Computed from the
1984 SGPRs) fields in the kernel dispatch
1985 packet as ((grid_size.y +
1986 workgroup_size.y - 1) /
1987 workgroupSize.y).
1988
1989 Only initialized if <16
1990 previous SGPRs initialized.
1991 then Grid Work-Group Count Z 1 32 bit count of the number of
1992 (enable_sgpr_grid work-groups in the Z dimension
1993 _workgroup_count_Z && for the grid being
1994 less than 16 previous executed. Computed from the
1995 SGPRs) fields in the kernel dispatch
1996 packet as ((grid_size.z +
1997 workgroup_size.z - 1) /
1998 workgroupSize.z).
1999
2000 Only initialized if <16
2001 previous SGPRs initialized.
2002 then Work-Group Id X 1 32 bit work-group id in X
2003 (enable_sgpr_workgroup_id dimension of grid for
2004 _X) wavefront.
2005 then Work-Group Id Y 1 32 bit work-group id in Y
2006 (enable_sgpr_workgroup_id dimension of grid for
2007 _Y) wavefront.
2008 then Work-Group Id Z 1 32 bit work-group id in Z
2009 (enable_sgpr_workgroup_id dimension of grid for
2010 _Z) wavefront.
2011 then Work-Group Info 1 {first_wave, 14’b0000,
2012 (enable_sgpr_workgroup ordered_append_term[10:0],
2013 _info) threadgroup_size_in_waves[5:0]}
2014 then Scratch Wave Offset 1 32 bit byte offset from base
2015 (enable_sgpr_private of scratch base of queue
2016 _segment_wave_offset) executing the kernel
2017 dispatch. Must be used as an
2018 offset with Private
2019 segment address when using
2020 Scratch Segment Buffer. It
2021 must be used to set up FLAT
2022 SCRATCH for flat addressing
2023 (see
2024 :ref:`amdgpu-amdhsa-flat-scratch`).
2025 ========== ========================== ====== ==============================
2026
2027The order of the VGPR registers is defined, but the compiler can specify which
2028ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2029fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2030for enabled registers are dense starting at VGPR0: the first enabled register is
2031VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2032VGPR number.
2033
2034VGPR register initial state is defined in
2035:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2036
2037 .. table:: VGPR Register Set Up Order
2038 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2039
2040 ========== ========================== ====== ==============================
2041 VGPR Order Name Number Description
2042 (kernel descriptor enable of
2043 field) VGPRs
2044 ========== ========================== ====== ==============================
2045 First Work-Item Id X 1 32 bit work item id in X
2046 (Always initialized) dimension of work-group for
2047 wavefront lane.
2048 then Work-Item Id Y 1 32 bit work item id in Y
2049 (enable_vgpr_workitem_id dimension of work-group for
2050 > 0) wavefront lane.
2051 then Work-Item Id Z 1 32 bit work item id in Z
2052 (enable_vgpr_workitem_id dimension of work-group for
2053 > 1) wavefront lane.
2054 ========== ========================== ====== ==============================
2055
2056The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2057
20581. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2059 registers.
20602. Work-group Id registers X, Y, Z are set by ADC which supports any
2061 combination including none.
20623. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2063 cannot included with the flat scratch init value which is per queue.
20644. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2065 or (X, Y, Z).
2066
2067Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2068value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2069
2070The global segment can be accessed either using buffer instructions (GFX6 which
2071has V# 64 bit address support), flat instructions (GFX7-9), or global
2072instructions (GFX9).
2073
2074If buffer operations are used then the compiler can generate a V# with the
2075following properties:
2076
2077* base address of 0
2078* no swizzle
2079* ATC: 1 if IOMMU present (such as APU)
2080* ptr64: 1
2081* MTYPE set to support memory coherence that matches the runtime (such as CC for
2082 APU and NC for dGPU).
2083
2084.. _amdgpu-amdhsa-kernel-prolog:
2085
2086Kernel Prolog
2087~~~~~~~~~~~~~
2088
2089.. _amdgpu-amdhsa-m0:
2090
2091M0
2092++
2093
2094GFX6-GFX8
2095 The M0 register must be initialized with a value at least the total LDS size
2096 if the kernel may access LDS via DS or flat operations. Total LDS size is
2097 available in dispatch packet. For M0, it is also possible to use maximum
2098 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2099 GFX7-GFX8).
2100GFX9
2101 The M0 register is not used for range checking LDS accesses and so does not
2102 need to be initialized in the prolog.
2103
2104.. _amdgpu-amdhsa-flat-scratch:
2105
2106Flat Scratch
2107++++++++++++
2108
2109If the kernel may use flat operations to access scratch memory, the prolog code
2110must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2111are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2112Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2113
2114GFX6
2115 Flat scratch is not supported.
2116
2117GFX7-8
2118 1. The low word of Flat Scratch Init is 32 bit byte offset from
2119 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2120 being managed by SPI for the queue executing the kernel dispatch. This is
2121 the same value used in the Scratch Segment Buffer V# base address. The
2122 prolog must add the value of Scratch Wave Offset to get the wave's byte
2123 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2124 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2125 by 8 before moving into FLAT_SCRATCH_LO.
2126 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2127 work-items scratch memory usage. This is directly loaded from the kernel
2128 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2129 DWORD. Having CP load it once avoids loading it at the beginning of every
2130 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2131 SIZE.
2132GFX9
2133 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2134 memory being managed by SPI for the queue executing the kernel dispatch. The
2135 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2136 pair for use as the flat scratch base in flat memory instructions.
2137
2138.. _amdgpu-amdhsa-memory-model:
2139
2140Memory Model
2141~~~~~~~~~~~~
2142
2143This section describes the mapping of LLVM memory model onto AMDGPU machine code
2144(see :ref:`memmodel`). *The implementation is WIP.*
2145
2146.. TODO
2147 Update when implementation complete.
2148
2149 Support more relaxed OpenCL memory model to be controled by environment
2150 component of target triple.
2151
2152The AMDGPU backend supports the memory synchronization scopes specified in
2153:ref:`amdgpu-memory-scopes`.
2154
2155The code sequences used to implement the memory model are defined in table
2156:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2157
2158The sequences specify the order of instructions that a single thread must
2159execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2160to other memory instructions executed by the same thread. This allows them to be
2161moved earlier or later which can allow them to be combined with other instances
2162of the same instruction, or hoisted/sunk out of loops to improve
2163performance. Only the instructions related to the memory model are given;
2164additional ``s_waitcnt`` instructions are required to ensure registers are
2165defined before being used. These may be able to be combined with the memory
2166model ``s_waitcnt`` instructions as described above.
2167
2168The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
2169OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
2170relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
2171memory model which has separate happens-before relations for the global and
2172local address spaces, and only a fence specifying both global and local address
2173space joins the relationships. Since the LLVM ``memfence`` instruction does not
2174allow an address space to be specified the OpenCL fence has to convervatively
2175assume both local and global address space was specified. However, optimizations
2176can often be done to eliminate the additional ``s_waitcnt``instructions when
2177there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
2178instructions. The code sequences in the table indicate what can be omitted for
2179the OpenCL memory. The target triple environment is used to determine if the
2180source language is OpenCL (see :ref:`amdgpu-opencl`).
2181
2182``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2183operations.
2184
2185``buffer/global/flat_load/store/atomic`` instructions to global memory are
2186termed vector memory operations.
2187
2188For GFX6-GFX9:
2189
2190* Each agent has multiple compute units (CU).
2191* Each CU has multiple SIMDs that execute wavefronts.
2192* The wavefronts for a single work-group are executed in the same CU but may be
2193 executed by different SIMDs.
2194* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2195 executing on it.
2196* All LDS operations of a CU are performed as wavefront wide operations in a
2197 global order and involve no caching. Completion is reported to a wavefront in
2198 execution order.
2199* The LDS memory has multiple request queues shared by the SIMDs of a
2200 CU. Therefore, the LDS operations performed by different waves of a work-group
2201 can be reordered relative to each other, which can result in reordering the
2202 visibility of vector memory operations with respect to LDS operations of other
2203 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2204 ensure synchonization between LDS operations and vector memory operations
2205 between waves of a work-group, but not between operations performed by the
2206 same wavefront.
2207* The vector memory operations are performed as wavefront wide operations and
2208 completion is reported to a wavefront in execution order. The exception is
2209 that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2210 vector memory order if they access LDS memory, and out of LDS operation order
2211 if they access global memory.
2212* The vector memory operations access a vector L1 cache shared by all wavefronts
2213 on a CU. Therefore, no special action is required for coherence between
2214 wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
2215 coherence between waves executing in different work-groups as they may be
2216 executing on different CUs.
2217* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2218 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2219 scalar operations are used in a restricted way so do not impact the memory
2220 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2221* The vector and scalar memory operations use an L2 cache shared by all CUs on
2222 the same agent.
2223* The L2 cache has independent channels to service disjoint ranges of virtual
2224 addresses.
2225* Each CU has a separate request queue per channel. Therefore, the vector and
2226 scalar memory operations performed by waves executing in different work-groups
2227 (which may be executing on different CUs) of an agent can be reordered
2228 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2229 synchonization between vector memory operations of different CUs. It ensures a
2230 previous vector memory operation has completed before executing a subsequent
2231 vector memory or LDS operation and so can be used to meet the requirements of
2232 acquire and release.
2233* The L2 cache can be kept coherent with other agents on some targets, or ranges
2234 of virtual addresses can be set up to bypass it to ensure system coherence.
2235
2236Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2237or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2238memory, atomic memory orderings are not meaningful and all accesses are treated
2239as non-atomic.
2240
2241Constant address space uses ``buffer/global_load`` instructions (or equivalent
2242scalar memory instructions). Since the constant address space contents do not
2243change during the execution of a kernel dispatch it is not legal to perform
2244stores, and atomic memory orderings are not meaningful and all access are
2245treated as non-atomic.
2246
2247A memory synchronization scope wider than work-group is not meaningful for the
2248group (LDS) address space and is treated as work-group.
2249
2250The memory model does not support the region address space which is treated as
2251non-atomic.
2252
2253Acquire memory ordering is not meaningful on store atomic instructions and is
2254treated as non-atomic.
2255
2256Release memory ordering is not meaningful on load atomic instructions and is
2257treated a non-atomic.
2258
2259Acquire-release memory ordering is not meaningful on load or store atomic
2260instructions and is treated as acquire and release respectively.
2261
2262AMDGPU backend only uses scalar memory operations to access memory that is
2263proven to not change during the execution of the kernel dispatch. This includes
2264constant address space and global address space for program scope const
2265variables. Therefore the kernel machine code does not have to maintain the
2266scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2267and vector L1 caches are invalidated between kernel dispatches by CP since
2268constant address space data may change between kernel dispatch executions. See
2269:ref:`amdgpu-amdhsa-memory-spaces`.
2270
2271The one exeception is if scalar writes are used to spill SGPR registers. In this
2272case the AMDGPU backend ensures the memory location used to spill is never
2273accessed by vector memory operations at the same time. If scalar writes are used
2274then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2275return since the locations may be used for vector memory instructions by a
2276future wave that uses the same scratch area, or a function call that creates a
2277frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2278as all scalar writes are write-before-read in the same thread.
2279
2280Scratch backing memory (which is used for the private address space) is accessed
2281with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
2282is only accessed by a single thread, and is always write-before-read,
2283there is never a need to invalidate these entries from the L1 cache. Hence all
2284cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
2285lines.
2286
2287On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2288to invalidate the L2 cache. This also causes it to be treated as non-volatile
2289and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
2290coherent) and so the L2 cache will coherent with the CPU and other agents.
2291
2292 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2293 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2294
2295 ============ ============ ============== ========== =======================
2296 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2297 Ordering Sync Scope Address
2298 Space
2299 ============ ============ ============== ========== =======================
2300 **Non-Atomic**
2301 ---------------------------------------------------------------------------
2302 load *none* *none* - global non-volatile
2303 - generic 1. buffer/global/flat_load
2304 volatile
2305 1. buffer/global/flat_load
2306 glc=1
2307 load *none* *none* - local 1. ds_load
2308 store *none* *none* - global 1. buffer/global/flat_store
2309 - generic
2310 store *none* *none* - local 1. ds_store
2311 **Unordered Atomic**
2312 ---------------------------------------------------------------------------
2313 load atomic unordered *any* *any* *Same as non-atomic*.
2314 store atomic unordered *any* *any* *Same as non-atomic*.
2315 atomicrmw unordered *any* *any* *Same as monotonic
2316 atomic*.
2317 **Monotonic Atomic**
2318 ---------------------------------------------------------------------------
2319 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2320 - wavefront - generic
2321 - workgroup
2322 load atomic monotonic - singlethread - local 1. ds_load
2323 - wavefront
2324 - workgroup
2325 load atomic monotonic - agent - global 1. buffer/global/flat_load
2326 - system - generic glc=1
2327 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2328 - wavefront - generic
2329 - workgroup
2330 - agent
2331 - system
2332 store atomic monotonic - singlethread - local 1. ds_store
2333 - wavefront
2334 - workgroup
2335 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2336 - wavefront - generic
2337 - workgroup
2338 - agent
2339 - system
2340 atomicrmw monotonic - singlethread - local 1. ds_atomic
2341 - wavefront
2342 - workgroup
2343 **Acquire Atomic**
2344 ---------------------------------------------------------------------------
2345 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2346 - wavefront - local
2347 - generic
2348 load atomic acquire - workgroup - global 1. buffer/global_load
2349 load atomic acquire - workgroup - local 1. ds/flat_load
2350 - generic 2. s_waitcnt lgkmcnt(0)
2351
2352 - If OpenCL, omit
2353 waitcnt.
2354 - Must happen before
2355 any following
2356 global/generic
2357 load/load
2358 atomic/store/store
2359 atomic/atomicrmw.
2360 - Ensures any
2361 following global
2362 data read is no
2363 older than the load
2364 atomic value being
2365 acquired.
2366
2367 load atomic acquire - agent - global 1. buffer/global_load
2368 - system glc=1
2369 2. s_waitcnt vmcnt(0)
2370
2371 - Must happen before
2372 following
2373 buffer_wbinvl1_vol.
2374 - Ensures the load
2375 has completed
2376 before invalidating
2377 the cache.
2378
2379 3. buffer_wbinvl1_vol
2380
2381 - Must happen before
2382 any following
2383 global/generic
2384 load/load
2385 atomic/atomicrmw.
2386 - Ensures that
2387 following
2388 loads will not see
2389 stale global data.
2390
2391 load atomic acquire - agent - generic 1. flat_load glc=1
2392 - system 2. s_waitcnt vmcnt(0) &
2393 lgkmcnt(0)
2394
2395 - If OpenCL omit
2396 lgkmcnt(0).
2397 - Must happen before
2398 following
2399 buffer_wbinvl1_vol.
2400 - Ensures the flat_load
2401 has completed
2402 before invalidating
2403 the cache.
2404
2405 3. buffer_wbinvl1_vol
2406
2407 - Must happen before
2408 any following
2409 global/generic
2410 load/load
2411 atomic/atomicrmw.
2412 - Ensures that
2413 following loads
2414 will not see stale
2415 global data.
2416
2417 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2418 - wavefront - local
2419 - generic
2420 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
2421 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
2422 - generic 2. waitcnt lgkmcnt(0)
2423
2424 - If OpenCL, omit
2425 waitcnt.
2426 - Must happen before
2427 any following
2428 global/generic
2429 load/load
2430 atomic/store/store
2431 atomic/atomicrmw.
2432 - Ensures any
2433 following global
2434 data read is no
2435 older than the
2436 atomicrmw value
2437 being acquired.
2438
2439 atomicrmw acquire - agent - global 1. buffer/global_atomic
2440 - system 2. s_waitcnt vmcnt(0)
2441
2442 - Must happen before
2443 following
2444 buffer_wbinvl1_vol.
2445 - Ensures the
2446 atomicrmw has
2447 completed before
2448 invalidating the
2449 cache.
2450
2451 3. buffer_wbinvl1_vol
2452
2453 - Must happen before
2454 any following
2455 global/generic
2456 load/load
2457 atomic/atomicrmw.
2458 - Ensures that
2459 following loads
2460 will not see stale
2461 global data.
2462
2463 atomicrmw acquire - agent - generic 1. flat_atomic
2464 - system 2. s_waitcnt vmcnt(0) &
2465 lgkmcnt(0)
2466
2467 - If OpenCL, omit
2468 lgkmcnt(0).
2469 - Must happen before
2470 following
2471 buffer_wbinvl1_vol.
2472 - Ensures the
2473 atomicrmw has
2474 completed before
2475 invalidating the
2476 cache.
2477
2478 3. buffer_wbinvl1_vol
2479
2480 - Must happen before
2481 any following
2482 global/generic
2483 load/load
2484 atomic/atomicrmw.
2485 - Ensures that
2486 following loads
2487 will not see stale
2488 global data.
2489
2490 fence acquire - singlethread *none* *none*
2491 - wavefront
2492 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2493
2494 - If OpenCL and
2495 address space is
2496 not generic, omit
2497 waitcnt. However,
2498 since LLVM
2499 currently has no
2500 address space on
2501 the fence need to
2502 conservatively
2503 always generate. If
2504 fence had an
2505 address space then
2506 set to address
2507 space of OpenCL
2508 fence flag, or to
2509 generic if both
2510 local and global
2511 flags are
2512 specified.
2513 - Must happen after
2514 any preceding
2515 local/generic load
2516 atomic/atomicrmw
2517 with an equal or
2518 wider sync scope
2519 and memory ordering
2520 stronger than
2521 unordered (this is
2522 termed the
2523 fence-paired-atomic).
2524 - Must happen before
2525 any following
2526 global/generic
2527 load/load
2528 atomic/store/store
2529 atomic/atomicrmw.
2530 - Ensures any
2531 following global
2532 data read is no
2533 older than the
2534 value read by the
2535 fence-paired-atomic.
2536
2537 fence acquire - agent *none* 1. s_waitcnt vmcnt(0) &
2538 - system lgkmcnt(0)
2539
2540 - If OpenCL and
2541 address space is
2542 not generic, omit
2543 lgkmcnt(0).
2544 However, since LLVM
2545 currently has no
2546 address space on
2547 the fence need to
2548 conservatively
2549 always generate
2550 (see comment for
2551 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002552 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002553 separate s_waitcnt
2554 vmcnt(0) and
2555 s_waitcnt
2556 lgkmcnt(0) to allow
2557 them to be
2558 independently moved
2559 according to the
2560 following rules.
2561 - s_waitcnt vmcnt(0)
2562 must happen after
2563 any preceding
2564 global/generic load
2565 atomic/atomicrmw
2566 with an equal or
2567 wider sync scope
2568 and memory ordering
2569 stronger than
2570 unordered (this is
2571 termed the
2572 fence-paired-atomic).
2573 - s_waitcnt lgkmcnt(0)
2574 must happen after
2575 any preceding
2576 group/generic load
2577 atomic/atomicrmw
2578 with an equal or
2579 wider sync scope
2580 and memory ordering
2581 stronger than
2582 unordered (this is
2583 termed the
2584 fence-paired-atomic).
2585 - Must happen before
2586 the following
2587 buffer_wbinvl1_vol.
2588 - Ensures that the
2589 fence-paired atomic
2590 has completed
2591 before invalidating
2592 the
2593 cache. Therefore
2594 any following
2595 locations read must
2596 be no older than
2597 the value read by
2598 the
2599 fence-paired-atomic.
2600
2601 2. buffer_wbinvl1_vol
2602
2603 - Must happen before
2604 any following global/generic
2605 load/load
2606 atomic/store/store
2607 atomic/atomicrmw.
2608 - Ensures that
2609 following loads
2610 will not see stale
2611 global data.
2612
2613 **Release Atomic**
2614 ---------------------------------------------------------------------------
2615 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2616 - wavefront - local
2617 - generic
2618 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2619 - generic
2620 - If OpenCL, omit
2621 waitcnt.
2622 - Must happen after
2623 any preceding
2624 local/generic
2625 load/store/load
2626 atomic/store
2627 atomic/atomicrmw.
2628 - Must happen before
2629 the following
2630 store.
2631 - Ensures that all
2632 memory operations
2633 to local have
2634 completed before
2635 performing the
2636 store that is being
2637 released.
2638
2639 2. buffer/global/flat_store
2640 store atomic release - workgroup - local 1. ds_store
2641 store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
2642 - system - generic lgkmcnt(0)
2643
2644 - If OpenCL, omit
2645 lgkmcnt(0).
2646 - Could be split into
2647 separate s_waitcnt
2648 vmcnt(0) and
2649 s_waitcnt
2650 lgkmcnt(0) to allow
2651 them to be
2652 independently moved
2653 according to the
2654 following rules.
2655 - s_waitcnt vmcnt(0)
2656 must happen after
2657 any preceding
2658 global/generic
2659 load/store/load
2660 atomic/store
2661 atomic/atomicrmw.
2662 - s_waitcnt lgkmcnt(0)
2663 must happen after
2664 any preceding
2665 local/generic
2666 load/store/load
2667 atomic/store
2668 atomic/atomicrmw.
2669 - Must happen before
2670 the following
2671 store.
2672 - Ensures that all
2673 memory operations
2674 to global have
2675 completed before
2676 performing the
2677 store that is being
2678 released.
2679
2680 2. buffer/global/ds/flat_store
2681 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2682 - wavefront - local
2683 - generic
2684 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2685 - generic
2686 - If OpenCL, omit
2687 waitcnt.
2688 - Must happen after
2689 any preceding
2690 local/generic
2691 load/store/load
2692 atomic/store
2693 atomic/atomicrmw.
2694 - Must happen before
2695 the following
2696 atomicrmw.
2697 - Ensures that all
2698 memory operations
2699 to local have
2700 completed before
2701 performing the
2702 atomicrmw that is
2703 being released.
2704
2705 2. buffer/global/flat_atomic
2706 atomicrmw release - workgroup - local 1. ds_atomic
2707 atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
2708 - system - generic lgkmcnt(0)
2709
2710 - If OpenCL, omit
2711 lgkmcnt(0).
2712 - Could be split into
2713 separate s_waitcnt
2714 vmcnt(0) and
2715 s_waitcnt
2716 lgkmcnt(0) to allow
2717 them to be
2718 independently moved
2719 according to the
2720 following rules.
2721 - s_waitcnt vmcnt(0)
2722 must happen after
2723 any preceding
2724 global/generic
2725 load/store/load
2726 atomic/store
2727 atomic/atomicrmw.
2728 - s_waitcnt lgkmcnt(0)
2729 must happen after
2730 any preceding
2731 local/generic
2732 load/store/load
2733 atomic/store
2734 atomic/atomicrmw.
2735 - Must happen before
2736 the following
2737 atomicrmw.
2738 - Ensures that all
2739 memory operations
2740 to global and local
2741 have completed
2742 before performing
2743 the atomicrmw that
2744 is being released.
2745
2746 2. buffer/global/ds/flat_atomic*
2747 fence release - singlethread *none* *none*
2748 - wavefront
2749 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2750
2751 - If OpenCL and
2752 address space is
2753 not generic, omit
2754 waitcnt. However,
2755 since LLVM
2756 currently has no
2757 address space on
2758 the fence need to
2759 conservatively
2760 always generate
2761 (see comment for
2762 previous fence).
2763 - Must happen after
2764 any preceding
2765 local/generic
2766 load/load
2767 atomic/store/store
2768 atomic/atomicrmw.
2769 - Must happen before
2770 any following store
2771 atomic/atomicrmw
2772 with an equal or
2773 wider sync scope
2774 and memory ordering
2775 stronger than
2776 unordered (this is
2777 termed the
2778 fence-paired-atomic).
2779 - Ensures that all
2780 memory operations
2781 to local have
2782 completed before
2783 performing the
2784 following
2785 fence-paired-atomic.
2786
2787 fence release - agent *none* 1. s_waitcnt vmcnt(0) &
2788 - system lgkmcnt(0)
2789
2790 - If OpenCL and
2791 address space is
2792 not generic, omit
2793 lgkmcnt(0).
2794 However, since LLVM
2795 currently has no
2796 address space on
2797 the fence need to
2798 conservatively
2799 always generate
2800 (see comment for
2801 previous fence).
2802 - Could be split into
2803 separate s_waitcnt
2804 vmcnt(0) and
2805 s_waitcnt
2806 lgkmcnt(0) to allow
2807 them to be
2808 independently moved
2809 according to the
2810 following rules.
2811 - s_waitcnt vmcnt(0)
2812 must happen after
2813 any preceding
2814 global/generic
2815 load/store/load
2816 atomic/store
2817 atomic/atomicrmw.
2818 - s_waitcnt lgkmcnt(0)
2819 must happen after
2820 any preceding
2821 local/generic
2822 load/store/load
2823 atomic/store
2824 atomic/atomicrmw.
2825 - Must happen before
2826 any following store
2827 atomic/atomicrmw
2828 with an equal or
2829 wider sync scope
2830 and memory ordering
2831 stronger than
2832 unordered (this is
2833 termed the
2834 fence-paired-atomic).
2835 - Ensures that all
2836 memory operations
2837 to global have
2838 completed before
2839 performing the
2840 following
2841 fence-paired-atomic.
2842
2843 **Acquire-Release Atomic**
2844 ---------------------------------------------------------------------------
2845 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
2846 - wavefront - local
2847 - generic
2848 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
2849
2850 - If OpenCL, omit
2851 waitcnt.
2852 - Must happen after
2853 any preceding
2854 local/generic
2855 load/store/load
2856 atomic/store
2857 atomic/atomicrmw.
2858 - Must happen before
2859 the following
2860 atomicrmw.
2861 - Ensures that all
2862 memory operations
2863 to local have
2864 completed before
2865 performing the
2866 atomicrmw that is
2867 being released.
2868
2869 2. buffer/global_atomic
2870 atomicrmw acq_rel - workgroup - local 1. ds_atomic
2871 2. s_waitcnt lgkmcnt(0)
2872
2873 - If OpenCL, omit
2874 waitcnt.
2875 - Must happen before
2876 any following
2877 global/generic
2878 load/load
2879 atomic/store/store
2880 atomic/atomicrmw.
2881 - Ensures any
2882 following global
2883 data read is no
2884 older than the load
2885 atomic value being
2886 acquired.
2887
2888 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2889
2890 - If OpenCL, omit
2891 waitcnt.
2892 - Must happen after
2893 any preceding
2894 local/generic
2895 load/store/load
2896 atomic/store
2897 atomic/atomicrmw.
2898 - Must happen before
2899 the following
2900 atomicrmw.
2901 - Ensures that all
2902 memory operations
2903 to local have
2904 completed before
2905 performing the
2906 atomicrmw that is
2907 being released.
2908
2909 2. flat_atomic
2910 3. s_waitcnt lgkmcnt(0)
2911
2912 - If OpenCL, omit
2913 waitcnt.
2914 - Must happen before
2915 any following
2916 global/generic
2917 load/load
2918 atomic/store/store
2919 atomic/atomicrmw.
2920 - Ensures any
2921 following global
2922 data read is no
2923 older than the load
2924 atomic value being
2925 acquired.
2926 atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
2927 - system lgkmcnt(0)
2928
2929 - If OpenCL, omit
2930 lgkmcnt(0).
2931 - Could be split into
2932 separate s_waitcnt
2933 vmcnt(0) and
2934 s_waitcnt
2935 lgkmcnt(0) to allow
2936 them to be
2937 independently moved
2938 according to the
2939 following rules.
2940 - s_waitcnt vmcnt(0)
2941 must happen after
2942 any preceding
2943 global/generic
2944 load/store/load
2945 atomic/store
2946 atomic/atomicrmw.
2947 - s_waitcnt lgkmcnt(0)
2948 must happen after
2949 any preceding
2950 local/generic
2951 load/store/load
2952 atomic/store
2953 atomic/atomicrmw.
2954 - Must happen before
2955 the following
2956 atomicrmw.
2957 - Ensures that all
2958 memory operations
2959 to global have
2960 completed before
2961 performing the
2962 atomicrmw that is
2963 being released.
2964
2965 2. buffer/global_atomic
2966 3. s_waitcnt vmcnt(0)
2967
2968 - Must happen before
2969 following
2970 buffer_wbinvl1_vol.
2971 - Ensures the
2972 atomicrmw has
2973 completed before
2974 invalidating the
2975 cache.
2976
2977 4. buffer_wbinvl1_vol
2978
2979 - Must happen before
2980 any following
2981 global/generic
2982 load/load
2983 atomic/atomicrmw.
2984 - Ensures that
2985 following loads
2986 will not see stale
2987 global data.
2988
2989 atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
2990 - system lgkmcnt(0)
2991
2992 - If OpenCL, omit
2993 lgkmcnt(0).
2994 - Could be split into
2995 separate s_waitcnt
2996 vmcnt(0) and
2997 s_waitcnt
2998 lgkmcnt(0) to allow
2999 them to be
3000 independently moved
3001 according to the
3002 following rules.
3003 - s_waitcnt vmcnt(0)
3004 must happen after
3005 any preceding
3006 global/generic
3007 load/store/load
3008 atomic/store
3009 atomic/atomicrmw.
3010 - s_waitcnt lgkmcnt(0)
3011 must happen after
3012 any preceding
3013 local/generic
3014 load/store/load
3015 atomic/store
3016 atomic/atomicrmw.
3017 - Must happen before
3018 the following
3019 atomicrmw.
3020 - Ensures that all
3021 memory operations
3022 to global have
3023 completed before
3024 performing the
3025 atomicrmw that is
3026 being released.
3027
3028 2. flat_atomic
3029 3. s_waitcnt vmcnt(0) &
3030 lgkmcnt(0)
3031
3032 - If OpenCL, omit
3033 lgkmcnt(0).
3034 - Must happen before
3035 following
3036 buffer_wbinvl1_vol.
3037 - Ensures the
3038 atomicrmw has
3039 completed before
3040 invalidating the
3041 cache.
3042
3043 4. buffer_wbinvl1_vol
3044
3045 - Must happen before
3046 any following
3047 global/generic
3048 load/load
3049 atomic/atomicrmw.
3050 - Ensures that
3051 following loads
3052 will not see stale
3053 global data.
3054
3055 fence acq_rel - singlethread *none* *none*
3056 - wavefront
3057 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3058
3059 - If OpenCL and
3060 address space is
3061 not generic, omit
3062 waitcnt. However,
3063 since LLVM
3064 currently has no
3065 address space on
3066 the fence need to
3067 conservatively
3068 always generate
3069 (see comment for
3070 previous fence).
3071 - Must happen after
3072 any preceding
3073 local/generic
3074 load/load
3075 atomic/store/store
3076 atomic/atomicrmw.
3077 - Must happen before
3078 any following
3079 global/generic
3080 load/load
3081 atomic/store/store
3082 atomic/atomicrmw.
3083 - Ensures that all
3084 memory operations
3085 to local have
3086 completed before
3087 performing any
3088 following global
3089 memory operations.
3090 - Ensures that the
3091 preceding
3092 local/generic load
3093 atomic/atomicrmw
3094 with an equal or
3095 wider sync scope
3096 and memory ordering
3097 stronger than
3098 unordered (this is
3099 termed the
3100 fence-paired-atomic)
3101 has completed
3102 before following
3103 global memory
3104 operations. This
3105 satisfies the
3106 requirements of
3107 acquire.
3108 - Ensures that all
3109 previous memory
3110 operations have
3111 completed before a
3112 following
3113 local/generic store
3114 atomic/atomicrmw
3115 with an equal or
3116 wider sync scope
3117 and memory ordering
3118 stronger than
3119 unordered (this is
3120 termed the
3121 fence-paired-atomic).
3122 This satisfies the
3123 requirements of
3124 release.
3125
3126 fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) &
3127 - system lgkmcnt(0)
3128
3129 - If OpenCL and
3130 address space is
3131 not generic, omit
3132 lgkmcnt(0).
3133 However, since LLVM
3134 currently has no
3135 address space on
3136 the fence need to
3137 conservatively
3138 always generate
3139 (see comment for
3140 previous fence).
3141 - Could be split into
3142 separate s_waitcnt
3143 vmcnt(0) and
3144 s_waitcnt
3145 lgkmcnt(0) to allow
3146 them to be
3147 independently moved
3148 according to the
3149 following rules.
3150 - s_waitcnt vmcnt(0)
3151 must happen after
3152 any preceding
3153 global/generic
3154 load/store/load
3155 atomic/store
3156 atomic/atomicrmw.
3157 - s_waitcnt lgkmcnt(0)
3158 must happen after
3159 any preceding
3160 local/generic
3161 load/store/load
3162 atomic/store
3163 atomic/atomicrmw.
3164 - Must happen before
3165 the following
3166 buffer_wbinvl1_vol.
3167 - Ensures that the
3168 preceding
3169 global/local/generic
3170 load
3171 atomic/atomicrmw
3172 with an equal or
3173 wider sync scope
3174 and memory ordering
3175 stronger than
3176 unordered (this is
3177 termed the
3178 fence-paired-atomic)
3179 has completed
3180 before invalidating
3181 the cache. This
3182 satisfies the
3183 requirements of
3184 acquire.
3185 - Ensures that all
3186 previous memory
3187 operations have
3188 completed before a
3189 following
3190 global/local/generic
3191 store
3192 atomic/atomicrmw
3193 with an equal or
3194 wider sync scope
3195 and memory ordering
3196 stronger than
3197 unordered (this is
3198 termed the
3199 fence-paired-atomic).
3200 This satisfies the
3201 requirements of
3202 release.
3203
3204 2. buffer_wbinvl1_vol
3205
3206 - Must happen before
3207 any following
3208 global/generic
3209 load/load
3210 atomic/store/store
3211 atomic/atomicrmw.
3212 - Ensures that
3213 following loads
3214 will not see stale
3215 global data. This
3216 satisfies the
3217 requirements of
3218 acquire.
3219
3220 **Sequential Consistent Atomic**
3221 ---------------------------------------------------------------------------
3222 load atomic seq_cst - singlethread - global *Same as corresponding
3223 - wavefront - local load atomic acquire*.
3224 - workgroup - generic
3225 load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
3226 - system - local
3227 - generic - Must happen after
3228 preceding
3229 global/generic load
3230 atomic/store
3231 atomic/atomicrmw
3232 with memory
3233 ordering of seq_cst
3234 and with equal or
3235 wider sync scope.
3236 (Note that seq_cst
3237 fences have their
3238 own s_waitcnt
3239 vmcnt(0) and so do
3240 not need to be
3241 considered.)
3242 - Ensures any
3243 preceding
3244 sequential
3245 consistent global
3246 memory instructions
3247 have completed
3248 before executing
3249 this sequentially
3250 consistent
3251 instruction. This
3252 prevents reordering
3253 a seq_cst store
3254 followed by a
3255 seq_cst load (Note
3256 that seq_cst is
3257 stronger than
3258 acquire/release as
3259 the reordering of
3260 load acquire
3261 followed by a store
3262 release is
3263 prevented by the
3264 waitcnt vmcnt(0) of
3265 the release, but
3266 there is nothing
3267 preventing a store
3268 release followed by
3269 load acquire from
3270 competing out of
3271 order.)
3272
3273 2. *Following
3274 instructions same as
3275 corresponding load
3276 atomic acquire*.
3277
3278 store atomic seq_cst - singlethread - global *Same as corresponding
3279 - wavefront - local store atomic release*.
3280 - workgroup - generic
3281 store atomic seq_cst - agent - global *Same as corresponding
3282 - system - generic store atomic release*.
3283 atomicrmw seq_cst - singlethread - global *Same as corresponding
3284 - wavefront - local atomicrmw acq_rel*.
3285 - workgroup - generic
3286 atomicrmw seq_cst - agent - global *Same as corresponding
3287 - system - generic atomicrmw acq_rel*.
3288 fence seq_cst - singlethread *none* *Same as corresponding
3289 - wavefront fence acq_rel*.
3290 - workgroup
3291 - agent
3292 - system
3293 ============ ============ ============== ========== =======================
3294
3295The memory order also adds the single thread optimization constrains defined in
3296table
3297:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3298
3299 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3300 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3301
3302 ============ ==============================================================
3303 LLVM Memory Optimization Constraints
3304 Ordering
3305 ============ ==============================================================
3306 unordered *none*
3307 monotonic *none*
3308 acquire - If a load atomic/atomicrmw then no following load/load
3309 atomic/store/ store atomic/atomicrmw/fence instruction can
3310 be moved before the acquire.
3311 - If a fence then same as load atomic, plus no preceding
3312 associated fence-paired-atomic can be moved after the fence.
3313 release - If a store atomic/atomicrmw then no preceeding load/load
3314 atomic/store/ store atomic/atomicrmw/fence instruction can
3315 be moved after the release.
3316 - If a fence then same as store atomic, plus no following
3317 associated fence-paired-atomic can be moved before the
3318 fence.
3319 acq_rel Same constraints as both acquire and release.
3320 seq_cst - If a load atomic then same constraints as acquire, plus no
3321 preceding sequentially consistent load atomic/store
3322 atomic/atomicrmw/fence instruction can be moved after the
3323 seq_cst.
3324 - If a store atomic then the same constraints as release, plus
3325 no following sequentially consistent load atomic/store
3326 atomic/atomicrmw/fence instruction can be moved before the
3327 seq_cst.
3328 - If an atomicrmw/fence then same constraints as acq_rel.
3329 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003330
Wei Ding16289cf2017-02-21 18:48:01 +00003331Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003332~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003333
Tony Tyef16a45e2017-06-06 20:31:59 +00003334For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3335(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3336the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003337
Tony Tyef16a45e2017-06-06 20:31:59 +00003338 .. table:: AMDGPU Trap Handler for AMDHSA OS
3339 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003340
Tony Tyef16a45e2017-06-06 20:31:59 +00003341 =================== =============== =============== =======================
3342 Usage Code Sequence Trap Handler Description
3343 Inputs
3344 =================== =============== =============== =======================
3345 reserved ``s_trap 0x00`` Reserved by hardware.
3346 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3347 ``queue_ptr`` ``debugtrap``
3348 ``VGPR0``: intrinsic (not
3349 ``arg`` implemented).
3350 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3351 ``queue_ptr`` terminated and its
3352 associated queue put
3353 into the error state.
3354 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3355 ``queue_ptr`` installed handled
3356 same as ``llvm.trap``.
3357 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3358 breakpoints.
3359 debugger ``s_trap 0x08`` Reserved for debugger.
3360 debugger ``s_trap 0xfe`` Reserved for debugger.
3361 debugger ``s_trap 0xff`` Reserved for debugger.
3362 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003363
Tony Tyef16a45e2017-06-06 20:31:59 +00003364Non-AMDHSA
3365----------
3366
3367Trap Handler ABI
3368~~~~~~~~~~~~~~~~
3369
3370For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3371not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3372instructions are handled as follows:
3373
3374 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3375 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3376
3377 =============== =============== ===========================================
3378 Usage Code Sequence Description
3379 =============== =============== ===========================================
3380 llvm.trap s_endpgm Causes wavefront to be terminated.
3381 llvm.debugtrap *none* Compiler warning given that there is no
3382 trap handler installed.
3383 =============== =============== ===========================================
3384
3385Source Languages
3386================
3387
3388.. _amdgpu-opencl:
3389
3390OpenCL
3391------
3392
3393When generating code for the OpenCL language the target triple environment
3394should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3395
3396When the language is OpenCL the following differences occur:
3397
33981. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
33992. The AMDGPU backend adds additional arguments to the kernel.
34003. Additional metadata is generated (:ref:`amdgpu-code-object-metadata`).
3401
3402.. TODO
3403 Specify what affect this has. Hidden arguments added. Additional metadata
3404 generated.
3405
3406.. _amdgpu-hcc:
3407
3408HCC
3409---
3410
3411When generating code for the OpenCL language the target triple environment
3412should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3413
3414When the language is OpenCL the following differences occur:
3415
34161. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3417
3418.. TODO
3419 Specify what affect this has.
Tom Stellard3ec09e62016-04-06 01:29:19 +00003420
Tom Stellard45bb48e2015-06-13 03:28:10 +00003421Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00003422---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00003423
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003424AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef16a45e2017-06-06 20:31:59 +00003425It supports AMDGCN GFX6-GFX8.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003426
Tony Tyef16a45e2017-06-06 20:31:59 +00003427This section describes general syntax for instructions and operands. For more
3428information about instructions, their semantics and supported combinations of
3429operands, refer to one of instruction set architecture manuals
3430[AMD-Souther-Islands]_ [AMD-Sea-Islands]_ [AMD-Volcanic-Islands]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003431
Tony Tyef16a45e2017-06-06 20:31:59 +00003432An instruction has the following syntax (register operands are normally
3433comma-separated while extra operands are space-separated):
Tom Stellard45bb48e2015-06-13 03:28:10 +00003434
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003435*<opcode> <register_operand0>, ... <extra_operand0> ...*
Tom Stellard45bb48e2015-06-13 03:28:10 +00003436
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003437Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00003438~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00003439
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003440The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003441
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003442* SGPR registers: s0, ... or s[0], ...
3443* VGPR registers: v0, ... or v[0], ...
3444* TTMP registers: ttmp0, ... or ttmp[0], ...
3445* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3446* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3447* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3448* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3449* Register index expressions: v[2*2], s[1-1:2-1]
3450* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00003451
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003452The following extra operands are supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003453
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003454* offset, offset0, offset1
3455* idxen, offen bits
3456* glc, slc, tfe bits
3457* waitcnt: integer or combination of counter values
3458* VOP3 modifiers:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003459
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003460 - abs (\| \|), neg (\-)
Tom Stellard45bb48e2015-06-13 03:28:10 +00003461
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003462* DPP modifiers:
3463
3464 - row_shl, row_shr, row_ror, row_rol
3465 - row_mirror, row_half_mirror, row_bcast
3466 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3467 - row_mask, bank_mask, bound_ctrl
3468
3469* SDWA modifiers:
3470
3471 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3472 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3473 - abs, neg, sext
3474
Tony Tyef16a45e2017-06-06 20:31:59 +00003475Instruction Examples
3476~~~~~~~~~~~~~~~~~~~~
3477
3478DS
3479~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003480
3481.. code-block:: nasm
3482
3483 ds_add_u32 v2, v4 offset:16
3484 ds_write_src2_b64 v2 offset0:4 offset1:8
3485 ds_cmpst_f32 v2, v4, v6
3486 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3487
3488
3489For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3490
Tony Tyef16a45e2017-06-06 20:31:59 +00003491FLAT
3492++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003493
3494.. code-block:: nasm
3495
3496 flat_load_dword v1, v[3:4]
3497 flat_store_dwordx3 v[3:4], v[5:7]
3498 flat_atomic_swap v1, v[3:4], v5 glc
3499 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3500 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3501
3502For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3503
Tony Tyef16a45e2017-06-06 20:31:59 +00003504MUBUF
3505+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003506
3507.. code-block:: nasm
3508
3509 buffer_load_dword v1, off, s[4:7], s1
3510 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3511 buffer_store_format_xy v[1:2], off, s[4:7], s1
3512 buffer_wbinvl1
3513 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3514
3515For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3516
Tony Tyef16a45e2017-06-06 20:31:59 +00003517SMRD/SMEM
3518+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003519
3520.. code-block:: nasm
3521
3522 s_load_dword s1, s[2:3], 0xfc
3523 s_load_dwordx8 s[8:15], s[2:3], s4
3524 s_load_dwordx16 s[88:103], s[2:3], s4
3525 s_dcache_inv_vol
3526 s_memtime s[4:5]
3527
3528For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3529
Tony Tyef16a45e2017-06-06 20:31:59 +00003530SOP1
3531++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003532
3533.. code-block:: nasm
3534
3535 s_mov_b32 s1, s2
3536 s_mov_b64 s[0:1], 0x80000000
3537 s_cmov_b32 s1, 200
3538 s_wqm_b64 s[2:3], s[4:5]
3539 s_bcnt0_i32_b64 s1, s[2:3]
3540 s_swappc_b64 s[2:3], s[4:5]
3541 s_cbranch_join s[4:5]
3542
3543For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3544
Tony Tyef16a45e2017-06-06 20:31:59 +00003545SOP2
3546++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003547
3548.. code-block:: nasm
3549
3550 s_add_u32 s1, s2, s3
3551 s_and_b64 s[2:3], s[4:5], s[6:7]
3552 s_cselect_b32 s1, s2, s3
3553 s_andn2_b32 s2, s4, s6
3554 s_lshr_b64 s[2:3], s[4:5], s6
3555 s_ashr_i32 s2, s4, s6
3556 s_bfm_b64 s[2:3], s4, s6
3557 s_bfe_i64 s[2:3], s[4:5], s6
3558 s_cbranch_g_fork s[4:5], s[6:7]
3559
3560For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3561
Tony Tyef16a45e2017-06-06 20:31:59 +00003562SOPC
3563++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003564
3565.. code-block:: nasm
3566
3567 s_cmp_eq_i32 s1, s2
3568 s_bitcmp1_b32 s1, s2
3569 s_bitcmp0_b64 s[2:3], s4
3570 s_setvskip s3, s5
3571
3572For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3573
Tony Tyef16a45e2017-06-06 20:31:59 +00003574SOPP
3575++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003576
3577.. code-block:: nasm
3578
3579 s_barrier
3580 s_nop 2
3581 s_endpgm
3582 s_waitcnt 0 ; Wait for all counters to be 0
3583 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3584 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3585 s_sethalt 9
3586 s_sleep 10
3587 s_sendmsg 0x1
3588 s_sendmsg sendmsg(MSG_INTERRUPT)
3589 s_trap 1
3590
3591For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3592
3593Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00003594of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00003595range or acceptable values.
3596
Tony Tyef16a45e2017-06-06 20:31:59 +00003597VALU
3598++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00003599
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003600For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3601the assembler will automatically use optimal encoding based on its operands.
3602To force specific encoding, one can add a suffix to the opcode of the instruction:
3603
3604* _e32 for 32-bit VOP1/VOP2/VOPC
3605* _e64 for 64-bit VOP3
3606* _dpp for VOP_DPP
3607* _sdwa for VOP_SDWA
3608
3609VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003610
3611.. code-block:: nasm
3612
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003613 v_mov_b32 v1, v2
3614 v_mov_b32_e32 v1, v2
3615 v_nop
3616 v_cvt_f64_i32_e32 v[1:2], v2
3617 v_floor_f32_e32 v1, v2
3618 v_bfrev_b32_e32 v1, v2
3619 v_add_f32_e32 v1, v2, v3
3620 v_mul_i32_i24_e64 v1, v2, 3
3621 v_mul_i32_i24_e32 v1, -3, v3
3622 v_mul_i32_i24_e32 v1, -100, v3
3623 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3624 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00003625
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003626VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003627
3628.. code-block:: nasm
3629
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003630 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3631 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3632 v_mov_b32 v0, v0 wave_shl:1
3633 v_mov_b32 v0, v0 row_mirror
3634 v_mov_b32 v0, v0 row_bcast:31
3635 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3636 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3637 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00003638
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003639VOP_SDWA examples:
3640
3641.. code-block:: nasm
3642
3643 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
3644 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
3645 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
3646 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
3647 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
3648
3649For full list of supported instructions, refer to "Vector ALU instructions".
3650
3651HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00003652~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003653
3654AMDGPU ABI defines auxiliary data in output code object. In assembly source,
3655one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00003656
3657.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00003658+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00003659
3660*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003661object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00003662
3663.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00003664+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3665
Tom Stellard347ac792015-06-26 21:15:07 +00003666
3667*major*, *minor*, and *stepping* are all integers that describe the instruction
3668set architecture (ISA) version of the assembly program.
3669
3670*vendor* and *arch* are quoted strings. *vendor* should always be equal to
3671"AMD" and *arch* should always be equal to "AMDGPU".
3672
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003673By default, the assembler will derive the ISA version, *vendor*, and *arch*
3674from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00003675
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003676.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00003677+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003678
3679This directives specifies that the symbol with given name is a kernel entry point
3680(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00003681
3682.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00003683++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00003684
3685This directive marks the beginning of a list of key / value pairs that are used
3686to specify the amd_kernel_code_t object that will be emitted by the assembler.
3687The list must be terminated by the *.end_amd_kernel_code_t* directive. For
3688any amd_kernel_code_t values that are unspecified a default value will be
3689used. The default value for all keys is 0, with the following exceptions:
3690
3691- *kernel_code_version_major* defaults to 1.
3692- *machine_kind* defaults to 1.
3693- *machine_version_major*, *machine_version_minor*, and
3694 *machine_version_stepping* are derived from the value of the -mcpu option
3695 that is passed to the assembler.
3696- *kernel_code_entry_byte_offset* defaults to 256.
3697- *wavefront_size* defaults to 6.
3698- *kernarg_segment_alignment*, *group_segment_alignment*, and
3699 *private_segment_alignment* default to 4. Note that alignments are specified
3700 as a power of two, so a value of **n** means an alignment of 2^ **n**.
3701
3702The *.amd_kernel_code_t* directive must be placed immediately after the
3703function label and before any instructions.
3704
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003705For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
3706comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00003707
3708Here is an example of a minimal amd_kernel_code_t specification:
3709
Aaron Ballman887ad0e2016-07-19 17:46:55 +00003710.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00003711
3712 .hsa_code_object_version 1,0
3713 .hsa_code_object_isa
3714
Tom Stellardb8a91bb2016-02-22 18:36:00 +00003715 .hsatext
3716 .globl hello_world
3717 .p2align 8
3718 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00003719
3720 hello_world:
3721
3722 .amd_kernel_code_t
3723 enable_sgpr_kernarg_segment_ptr = 1
3724 is_ptr64 = 1
3725 compute_pgm_rsrc1_vgprs = 0
3726 compute_pgm_rsrc1_sgprs = 0
3727 compute_pgm_rsrc2_user_sgpr = 2
3728 kernarg_segment_byte_size = 8
3729 wavefront_sgpr_count = 2
3730 workitem_vgpr_count = 3
3731 .end_amd_kernel_code_t
3732
3733 s_load_dwordx2 s[0:1], s[0:1] 0x0
3734 v_mov_b32 v0, 3.14159
3735 s_waitcnt lgkmcnt(0)
3736 v_mov_b32 v1, s0
3737 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00003738 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00003739 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00003740 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00003741 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00003742
3743Additional Documentation
3744========================
3745
3746.. [AMD-R6xx] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
3747.. [AMD-R7xx] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
3748.. [AMD-Evergreen] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
3749.. [AMD-Cayman-Trinity] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
3750.. [AMD-Souther-Islands] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
3751.. [AMD-Sea-Islands] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
3752.. [AMD-Volcanic-Islands] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
3753.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
3754.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
3755.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
3756.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
3757.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
3758.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
3759.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
3760.. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
3761.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
3762.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
3763.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__