blob: 7062d75d92ec6e0750d65ad8221f6881764d3092 [file] [log] [blame]
Tony Tyef16a45e2017-06-06 20:31:59 +00001=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
26 .. table:: AMDGPU Target Triples
27 :name: amdgpu-target-triples-table
28
29 ============ ======== ========= ===========
30 Architecture Vendor OS Environment
31 ============ ======== ========= ===========
32 r600 amd <empty> <empty>
33 amdgcn amd <empty> <empty>
34 amdgcn amd amdhsa <empty>
35 amdgcn amd amdhsa opencl
36 amdgcn amd amdhsa amdgizcl
37 amdgcn amd amdhsa amdgiz
38 amdgcn amd amdhsa hcc
39 ============ ======== ========= ===========
40
41``r600-amd--``
42 Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
43 the MESA runtime.
44
45``amdgcn-amd--``
Tony Tye46d35762017-08-15 20:47:41 +000046 Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
Tony Tyef16a45e2017-06-06 20:31:59 +000047 the MESA runtime.
48
49``amdgcn-amd-amdhsa-``
50 Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
51 compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
52
53``amdgcn-amd-amdhsa-opencl``
54 Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
55 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
56 :ref:`amdgpu-opencl`.
57
58``amdgcn-amd-amdhsa-amdgizcl``
59 Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
60 is used (see :ref:`amdgpu-address-spaces`).
61
62``amdgcn-amd-amdhsa-amdgiz``
63 Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
64 used (see :ref:`amdgpu-address-spaces`).
65
66``amdgcn-amd-amdhsa-hcc``
67 Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
68 executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
69 :ref:`amdgpu-hcc`.
70
71.. _amdgpu-processors:
72
73Processors
74----------
75
76Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
77names from both the *Processor* and *Alternative Processor* can be used.
78
79 .. table:: AMDGPU Processors
80 :name: amdgpu-processors-table
81
82 ========== =========== ============ ===== ======= ==================
83 Processor Alternative Target dGPU/ Runtime Example
84 Processor Triple APU Support Products
85 Architecture
86 ========== =========== ============ ===== ======= ==================
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000087 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
Tony Tyef16a45e2017-06-06 20:31:59 +000088 --------------------------------------------------------------------
89 r600 r600 dGPU
90 r630 r600 dGPU
91 rs880 r600 dGPU
92 rv670 r600 dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000093 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
Tony Tyef16a45e2017-06-06 20:31:59 +000094 --------------------------------------------------------------------
95 rv710 r600 dGPU
96 rv730 r600 dGPU
97 rv770 r600 dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000098 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
Tony Tyef16a45e2017-06-06 20:31:59 +000099 --------------------------------------------------------------------
100 cedar r600 dGPU
101 redwood r600 dGPU
102 sumo r600 dGPU
103 juniper r600 dGPU
104 cypress r600 dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000105 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
Tony Tyef16a45e2017-06-06 20:31:59 +0000106 --------------------------------------------------------------------
107 barts r600 dGPU
108 turks r600 dGPU
109 caicos r600 dGPU
110 cayman r600 dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000111 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
Tony Tyef16a45e2017-06-06 20:31:59 +0000112 --------------------------------------------------------------------
Konstantin Zhuravlyov6cbcb272017-08-08 04:28:31 +0000113 gfx600 - tahiti amdgcn dGPU
Tony Tyef16a45e2017-06-06 20:31:59 +0000114 gfx601 - pitcairn amdgcn dGPU
115 - verde
116 - oland
117 - hainan
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000118 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
Tony Tyef16a45e2017-06-06 20:31:59 +0000119 --------------------------------------------------------------------
120 gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
121 - Radeon HD 8770
122 - R7 260
123 - R7 260X
124 \ - kaveri amdgcn APU - A6-7000
125 - A6 Pro-7050B
126 - A8-7100
127 - A8 Pro-7150B
128 - A10-7300
129 - A10 Pro-7350B
130 - FX-7500
131 - A8-7200P
132 - A10-7400P
133 - FX-7600P
134 gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
135 - FirePro W9100
136 - FirePro S9150
137 - FirePro S9170
Tony Tye1fd77b22017-06-08 01:47:25 +0000138 gfx702 dGPU ROCm - Radeon R9 290
Tony Tyef16a45e2017-06-06 20:31:59 +0000139 - Radeon R9 290x
140 - Radeon R390
Tony Tye1fd77b22017-06-08 01:47:25 +0000141 - Radeon R390x
142 gfx703 - kabini amdgcn APU - E1-2100
Tony Tyef16a45e2017-06-06 20:31:59 +0000143 - mullins - E1-2200
144 - E1-2500
145 - E2-3000
146 - E2-3800
147 - A4-5000
148 - A4-5100
149 - A6-5200
150 - A4 Pro-3340B
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
Tony Tyef16a45e2017-06-06 20:31:59 +0000152 --------------------------------------------------------------------
153 gfx800 - iceland amdgcn dGPU - FirePro S7150
154 - FirePro S7100
155 - FirePro W7100
156 - Radeon R285
157 - Radeon R9 380
158 - Radeon R9 385
159 - Mobile FirePro
160 M7170
161 gfx801 - carrizo amdgcn APU - A6-8500P
162 - Pro A6-8500B
163 - A8-8600P
164 - Pro A8-8600B
165 - FX-8800P
166 - Pro A12-8800B
167 \ amdgcn APU ROCm - A10-8700P
168 - Pro A10-8700B
169 - A10-8780P
170 \ amdgcn APU - A10-9600P
171 - A10-9630P
172 - A12-9700P
173 - A12-9730P
174 - FX-9800P
175 - FX-9830P
176 \ amdgcn APU - E2-9010
177 - A6-9210
178 - A9-9410
179 gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
180 gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
181 - Radeon R9 Fury
182 - Radeon R9 FuryX
183 - Radeon Pro Duo
184 - FirePro S9300x2
Tony Tye46d35762017-08-15 20:47:41 +0000185 - Radeon Instinct MI8
Tony Tyef16a45e2017-06-06 20:31:59 +0000186 \ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
187 - Radeon RX 480
Tony Tye46d35762017-08-15 20:47:41 +0000188 - Radeon Instinct MI6
Tony Tyef16a45e2017-06-06 20:31:59 +0000189 \ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
190 gfx804 amdgcn dGPU Same as gfx803
191 gfx810 - stoney amdgcn APU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000192 **GCN GFX9** [AMD-GCN-GFX9]_
Tony Tyef16a45e2017-06-06 20:31:59 +0000193 --------------------------------------------------------------------
Tony Tye46d35762017-08-15 20:47:41 +0000194 gfx900 amdgcn dGPU - Radeon Vega
195 Frontier Edition
196 - Radeon RX Vega 56
197 - Radeon RX Vega 64
198 - Radeon RX Vega 64
199 Liquid
200 - Radeon Instinct MI25
Tony Tyef16a45e2017-06-06 20:31:59 +0000201 gfx901 amdgcn dGPU ROCm Same as gfx900
202 except XNACK is
203 enabled
204 gfx902 amdgcn APU *TBA*
205
206 .. TODO
207 Add product
208 names.
209 gfx903 amdgcn APU Same as gfx902
210 except XNACK is
211 enabled
212 ========== =========== ============ ===== ======= ==================
213
214.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000215
216Address Spaces
217--------------
218
Tony Tyef16a45e2017-06-06 20:31:59 +0000219The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000220
Tony Tyef16a45e2017-06-06 20:31:59 +0000221The memory space names used in the table, aside from the region memory space, is
222from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000223
Tony Tyef16a45e2017-06-06 20:31:59 +0000224LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000225
Tony Tyef16a45e2017-06-06 20:31:59 +0000226 .. table:: Address Space Mapping
227 :name: amdgpu-address-space-mapping-table
228
229 ================== ================= ================= ================= =================
230 LLVM Address Space Memory Space
231 ------------------ -----------------------------------------------------------------------
232 \ Current Default amdgiz/amdgizcl hcc Future Default
233 ================== ================= ================= ================= =================
234 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
235 1 Global Global Global Global
236 2 Constant Constant Constant Region (GDS)
237 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
238 4 Generic (Flat) Region (GDS) Region (GDS) Constant
239 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
240 ================== ================= ================= ================= =================
241
242Current Default
243 This is the current default address space mapping used for all languages
244 except hcc. This will shortly be deprecated.
245
246amdgiz/amdgizcl
247 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
248 is specified as the target triple environment value.
249
250hcc
251 This is the current address space mapping used when ``hcc`` is specified as
252 the target triple environment value.This will shortly be deprecated.
253
254Future Default
255 This will shortly be the only address space mapping for all languages using
256 AMDGPU backend.
257
258.. _amdgpu-memory-scopes:
259
260Memory Scopes
261-------------
262
263This section provides LLVM memory synchronization scopes supported by the AMDGPU
264backend memory model when the target triple OS is ``amdhsa`` (see
265:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
266
267The memory model supported is based on the HSA memory model [HSA]_ which is
268based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
269relation is transitive over the synchonizes-with relation independent of scope,
270and synchonizes-with allows the memory scope instances to be inclusive (see
271table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
272
273This is different to the OpenCL [OpenCL]_ memory model which does not have scope
274inclusion and requires the memory scopes to exactly match. However, this
275is conservatively correct for OpenCL.
276
277 .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
278 :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
279
280 ================ ==========================================================
281 LLVM Sync Scope Description
282 ================ ==========================================================
283 *none* The default: ``system``.
284
285 Synchronizes with, and participates in modification and
286 seq_cst total orderings with, other operations (except
287 image operations) for all address spaces (except private,
288 or generic that accesses private) provided the other
289 operation's sync scope is:
290
291 - ``system``.
292 - ``agent`` and executed by a thread on the same agent.
293 - ``workgroup`` and executed by a thread in the same
294 workgroup.
295 - ``wavefront`` and executed by a thread in the same
296 wavefront.
297
298 ``agent`` Synchronizes with, and participates in modification and
299 seq_cst total orderings with, other operations (except
300 image operations) for all address spaces (except private,
301 or generic that accesses private) provided the other
302 operation's sync scope is:
303
304 - ``system`` or ``agent`` and executed by a thread on the
305 same agent.
306 - ``workgroup`` and executed by a thread in the same
307 workgroup.
308 - ``wavefront`` and executed by a thread in the same
309 wavefront.
310
311 ``workgroup`` Synchronizes with, and participates in modification and
312 seq_cst total orderings with, other operations (except
313 image operations) for all address spaces (except private,
314 or generic that accesses private) provided the other
315 operation's sync scope is:
316
317 - ``system``, ``agent`` or ``workgroup`` and executed by a
318 thread in the same workgroup.
319 - ``wavefront`` and executed by a thread in the same
320 wavefront.
321
322 ``wavefront`` Synchronizes with, and participates in modification and
323 seq_cst total orderings with, other operations (except
324 image operations) for all address spaces (except private,
325 or generic that accesses private) provided the other
326 operation's sync scope is:
327
328 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
329 and executed by a thread in the same wavefront.
330
331 ``singlethread`` Only synchronizes with, and participates in modification
332 and seq_cst total orderings with, other operations (except
333 image operations) running in the same thread for all
334 address spaces (for example, in signal handlers).
335 ================ ==========================================================
336
337AMDGPU Intrinsics
338-----------------
339
340The AMDGPU backend implements the following intrinsics.
341
342*This section is WIP.*
343
344.. TODO
345 List AMDGPU intrinsics
346
347Code Object
348===========
349
350The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
351can be linked by ``lld`` to produce a standard ELF shared code object which can
352be loaded and executed on an AMDGPU target.
353
354Header
355------
356
357The AMDGPU backend uses the following ELF header:
358
359 .. table:: AMDGPU ELF Header
360 :name: amdgpu-elf-header-table
361
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000362 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000363 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000364 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000365 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
366 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Konstantin Zhuravlyov36963522017-10-03 21:18:03 +0000367 ``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``,
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000368 ``ELFOSABI_AMDGPU_PAL`` or
369 ``ELFOSABI_AMDGPU_MESA3D``
Konstantin Zhuravlyov36963522017-10-03 21:18:03 +0000370 ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``,
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000371 ``ELFABIVERSION_AMDGPU_PAL`` or
372 ``ELFABIVERSION_AMDGPU_MESA3D``
Tony Tyef16a45e2017-06-06 20:31:59 +0000373 ``e_type`` ``ET_REL`` or ``ET_DYN``
374 ``e_machine`` ``EM_AMDGPU``
375 ``e_entry`` 0
376 ``e_flags`` 0
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000377 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000378
379..
380
381 .. table:: AMDGPU ELF Header Enumeration Values
382 :name: amdgpu-elf-header-enumeration-values-table
383
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000384 =============================== =====
385 Name Value
386 =============================== =====
387 ``EM_AMDGPU`` 224
388 ``ELFOSABI_AMDGPU_HSA`` 64
389 ``ELFOSABI_AMDGPU_PAL`` 65
390 ``ELFOSABI_AMDGPU_MESA3D`` 66
391 ``ELFABIVERSION_AMDGPU_HSA`` 1
392 ``ELFABIVERSION_AMDGPU_PAL`` 0
393 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
394 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000395
396``e_ident[EI_CLASS]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000397 The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64
398 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000399
400``e_ident[EI_DATA]``
401 All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
402
403``e_ident[EI_OSABI]``
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000404 One of the following AMD GPU architecture specific OS ABIs:
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000405
406 * ``ELFOSABI_AMDGPU_HSA`` is used to specify that the code object conforms to
407 the AMD HSA runtime ABI [HSA]_.
408
409 * ``ELFOSABI_AMDGPU_PAL`` is used to specify that the code object conforms to
410 the AMD PAL runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000411
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000412 * ``ELFOSABI_AMDGPU_MESA3D`` is used to specify that the code object conforms
413 to the AMD MESA runtime ABI.
414
Tony Tyef16a45e2017-06-06 20:31:59 +0000415``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000416 The ABI version of the AMD GPU architecture specific OS ABI to which the code
417 object conforms:
418
419 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
420 runtime ABI.
421
422 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
423 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000424
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000425 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
426 runtime ABI.
427
Tony Tyef16a45e2017-06-06 20:31:59 +0000428``e_type``
429 Can be one of the following values:
430
431
432 ``ET_REL``
433 The type produced by the AMD GPU backend compiler as it is relocatable code
434 object.
435
436 ``ET_DYN``
437 The type produced by the linker as it is a shared code object.
438
439 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
440
441``e_machine``
442 The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
443 architecture family. The specific member is specified in the
444 ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
445 :ref:`amdgpu-note-records`).
446
447``e_entry``
448 The entry point is 0 as the entry points for individual kernels must be
449 selected in order to invoke them through AQL packets.
450
451``e_flags``
452 The value is 0 as no flags are used.
453
454Sections
455--------
456
457An AMDGPU target ELF code object has the standard ELF sections which include:
458
459 .. table:: AMDGPU ELF Sections
460 :name: amdgpu-elf-sections-table
461
462 ================== ================ =================================
463 Name Type Attributes
464 ================== ================ =================================
465 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
466 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
467 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
468 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
469 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
470 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
471 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
472 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
473 ``.note`` ``SHT_NOTE`` *none*
474 ``.rela``\ *name* ``SHT_RELA`` *none*
475 ``.rela.dyn`` ``SHT_RELA`` *none*
476 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
477 ``.shstrtab`` ``SHT_STRTAB`` *none*
478 ``.strtab`` ``SHT_STRTAB`` *none*
479 ``.symtab`` ``SHT_SYMTAB`` *none*
480 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
481 ================== ================ =================================
482
483These sections have their standard meanings (see [ELF]_) and are only generated
484if needed.
485
486``.debug``\ *\**
487 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
488 DWARF produced by the AMDGPU backend.
489
Tony Tye46d35762017-08-15 20:47:41 +0000490``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000491 The standard sections used by a dynamic loader.
492
493``.note``
494 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
495 backend.
496
497``.rela``\ *name*, ``.rela.dyn``
498 For relocatable code objects, *name* is the name of the section that the
499 relocation records apply. For example, ``.rela.text`` is the section name for
500 relocation records associated with the ``.text`` section.
501
502 For linked shared code objects, ``.rela.dyn`` contains all the relocation
503 records from each of the relocatable code object's ``.rela``\ *name* sections.
504
505 See :ref:`amdgpu-relocation-records` for the relocation records supported by
506 the AMDGPU backend.
507
508``.text``
509 The executable machine code for the kernels and functions they call. Generated
510 as position independent code. See :ref:`amdgpu-code-conventions` for
511 information on conventions used in the isa generation.
512
513.. _amdgpu-note-records:
514
515Note Records
516------------
517
518As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
519the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
520minimal zero byte padding must be generated to ensure the ``desc`` field size is
521a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
522be at least 4 to indicate at least 8 byte alignment.
523
524The AMDGPU backend code object uses the following ELF note records in the
525``.note`` section. The *Description* column specifies the layout of the note
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000526record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tyef16a45e2017-06-06 20:31:59 +0000527variable size strings have a corresponding ``*_size`` field that specifies the
528number of bytes, including the terminating null character, in the string. The
529string(s) come immediately after the preceding fields.
530
531Additional note records can be present.
532
533 .. table:: AMDGPU ELF Note Records
534 :name: amdgpu-elf-note-records-table
535
Tony Tye46d35762017-08-15 20:47:41 +0000536 ===== ============================== ======================================
537 Name Type Description
538 ===== ============================== ======================================
539 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
540 "AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
541 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000542
543..
544
545 .. table:: AMDGPU ELF Note Record Enumeration Values
546 :name: amdgpu-elf-note-record-enumeration-values-table
547
Tony Tye46d35762017-08-15 20:47:41 +0000548 ============================== =====
549 Name Value
550 ============================== =====
551 *reserved* 0-9
552 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
553 ``NT_AMD_AMDGPU_ISA`` 11
554 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000555
556``NT_AMD_AMDGPU_ISA``
557 Specifies the instruction set architecture used by the machine code contained
558 in the code object.
559
560 This note record is required for code objects containing machine code for
561 processors matching the ``amdgcn`` architecture in table
562 :ref:`amdgpu-processors`.
563
564 The null terminated string has the following syntax:
565
566 *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
567
568 where:
569
570 *architecture*
571 The architecture from table :ref:`amdgpu-target-triples-table`.
572
573 This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
574 :ref:`amdgpu-target-triples`).
575
576 *vendor*
577 The vendor from table :ref:`amdgpu-target-triples-table`.
578
579 For the AMDGPU backend this is always ``amd``.
580
581 *os*
582 The OS from table :ref:`amdgpu-target-triples-table`.
583
584 *environment*
585 An environment from table :ref:`amdgpu-target-triples-table`, or blank if
586 the environment has no affect on the execution of the code object.
587
588 For the AMDGPU backend this is currently always blank.
589 *processor*
590 The processor from table :ref:`amdgpu-processors-table`.
591
592 For example:
593
594 ``amdgcn-amd-amdhsa--gfx901``
595
Tony Tye46d35762017-08-15 20:47:41 +0000596``NT_AMD_AMDGPU_HSA_METADATA``
597 Specifies extensible metadata associated with the code objects executed on HSA
598 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
599 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
600 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
601 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000602
Tony Tye46d35762017-08-15 20:47:41 +0000603.. _amdgpu-symbols:
604
605Symbols
606-------
607
608Symbols include the following:
609
610 .. table:: AMDGPU ELF Symbols
611 :name: amdgpu-elf-symbols-table
612
613 ===================== ============== ============= ==================
614 Name Type Section Description
615 ===================== ============== ============= ==================
616 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
617 - ``.rodata``
618 - ``.bss``
619 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
620 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
621 ===================== ============== ============= ==================
622
623Global variable
624 Global variables both used and defined by the compilation unit.
625
626 If the symbol is defined in the compilation unit then it is allocated in the
627 appropriate section according to if it has initialized data or is readonly.
628
629 If the symbol is external then its section is ``STN_UNDEF`` and the loader
630 will resolve relocations using the definition provided by another code object
631 or explicitly defined by the runtime.
632
633 All global symbols, whether defined in the compilation unit or external, are
634 accessed by the machine code indirectly through a GOT table entry. This
635 allows them to be preemptable. The GOT table is only supported when the target
636 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000637
638 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000639 Add description of linked shared object symbols. Seems undefined symbols
640 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000641
Tony Tye46d35762017-08-15 20:47:41 +0000642Kernel descriptor
643 Every HSA kernel has an associated kernel descriptor. It is the address of the
644 kernel descriptor that is used in the AQL dispatch packet used to invoke the
645 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
646 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
647
648Kernel entry point
649 Every HSA kernel also has a symbol for its machine code entry point.
650
651.. _amdgpu-relocation-records:
652
653Relocation Records
654------------------
655
656AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
657relocatable fields are:
658
659``word32``
660 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
661 alignment. These values use the same byte order as other word values in the
662 AMD GPU architecture.
663
664``word64``
665 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
666 alignment. These values use the same byte order as other word values in the
667 AMD GPU architecture.
668
669Following notations are used for specifying relocation calculations:
670
671**A**
672 Represents the addend used to compute the value of the relocatable field.
673
674**G**
675 Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000676 entry's symbol will reside during execution.
Tony Tye46d35762017-08-15 20:47:41 +0000677
678**GOT**
679 Represents the address of the global offset table.
680
681**P**
682 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
683 of the storage unit being relocated (computed using ``r_offset``).
684
685**S**
686 Represents the value of the symbol whose index resides in the relocation
Tony Tyed2884302017-10-16 20:44:29 +0000687 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
688
689**B**
690 Represents the base address of a loaded executable or shared object which is
691 the difference between the ELF address and the actual load address. Relocations
692 using this are only valid in executable or shared objects.
Tony Tye46d35762017-08-15 20:47:41 +0000693
694The following relocation types are supported:
695
696 .. table:: AMDGPU ELF Relocation Records
697 :name: amdgpu-elf-relocation-records-table
698
699 ========================== ===== ========== ==============================
700 Relocation Type Value Field Calculation
701 ========================== ===== ========== ==============================
702 ``R_AMDGPU_NONE`` 0 *none* *none*
703 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
704 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
705 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
706 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
707 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
708 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
709 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
710 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
711 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
712 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
713 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
Tony Tyed2884302017-10-16 20:44:29 +0000714 *reserved* 12
715 ``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
Tony Tye46d35762017-08-15 20:47:41 +0000716 ========================== ===== ========== ==============================
717
718.. _amdgpu-dwarf:
719
720DWARF
721-----
722
723Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
724information that maps the code object executable code and data to the source
725language constructs. It can be used by tools such as debuggers and profilers.
726
727Address Space Mapping
728~~~~~~~~~~~~~~~~~~~~~
729
730The following address space mapping is used:
731
732 .. table:: AMDGPU DWARF Address Space Mapping
733 :name: amdgpu-dwarf-address-space-mapping-table
734
735 =================== =================
736 DWARF Address Space Memory Space
737 =================== =================
738 1 Private (Scratch)
739 2 Local (group/LDS)
740 *omitted* Global
741 *omitted* Constant
742 *omitted* Generic (Flat)
743 *not supported* Region (GDS)
744 =================== =================
745
746See :ref:`amdgpu-address-spaces` for information on the memory space terminology
747used in the table.
748
749An ``address_class`` attribute is generated on pointer type DIEs to specify the
750DWARF address space of the value of the pointer when it is in the *private* or
751*local* address space. Otherwise the attribute is omitted.
752
753An ``XDEREF`` operation is generated in location list expressions for variables
754that are allocated in the *private* and *local* address space. Otherwise no
755``XDREF`` is omitted.
756
757Register Mapping
758~~~~~~~~~~~~~~~~
759
760*This section is WIP.*
761
762.. TODO
763 Define DWARF register enumeration.
764
765 If want to present a wavefront state then should expose vector registers as
766 64 wide (rather than per work-item view that LLVM uses). Either as separate
767 registers, or a 64x4 byte single register. In either case use a new LANE op
768 (akin to XDREF) to select the current lane usage in a location
769 expression. This would also allow scalar register spilling to vector register
770 lanes to be expressed (currently no debug information is being generated for
771 spilling). If choose a wide single register approach then use LANE in
772 conjunction with PIECE operation to select the dword part of the register for
773 the current lane. If the separate register approach then use LANE to select
774 the register.
775
776Source Text
777~~~~~~~~~~~
778
779*This section is WIP.*
780
781.. TODO
782 DWARF extension to include runtime generated source text.
783
784.. _amdgpu-code-conventions:
785
786Code Conventions
787================
788
789This section provides code conventions used for each supported target triple OS
790(see :ref:`amdgpu-target-triples`).
791
792AMDHSA
793------
794
795This section provides code conventions used when the target triple OS is
796``amdhsa`` (see :ref:`amdgpu-target-triples`).
797
798.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000799
800Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000801~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000802
Tony Tye46d35762017-08-15 20:47:41 +0000803The code object metadata specifies extensible metadata associated with the code
804objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
805[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
806(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
807``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
808information necessary to support the ROCM kernel queries. For example, the
809segment sizes needed in a dispatch packet. In addition, a high level language
810runtime may require other information to be included. For example, the AMD
811OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000812
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000813The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000814:doc:`YamlIO`).
815
Tony Tye46d35762017-08-15 20:47:41 +0000816.. TODO
817 Is the string null terminated? It probably should not if YAML allows it to
818 contain null characters, otherwise it should be.
819
Tony Tyef16a45e2017-06-06 20:31:59 +0000820The metadata is represented as a single YAML document comprised of the mapping
821defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
822referenced tables.
823
824For boolean values, the string values of ``false`` and ``true`` are used for
825false and true respectively.
826
827Additional information can be added to the mappings. To avoid conflicts, any
828non-AMD key names should be prefixed by "*vendor-name*.".
829
830 .. table:: AMDHSA Code Object Metadata Mapping
831 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
832
833 ========== ============== ========= =======================================
834 String Key Value Type Required? Description
835 ========== ============== ========= =======================================
836 "Version" sequence of Required - The first integer is the major
837 2 integers version. Currently 1.
838 - The second integer is the minor
839 version. Currently 0.
840 "Printf" sequence of Each string is encoded information
841 strings about a printf function call. The
842 encoded information is organized as
843 fields separated by colon (':'):
844
845 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
846
847 where:
848
849 ``ID``
850 A 32 bit integer as a unique id for
851 each printf function call
852
853 ``N``
854 A 32 bit integer equal to the number
855 of arguments of printf function call
856 minus 1
857
858 ``S[i]`` (where i = 0, 1, ... , N-1)
859 32 bit integers for the size in bytes
860 of the i-th FormatString argument of
861 the printf function call
862
863 FormatString
864 The format string passed to the
865 printf function call.
866 "Kernels" sequence of Required Sequence of the mappings for each
867 mapping kernel in the code object. See
868 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
869 for the definition of the mapping.
870 ========== ============== ========= =======================================
871
872..
873
874 .. table:: AMDHSA Code Object Kernel Metadata Mapping
875 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
876
877 ================= ============== ========= ================================
878 String Key Value Type Required? Description
879 ================= ============== ========= ================================
880 "Name" string Required Source name of the kernel.
881 "SymbolName" string Required Name of the kernel
882 descriptor ELF symbol.
883 "Language" string Source language of the kernel.
884 Values include:
885
886 - "OpenCL C"
887 - "OpenCL C++"
888 - "HCC"
889 - "OpenMP"
890
891 "LanguageVersion" sequence of - The first integer is the major
892 2 integers version.
893 - The second integer is the
894 minor version.
895 "Attrs" mapping Mapping of kernel attributes.
896 See
897 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
898 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +0000899 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +0000900 mapping kernel arguments. See
901 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
902 for the definition of the mapping.
903 "CodeProps" mapping Mapping of properties related to
904 the kernel code. See
905 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
906 for the mapping definition.
907 "DebugProps" mapping Mapping of properties related to
908 the kernel debugging. See
909 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
910 for the mapping definition.
911 ================= ============== ========= ================================
912
913..
914
915 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
916 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
917
918 =================== ============== ========= ==============================
919 String Key Value Type Required? Description
920 =================== ============== ========= ==============================
921 "ReqdWorkGroupSize" sequence of The dispatch work-group size
922 3 integers X, Y, Z must correspond to the
923 specified values.
924
925 Corresponds to the OpenCL
926 ``reqd_work_group_size``
927 attribute.
928 "WorkGroupSizeHint" sequence of The dispatch work-group size
929 3 integers X, Y, Z is likely to be the
930 specified values.
931
932 Corresponds to the OpenCL
933 ``work_group_size_hint``
934 attribute.
935 "VecTypeHint" string The name of a scalar or vector
936 type.
937
938 Corresponds to the OpenCL
939 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +0000940
941 "RuntimeHandle" string The external symbol name
942 associated with a kernel.
943 OpenCL runtime allocates a
944 global buffer for the symbol
945 and saves the kernel's address
946 to it, which is used for
947 device side enqueueing. Only
948 available for device side
949 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +0000950 =================== ============== ========= ==============================
951
952..
953
954 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
955 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
956
957 ================= ============== ========= ================================
958 String Key Value Type Required? Description
959 ================= ============== ========= ================================
960 "Name" string Kernel argument name.
961 "TypeName" string Kernel argument type name.
962 "Size" integer Required Kernel argument size in bytes.
963 "Align" integer Required Kernel argument alignment in
964 bytes. Must be a power of two.
965 "ValueKind" string Required Kernel argument kind that
966 specifies how to set up the
967 corresponding argument.
968 Values include:
969
970 "ByValue"
971 The argument is copied
972 directly into the kernarg.
973
974 "GlobalBuffer"
975 A global address space pointer
976 to the buffer data is passed
977 in the kernarg.
978
979 "DynamicSharedPointer"
980 A group address space pointer
981 to dynamically allocated LDS
982 is passed in the kernarg.
983
984 "Sampler"
985 A global address space
986 pointer to a S# is passed in
987 the kernarg.
988
989 "Image"
990 A global address space
991 pointer to a T# is passed in
992 the kernarg.
993
994 "Pipe"
995 A global address space pointer
996 to an OpenCL pipe is passed in
997 the kernarg.
998
999 "Queue"
1000 A global address space pointer
1001 to an OpenCL device enqueue
1002 queue is passed in the
1003 kernarg.
1004
1005 "HiddenGlobalOffsetX"
1006 The OpenCL grid dispatch
1007 global offset for the X
1008 dimension is passed in the
1009 kernarg.
1010
1011 "HiddenGlobalOffsetY"
1012 The OpenCL grid dispatch
1013 global offset for the Y
1014 dimension is passed in the
1015 kernarg.
1016
1017 "HiddenGlobalOffsetZ"
1018 The OpenCL grid dispatch
1019 global offset for the Z
1020 dimension is passed in the
1021 kernarg.
1022
1023 "HiddenNone"
1024 An argument that is not used
1025 by the kernel. Space needs to
1026 be left for it, but it does
1027 not need to be set up.
1028
1029 "HiddenPrintfBuffer"
1030 A global address space pointer
1031 to the runtime printf buffer
1032 is passed in kernarg.
1033
1034 "HiddenDefaultQueue"
1035 A global address space pointer
1036 to the OpenCL device enqueue
1037 queue that should be used by
1038 the kernel by default is
1039 passed in the kernarg.
1040
1041 "HiddenCompletionAction"
Yaxun Liuc928f2a2017-10-30 14:30:28 +00001042 A global address space pointer
1043 to help link enqueued kernels into
1044 the ancestor tree for determining
1045 when the parent kernel has finished.
Tony Tyef16a45e2017-06-06 20:31:59 +00001046
1047 "ValueType" string Required Kernel argument value type. Only
1048 present if "ValueKind" is
1049 "ByValue". For vector data
1050 types, the value is for the
1051 element type. Values include:
1052
1053 - "Struct"
1054 - "I8"
1055 - "U8"
1056 - "I16"
1057 - "U16"
1058 - "F16"
1059 - "I32"
1060 - "U32"
1061 - "F32"
1062 - "I64"
1063 - "U64"
1064 - "F64"
1065
1066 .. TODO
1067 How can it be determined if a
1068 vector type, and what size
1069 vector?
1070 "PointeeAlign" integer Alignment in bytes of pointee
1071 type for pointer type kernel
1072 argument. Must be a power
1073 of 2. Only present if
1074 "ValueKind" is
1075 "DynamicSharedPointer".
1076 "AddrSpaceQual" string Kernel argument address space
1077 qualifier. Only present if
1078 "ValueKind" is "GlobalBuffer" or
1079 "DynamicSharedPointer". Values
1080 are:
1081
1082 - "Private"
1083 - "Global"
1084 - "Constant"
1085 - "Local"
1086 - "Generic"
1087 - "Region"
1088
1089 .. TODO
1090 Is GlobalBuffer only Global
1091 or Constant? Is
1092 DynamicSharedPointer always
1093 Local? Can HCC allow Generic?
1094 How can Private or Region
1095 ever happen?
1096 "AccQual" string Kernel argument access
1097 qualifier. Only present if
1098 "ValueKind" is "Image" or
1099 "Pipe". Values
1100 are:
1101
1102 - "ReadOnly"
1103 - "WriteOnly"
1104 - "ReadWrite"
1105
1106 .. TODO
1107 Does this apply to
1108 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001109 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001110 performed by the kernel on the
1111 kernel argument. Only present if
1112 "ValueKind" is "GlobalBuffer",
1113 "Image", or "Pipe". This may be
1114 more restrictive than indicated
1115 by "AccQual" to reflect what the
1116 kernel actual does. If not
1117 present then the runtime must
1118 assume what is implied by
1119 "AccQual" and "IsConst". Values
1120 are:
1121
1122 - "ReadOnly"
1123 - "WriteOnly"
1124 - "ReadWrite"
1125
1126 "IsConst" boolean Indicates if the kernel argument
1127 is const qualified. Only present
1128 if "ValueKind" is
1129 "GlobalBuffer".
1130
1131 "IsRestrict" boolean Indicates if the kernel argument
1132 is restrict qualified. Only
1133 present if "ValueKind" is
1134 "GlobalBuffer".
1135
1136 "IsVolatile" boolean Indicates if the kernel argument
1137 is volatile qualified. Only
1138 present if "ValueKind" is
1139 "GlobalBuffer".
1140
1141 "IsPipe" boolean Indicates if the kernel argument
1142 is pipe qualified. Only present
1143 if "ValueKind" is "Pipe".
1144
1145 .. TODO
1146 Can GlobalBuffer be pipe
1147 qualified?
1148 ================= ============== ========= ================================
1149
1150..
1151
1152 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1153 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1154
1155 ============================ ============== ========= =====================
1156 String Key Value Type Required? Description
1157 ============================ ============== ========= =====================
1158 "KernargSegmentSize" integer Required The size in bytes of
1159 the kernarg segment
1160 that holds the values
1161 of the arguments to
1162 the kernel.
1163 "GroupSegmentFixedSize" integer Required The amount of group
1164 segment memory
1165 required by a
1166 work-group in
1167 bytes. This does not
1168 include any
1169 dynamically allocated
1170 group segment memory
1171 that may be added
1172 when the kernel is
1173 dispatched.
1174 "PrivateSegmentFixedSize" integer Required The amount of fixed
1175 private address space
1176 memory required for a
1177 work-item in
1178 bytes. If
1179 IsDynamicCallstack
1180 is 1 then additional
1181 space must be added
1182 to this value for the
1183 call stack.
1184 "KernargSegmentAlign" integer Required The maximum byte
1185 alignment of
1186 arguments in the
1187 kernarg segment. Must
1188 be a power of 2.
1189 "WavefrontSize" integer Required Wavefront size. Must
1190 be a power of 2.
1191 "NumSGPRs" integer Number of scalar
1192 registers used by a
1193 wavefront for
1194 GFX6-GFX9. This
1195 includes the special
1196 SGPRs for VCC, Flat
1197 Scratch (GFX7-GFX9)
1198 and XNACK (for
1199 GFX8-GFX9). It does
1200 not include the 16
1201 SGPR added if a trap
1202 handler is
1203 enabled. It is not
1204 rounded up to the
1205 allocation
1206 granularity.
1207 "NumVGPRs" integer Number of vector
1208 registers used by
1209 each work-item for
1210 GFX6-GFX9
Konstantin Zhuravlyov8d5e9e12017-10-18 17:31:09 +00001211 "MaxFlatWorkGroupSize" integer Maximum flat
Tony Tyef16a45e2017-06-06 20:31:59 +00001212 work-group size
1213 supported by the
1214 kernel in work-items.
1215 "IsDynamicCallStack" boolean Indicates if the
1216 generated machine
1217 code is using a
1218 dynamically sized
1219 call stack.
1220 "IsXNACKEnabled" boolean Indicates if the
1221 generated machine
1222 code is capable of
1223 supporting XNACK.
1224 ============================ ============== ========= =====================
1225
1226..
1227
1228 .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
1229 :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
1230
1231 =================================== ============== ========= ==============
1232 String Key Value Type Required? Description
1233 =================================== ============== ========= ==============
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001234 "DebuggerABIVersion" sequence of
1235 2 integers
Tony Tyef16a45e2017-06-06 20:31:59 +00001236 "ReservedNumVGPRs" integer
1237 "ReservedFirstVGPR" integer
1238 "PrivateSegmentBufferSGPR" integer
1239 "WavefrontPrivateSegmentOffsetSGPR" integer
1240 =================================== ============== ========= ==============
1241
1242.. TODO
Tony Tye6baa6d22017-10-18 22:16:55 +00001243 Plan to remove the debug properties metadata.
Tony Tyef16a45e2017-06-06 20:31:59 +00001244
Tony Tyef16a45e2017-06-06 20:31:59 +00001245Kernel Dispatch
1246~~~~~~~~~~~~~~~
1247
1248The HSA architected queuing language (AQL) defines a user space memory interface
1249that can be used to control the dispatch of kernels, in an agent independent
1250way. An agent can have zero or more AQL queues created for it using the ROCm
1251runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1252*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1253mechanics and packet layouts.
1254
1255The packet processor of a kernel agent is responsible for detecting and
1256dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1257packet processor is implemented by the hardware command processor (CP),
1258asynchronous dispatch controller (ADC) and shader processor input controller
1259(SPI).
1260
1261The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1262mode driver to initialize and register the AQL queue with CP.
1263
1264To dispatch a kernel the following actions are performed. This can occur in the
1265CPU host program, or from an HSA kernel executing on a GPU.
1266
12671. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1268 executed is obtained.
12692. A pointer to the kernel descriptor (see
1270 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1271 obtained. It must be for a kernel that is contained in a code object that that
1272 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1273 associated.
12743. Space is allocated for the kernel arguments using the ROCm runtime allocator
1275 for a memory region with the kernarg property for the kernel agent that will
1276 execute the kernel. It must be at least 16 byte aligned.
12774. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00001278 allocation. The layout is defined in the *HSA Programmer's Language Reference*
Tony Tyef16a45e2017-06-06 20:31:59 +00001279 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1280 memory in the same way constant memory is accessed. (Note that the HSA
1281 specification allows an implementation to copy the kernel argument contents to
1282 another location that is accessed by the kernel.)
12835. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1284 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1285 packet. The packet must be set up, and the final write must use an atomic
1286 store release to set the packet kind to ensure the packet contents are
1287 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1288 notify the kernel agent that the AQL queue has been updated. These rules, and
1289 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1290 System Architecture Specification* [HSA]_.
12916. A kernel dispatch packet includes information about the actual dispatch,
1292 such as grid and work-group size, together with information from the code
1293 object about the kernel, such as segment sizes. The ROCm runtime queries on
1294 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001295 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000012967. CP executes micro-code and is responsible for detecting and setting up the
1297 GPU to execute the wavefronts of a kernel dispatch.
12988. CP ensures that when the a wavefront starts executing the kernel machine
1299 code, the scalar general purpose registers (SGPR) and vector general purpose
1300 registers (VGPR) are set up as required by the machine code. The required
1301 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1302 register state is defined in
1303 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13049. The prolog of the kernel machine code (see
1305 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1306 before continuing executing the machine code that corresponds to the kernel.
130710. When the kernel dispatch has completed execution, CP signals the completion
1308 signal specified in the kernel dispatch packet if not 0.
1309
1310.. _amdgpu-amdhsa-memory-spaces:
1311
1312Memory Spaces
1313~~~~~~~~~~~~~
1314
1315The memory space properties are:
1316
1317 .. table:: AMDHSA Memory Spaces
1318 :name: amdgpu-amdhsa-memory-spaces-table
1319
1320 ================= =========== ======== ======= ==================
1321 Memory Space Name HSA Segment Hardware Address NULL Value
1322 Name Name Size
1323 ================= =========== ======== ======= ==================
1324 Private private scratch 32 0x00000000
1325 Local group LDS 32 0xFFFFFFFF
1326 Global global global 64 0x0000000000000000
1327 Constant constant *same as 64 0x0000000000000000
1328 global*
1329 Generic flat flat 64 0x0000000000000000
1330 Region N/A GDS 32 *not implemented
1331 for AMDHSA*
1332 ================= =========== ======== ======= ==================
1333
1334The global and constant memory spaces both use global virtual addresses, which
1335are the same virtual address space used by the CPU. However, some virtual
1336addresses may only be accessible to the CPU, some only accessible by the GPU,
1337and some by both.
1338
1339Using the constant memory space indicates that the data will not change during
1340the execution of the kernel. This allows scalar read instructions to be
1341used. The vector and scalar L1 caches are invalidated of volatile data before
1342each kernel dispatch execution to allow constant memory to change values between
1343kernel dispatches.
1344
1345The local memory space uses the hardware Local Data Store (LDS) which is
1346automatically allocated when the hardware creates work-groups of wavefronts, and
1347freed when all the wavefronts of a work-group have terminated. The data store
1348(DS) instructions can be used to access it.
1349
1350The private memory space uses the hardware scratch memory support. If the kernel
1351uses scratch, then the hardware allocates memory that is accessed using
1352wavefront lane dword (4 byte) interleaving. The mapping used from private
1353address to physical address is:
1354
1355 ``wavefront-scratch-base +
1356 (private-address * wavefront-size * 4) +
1357 (wavefront-lane-id * 4)``
1358
1359There are different ways that the wavefront scratch base address is determined
1360by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1361memory can be accessed in an interleaved manner using buffer instruction with
1362the scratch buffer descriptor and per wave scratch offset, by the scratch
1363instructions, or by flat instructions. If each lane of a wavefront accesses the
1364same private address, the interleaving results in adjacent dwords being accessed
1365and hence requires fewer cache lines to be fetched. Multi-dword access is not
1366supported except by flat and scratch instructions in GFX9.
1367
1368The generic address space uses the hardware flat address support available in
1369GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1370local appertures), that are outside the range of addressible global memory, to
1371map from a flat address to a private or local address.
1372
1373FLAT instructions can take a flat address and access global, private (scratch)
1374and group (LDS) memory depending in if the address is within one of the
1375apperture ranges. Flat access to scratch requires hardware aperture setup and
1376setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1377access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1378(see :ref:`amdgpu-amdhsa-m0`).
1379
1380To convert between a segment address and a flat address the base address of the
1381appertures address can be used. For GFX7-GFX8 these are available in the
1382:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1383Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1384GFX9 the appature base addresses are directly available as inline constant
1385registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1386address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1387which makes it easier to convert from flat to segment or segment to flat.
1388
Tony Tye46d35762017-08-15 20:47:41 +00001389Image and Samplers
1390~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001391
1392Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1393hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1394HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1395enumeration values for the queries that are not trivially deducible from the S#
1396representation.
1397
1398HSA Signals
1399~~~~~~~~~~~
1400
Tony Tye46d35762017-08-15 20:47:41 +00001401HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1402structure allocated in memory accessible from both the CPU and GPU. The
1403structure is defined by the ROCm runtime and subject to change between releases
1404(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001405
1406.. _amdgpu-amdhsa-hsa-aql-queue:
1407
1408HSA AQL Queue
1409~~~~~~~~~~~~~
1410
Tony Tye46d35762017-08-15 20:47:41 +00001411The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001412between releases (see [AMD-ROCm-github]_). For some processors it contains
1413fields needed to implement certain language features such as the flat address
1414aperture bases. It also contains fields used by CP such as managing the
1415allocation of scratch memory.
1416
1417.. _amdgpu-amdhsa-kernel-descriptor:
1418
1419Kernel Descriptor
1420~~~~~~~~~~~~~~~~~
1421
1422A kernel descriptor consists of the information needed by CP to initiate the
1423execution of a kernel, including the entry point address of the machine code
1424that implements the kernel.
1425
1426Kernel Descriptor for GFX6-GFX9
1427+++++++++++++++++++++++++++++++
1428
1429CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1430
1431 .. table:: Kernel Descriptor for GFX6-GFX9
1432 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1433
Tony Tye6baa6d22017-10-18 22:16:55 +00001434 ======= ======= =============================== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001435 Bits Size Field Name Description
Tony Tye6baa6d22017-10-18 22:16:55 +00001436 ======= ======= =============================== ============================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001437 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tyef16a45e2017-06-06 20:31:59 +00001438 address space memory
1439 required for a work-group
1440 in bytes. This does not
1441 include any dynamically
1442 allocated local address
1443 space memory that may be
1444 added when the kernel is
1445 dispatched.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001446 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tyef16a45e2017-06-06 20:31:59 +00001447 private address space
1448 memory required for a
1449 work-item in bytes. If
1450 is_dynamic_callstack is 1
1451 then additional space must
1452 be added to this value for
1453 the call stack.
Konstantin Zhuravlyov8d5e9e12017-10-18 17:31:09 +00001454 95:64 4 bytes MaxFlatWorkGroupSize Maximum flat work-group
Tony Tyef16a45e2017-06-06 20:31:59 +00001455 size supported by the
1456 kernel in work-items.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001457 96 1 bit IsDynamicCallStack Indicates if the generated
Tony Tyef16a45e2017-06-06 20:31:59 +00001458 machine code is using a
1459 dynamically sized call
1460 stack.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001461 97 1 bit IsXNACKEnabled Indicates if the generated
Tony Tyef16a45e2017-06-06 20:31:59 +00001462 machine code is capable of
1463 suppoting XNACK.
Tony Tye6baa6d22017-10-18 22:16:55 +00001464 127:98 30 bits Reserved, must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001465 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tyef16a45e2017-06-06 20:31:59 +00001466 negative) from base
1467 address of kernel
1468 descriptor to kernel's
1469 entry point instruction
1470 which must be 256 byte
1471 aligned.
Tony Tye6baa6d22017-10-18 22:16:55 +00001472 383:192 24 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001473 bytes
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001474 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001475 program settings used by
1476 CP to set up
1477 ``COMPUTE_PGM_RSRC1``
1478 configuration
1479 register. See
Tony Tye6baa6d22017-10-18 22:16:55 +00001480 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001481 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001482 program settings used by
1483 CP to set up
1484 ``COMPUTE_PGM_RSRC2``
1485 configuration
1486 register. See
1487 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001488 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1489 SGPR user data registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001490 (see
1491 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1492
1493 The total number of SGPR
1494 user data registers
1495 requested must not exceed
1496 16 and match value in
1497 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1498 Any requests beyond 16
1499 will be ignored.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001500 449 1 bit EnableSGPRDispatchPtr *see above*
1501 450 1 bit EnableSGPRQueuePtr *see above*
1502 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1503 452 1 bit EnableSGPRDispatchID *see above*
1504 453 1 bit EnableSGPRFlatScratchInit *see above*
1505 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1506 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1507 should always be 0.
1508 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1509 should always be 0.
1510 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1511 should always be 0.
Tony Tye6baa6d22017-10-18 22:16:55 +00001512 463:458 6 bits Reserved, must be 0.
1513 511:464 6 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001514 bytes
1515 512 **Total size 64 bytes.**
Tony Tye6baa6d22017-10-18 22:16:55 +00001516 ======= ====================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001517
1518..
1519
1520 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001521 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tyef16a45e2017-06-06 20:31:59 +00001522
Tony Tye3b340612017-06-07 00:46:08 +00001523 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001524 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001525 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001526 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001527 used by each work-item,
1528 granularity is device
1529 specific:
1530
1531 GFX6-9
Tony Tye6baa6d22017-10-18 22:16:55 +00001532 - max_vgpr 1..256
1533 - roundup((max_vgpg + 1)
1534 / 4) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001535
1536 Used by CP to set up
1537 ``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001538 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001539 used by a wavefront,
1540 granularity is device
1541 specific:
1542
1543 GFX6-8
Tony Tye6baa6d22017-10-18 22:16:55 +00001544 - max_sgpr 1..112
1545 - roundup((max_sgpg + 1)
1546 / 8) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001547 GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001548 - max_sgpr 1..112
1549 - roundup((max_sgpg + 1)
1550 / 16) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001551
1552 Includes the special SGPRs
1553 for VCC, Flat Scratch (for
1554 GFX7 onwards) and XNACK
1555 (for GFX8 onwards). It does
1556 not include the 16 SGPR
1557 added if a trap handler is
1558 enabled.
1559
1560 Used by CP to set up
1561 ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001562 11:10 2 bits PRIORITY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001563
1564 Start executing wavefront
1565 at the specified priority.
1566
1567 CP is responsible for
1568 filling in
1569 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001570 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001571 with specified rounding
1572 mode for single (32
1573 bit) floating point
1574 precision floating point
1575 operations.
1576
1577 Floating point rounding
1578 mode values are defined in
1579 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1580
1581 Used by CP to set up
1582 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001583 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001584 with specified rounding
1585 denorm mode for half/double (16
1586 and 64 bit) floating point
1587 precision floating point
1588 operations.
1589
1590 Floating point rounding
1591 mode values are defined in
1592 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1593
1594 Used by CP to set up
1595 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001596 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001597 with specified denorm mode
1598 for single (32
1599 bit) floating point
1600 precision floating point
1601 operations.
1602
1603 Floating point denorm mode
1604 values are defined in
1605 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1606
1607 Used by CP to set up
1608 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001609 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001610 with specified denorm mode
1611 for half/double (16
1612 and 64 bit) floating point
1613 precision floating point
1614 operations.
1615
1616 Floating point denorm mode
1617 values are defined in
1618 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1619
1620 Used by CP to set up
1621 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001622 20 1 bit PRIV Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001623
1624 Start executing wavefront
1625 in privilege trap handler
1626 mode.
1627
1628 CP is responsible for
1629 filling in
1630 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001631 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001632 with DX10 clamp mode
1633 enabled. Used by the vector
Tony Tye6baa6d22017-10-18 22:16:55 +00001634 ALU to force DX10 style
Tony Tyef16a45e2017-06-06 20:31:59 +00001635 treatment of NaN's (when
1636 set, clamp NaN to zero,
1637 otherwise pass NaN
1638 through).
1639
1640 Used by CP to set up
1641 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001642 22 1 bit DEBUG_MODE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001643
1644 Start executing wavefront
1645 in single step mode.
1646
1647 CP is responsible for
1648 filling in
1649 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001650 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001651 with IEEE mode
1652 enabled. Floating point
1653 opcodes that support
1654 exception flag gathering
1655 will quiet and propagate
1656 signaling-NaN inputs per
1657 IEEE 754-2008. Min_dx10 and
1658 max_dx10 become IEEE
1659 754-2008 compliant due to
1660 signaling-NaN propagation
1661 and quieting.
1662
1663 Used by CP to set up
1664 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001665 24 1 bit BULKY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001666
1667 Only one work-group allowed
1668 to execute on a compute
1669 unit.
1670
1671 CP is responsible for
1672 filling in
1673 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001674 25 1 bit CDBG_USER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001675
1676 Flag that can be used to
1677 control debugging code.
1678
1679 CP is responsible for
1680 filling in
1681 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye6baa6d22017-10-18 22:16:55 +00001682 26 1 bit FP16_OVFL GFX6-8
1683 Reserved, must be 0.
1684 GFX9
1685 Wavefront starts execution
1686 with specified fp16 overflow
1687 mode.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001688
Tony Tye6baa6d22017-10-18 22:16:55 +00001689 - If 0, fp16 overflow generates
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001690 +/-INF values.
Tony Tye6baa6d22017-10-18 22:16:55 +00001691 - If 1, fp16 overflow that is the
1692 result of an +/-INF input value
1693 or divide by 0 produces a +/-INF,
1694 otherwise clamps computed
1695 overflow to +/-MAX_FP16 as
1696 appropriate.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001697
1698 Used by CP to set up
1699 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye6baa6d22017-10-18 22:16:55 +00001700 31:27 5 bits Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001701 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001702 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001703
1704..
1705
1706 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1707 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1708
Tony Tye3b340612017-06-07 00:46:08 +00001709 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001710 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001711 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001712 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1713 _WAVE_OFFSET SGPR wave scratch offset
Tony Tyef16a45e2017-06-06 20:31:59 +00001714 system register (see
1715 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1716
1717 Used by CP to set up
1718 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001719 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tyef16a45e2017-06-06 20:31:59 +00001720 user data registers
1721 requested. This number must
1722 match the number of user
1723 data registers enabled.
1724
1725 Used by CP to set up
1726 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001727 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tyef16a45e2017-06-06 20:31:59 +00001728 TRAP instruction which
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00001729 requires a trap handler to
Tony Tyef16a45e2017-06-06 20:31:59 +00001730 be enabled.
1731
1732 CP sets
1733 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1734 if the runtime has
1735 installed a trap handler
1736 regardless of the setting
1737 of this field.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001738 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001739 system SGPR register for
1740 the work-group id in the X
1741 dimension (see
1742 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1743
1744 Used by CP to set up
1745 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001746 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001747 system SGPR register for
1748 the work-group id in the Y
1749 dimension (see
1750 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1751
1752 Used by CP to set up
1753 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001754 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001755 system SGPR register for
1756 the work-group id in the Z
1757 dimension (see
1758 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1759
1760 Used by CP to set up
1761 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001762 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001763 system SGPR register for
1764 work-group information (see
1765 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1766
1767 Used by CP to set up
1768 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001769 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001770 VGPR system registers used
1771 for the work-item ID.
1772 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1773 defines the values.
1774
1775 Used by CP to set up
1776 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001777 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001778
1779 Wavefront starts execution
1780 with address watch
1781 exceptions enabled which
1782 are generated when L1 has
1783 witnessed a thread access
1784 an *address of
1785 interest*.
1786
1787 CP is responsible for
1788 filling in the address
1789 watch bit in
1790 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1791 according to what the
1792 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001793 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001794
1795 Wavefront starts execution
1796 with memory violation
1797 exceptions exceptions
1798 enabled which are generated
1799 when a memory violation has
1800 occurred for this wave from
1801 L1 or LDS
1802 (write-to-read-only-memory,
1803 mis-aligned atomic, LDS
1804 address out of range,
1805 illegal address, etc.).
1806
1807 CP sets the memory
1808 violation bit in
1809 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1810 according to what the
1811 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001812 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001813
1814 CP uses the rounded value
1815 from the dispatch packet,
1816 not this value, as the
1817 dispatch may contain
1818 dynamically allocated group
1819 segment memory. CP writes
1820 directly to
1821 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1822
1823 Amount of group segment
1824 (LDS) to allocate for each
1825 work-group. Granularity is
1826 device specific:
1827
1828 GFX6:
1829 roundup(lds-size / (64 * 4))
1830 GFX7-GFX9:
1831 roundup(lds-size / (128 * 4))
1832
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001833 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1834 _INVALID_OPERATION with specified exceptions
Tony Tyef16a45e2017-06-06 20:31:59 +00001835 enabled.
1836
1837 Used by CP to set up
1838 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1839 (set from bits 0..6).
1840
1841 IEEE 754 FP Invalid
1842 Operation
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001843 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1844 _SOURCE input operands is a
Tony Tyef16a45e2017-06-06 20:31:59 +00001845 denormal number
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001846 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1847 _DIVISION_BY_ZERO Zero
1848 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1849 _OVERFLOW
1850 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1851 _UNDERFLOW
1852 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1853 _INEXACT
1854 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1855 _ZERO (rcp_iflag_f32 instruction
Tony Tyef16a45e2017-06-06 20:31:59 +00001856 only)
Tony Tye6baa6d22017-10-18 22:16:55 +00001857 31 1 bit Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001858 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001859 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001860
1861..
1862
1863 .. table:: Floating Point Rounding Mode Enumeration Values
1864 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1865
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001866 ====================================== ===== ==============================
1867 Enumeration Name Value Description
1868 ====================================== ===== ==============================
1869 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1870 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1871 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1872 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1873 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001874
1875..
1876
1877 .. table:: Floating Point Denorm Mode Enumeration Values
1878 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1879
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001880 ====================================== ===== ==============================
1881 Enumeration Name Value Description
1882 ====================================== ===== ==============================
1883 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1884 Denorms
1885 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1886 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1887 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1888 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001889
1890..
1891
1892 .. table:: System VGPR Work-Item ID Enumeration Values
1893 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1894
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001895 ======================================== ===== ============================
1896 Enumeration Name Value Description
1897 ======================================== ===== ============================
1898 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1899 ID.
1900 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1901 dimensions ID.
1902 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1903 dimensions ID.
1904 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1905 ======================================== ===== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001906
1907.. _amdgpu-amdhsa-initial-kernel-execution-state:
1908
1909Initial Kernel Execution State
1910~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1911
1912This section defines the register state that will be set up by the packet
1913processor prior to the start of execution of every wavefront. This is limited by
1914the constraints of the hardware controllers of CP/ADC/SPI.
1915
1916The order of the SGPR registers is defined, but the compiler can specify which
1917ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1918fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1919for enabled registers are dense starting at SGPR0: the first enabled register is
1920SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1921an SGPR number.
1922
1923The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1924all waves of the grid. It is possible to specify more than 16 User SGPRs using
1925the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1926initialized. These are then immediately followed by the System SGPRs that are
1927set up by ADC/SPI and can have different values for each wave of the grid
1928dispatch.
1929
1930SGPR register initial state is defined in
1931:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1932
1933 .. table:: SGPR Register Set Up Order
1934 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1935
1936 ========== ========================== ====== ==============================
1937 SGPR Order Name Number Description
1938 (kernel descriptor enable of
1939 field) SGPRs
1940 ========== ========================== ====== ==============================
1941 First Private Segment Buffer 4 V# that can be used, together
1942 (enable_sgpr_private with Scratch Wave Offset as an
1943 _segment_buffer) offset, to access the private
1944 memory space using a segment
1945 address.
1946
1947 CP uses the value provided by
1948 the runtime.
1949 then Dispatch Ptr 2 64 bit address of AQL dispatch
1950 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
1951 actually executing.
1952 then Queue Ptr 2 64 bit address of amd_queue_t
1953 (enable_sgpr_queue_ptr) object for AQL queue on which
1954 the dispatch packet was
1955 queued.
1956 then Kernarg Segment Ptr 2 64 bit address of Kernarg
1957 (enable_sgpr_kernarg segment. This is directly
1958 _segment_ptr) copied from the
1959 kernarg_address in the kernel
1960 dispatch packet.
1961
1962 Having CP load it once avoids
1963 loading it at the beginning of
1964 every wavefront.
1965 then Dispatch Id 2 64 bit Dispatch ID of the
1966 (enable_sgpr_dispatch_id) dispatch packet being
1967 executed.
1968 then Flat Scratch Init 2 This is 2 SGPRs:
1969 (enable_sgpr_flat_scratch
1970 _init) GFX6
1971 Not supported.
1972 GFX7-GFX8
1973 The first SGPR is a 32 bit
1974 byte offset from
1975 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1976 to per SPI base of memory
1977 for scratch for the queue
1978 executing the kernel
1979 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00001980 from the runtime. (The
1981 Scratch Segment Buffer base
1982 address is
1983 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1984 plus this offset.) The value
1985 of Scratch Wave Offset must
1986 be added to this offset by
1987 the kernel machine code,
1988 right shifted by 8, and
1989 moved to the FLAT_SCRATCH_HI
1990 SGPR register.
1991 FLAT_SCRATCH_HI corresponds
1992 to SGPRn-4 on GFX7, and
1993 SGPRn-6 on GFX8 (where SGPRn
1994 is the highest numbered SGPR
1995 allocated to the wave).
1996 FLAT_SCRATCH_HI is
1997 multiplied by 256 (as it is
1998 in units of 256 bytes) and
1999 added to
2000 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2001 to calculate the per wave
2002 FLAT SCRATCH BASE in flat
2003 memory instructions that
2004 access the scratch
2005 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002006
2007 The second SGPR is 32 bit
2008 byte size of a single
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002009 work-item's scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002010 usage. CP obtains this from
2011 the runtime, and it is
2012 always a multiple of DWORD.
2013 CP checks that the value in
2014 the kernel dispatch packet
2015 Private Segment Byte Size is
2016 not larger, and requests the
2017 runtime to increase the
2018 queue's scratch size if
2019 necessary. The kernel code
2020 must move it to
2021 FLAT_SCRATCH_LO which is
2022 SGPRn-3 on GFX7 and SGPRn-5
2023 on GFX8. FLAT_SCRATCH_LO is
2024 used as the FLAT SCRATCH
2025 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002026 instructions. Having CP load
2027 it once avoids loading it at
2028 the beginning of every
Tony Tye46d35762017-08-15 20:47:41 +00002029 wavefront. GFX9 This is the
2030 64 bit base address of the
2031 per SPI scratch backing
2032 memory managed by SPI for
2033 the queue executing the
2034 kernel dispatch. CP obtains
2035 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002036 divides it if there are
2037 multiple Shader Arrays each
2038 with its own SPI). The value
2039 of Scratch Wave Offset must
2040 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002041 machine code and the result
2042 moved to the FLAT_SCRATCH
2043 SGPR which is SGPRn-6 and
2044 SGPRn-5. It is used as the
2045 FLAT SCRATCH BASE in flat
2046 memory instructions. then
2047 Private Segment Size 1 The
2048 32 bit byte size of a
2049 (enable_sgpr_private single
2050 work-item's
2051 scratch_segment_size) memory
2052 allocation. This is the
2053 value from the kernel
2054 dispatch packet Private
2055 Segment Byte Size rounded up
2056 by CP to a multiple of
2057 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002058
2059 Having CP load it once avoids
2060 loading it at the beginning of
2061 every wavefront.
2062
2063 This is not used for
2064 GFX7-GFX8 since it is the same
2065 value as the second SGPR of
2066 Flat Scratch Init. However, it
2067 may be needed for GFX9 which
2068 changes the meaning of the
2069 Flat Scratch Init value.
2070 then Grid Work-Group Count X 1 32 bit count of the number of
2071 (enable_sgpr_grid work-groups in the X dimension
2072 _workgroup_count_X) for the grid being
2073 executed. Computed from the
2074 fields in the kernel dispatch
2075 packet as ((grid_size.x +
2076 workgroup_size.x - 1) /
2077 workgroup_size.x).
2078 then Grid Work-Group Count Y 1 32 bit count of the number of
2079 (enable_sgpr_grid work-groups in the Y dimension
2080 _workgroup_count_Y && for the grid being
2081 less than 16 previous executed. Computed from the
2082 SGPRs) fields in the kernel dispatch
2083 packet as ((grid_size.y +
2084 workgroup_size.y - 1) /
2085 workgroupSize.y).
2086
2087 Only initialized if <16
2088 previous SGPRs initialized.
2089 then Grid Work-Group Count Z 1 32 bit count of the number of
2090 (enable_sgpr_grid work-groups in the Z dimension
2091 _workgroup_count_Z && for the grid being
2092 less than 16 previous executed. Computed from the
2093 SGPRs) fields in the kernel dispatch
2094 packet as ((grid_size.z +
2095 workgroup_size.z - 1) /
2096 workgroupSize.z).
2097
2098 Only initialized if <16
2099 previous SGPRs initialized.
2100 then Work-Group Id X 1 32 bit work-group id in X
2101 (enable_sgpr_workgroup_id dimension of grid for
2102 _X) wavefront.
2103 then Work-Group Id Y 1 32 bit work-group id in Y
2104 (enable_sgpr_workgroup_id dimension of grid for
2105 _Y) wavefront.
2106 then Work-Group Id Z 1 32 bit work-group id in Z
2107 (enable_sgpr_workgroup_id dimension of grid for
2108 _Z) wavefront.
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002109 then Work-Group Info 1 {first_wave, 14'b0000,
Tony Tyef16a45e2017-06-06 20:31:59 +00002110 (enable_sgpr_workgroup ordered_append_term[10:0],
2111 _info) threadgroup_size_in_waves[5:0]}
2112 then Scratch Wave Offset 1 32 bit byte offset from base
2113 (enable_sgpr_private of scratch base of queue
2114 _segment_wave_offset) executing the kernel
2115 dispatch. Must be used as an
2116 offset with Private
2117 segment address when using
2118 Scratch Segment Buffer. It
2119 must be used to set up FLAT
2120 SCRATCH for flat addressing
2121 (see
2122 :ref:`amdgpu-amdhsa-flat-scratch`).
2123 ========== ========================== ====== ==============================
2124
2125The order of the VGPR registers is defined, but the compiler can specify which
2126ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2127fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2128for enabled registers are dense starting at VGPR0: the first enabled register is
2129VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2130VGPR number.
2131
2132VGPR register initial state is defined in
2133:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2134
2135 .. table:: VGPR Register Set Up Order
2136 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2137
2138 ========== ========================== ====== ==============================
2139 VGPR Order Name Number Description
2140 (kernel descriptor enable of
2141 field) VGPRs
2142 ========== ========================== ====== ==============================
2143 First Work-Item Id X 1 32 bit work item id in X
2144 (Always initialized) dimension of work-group for
2145 wavefront lane.
2146 then Work-Item Id Y 1 32 bit work item id in Y
2147 (enable_vgpr_workitem_id dimension of work-group for
2148 > 0) wavefront lane.
2149 then Work-Item Id Z 1 32 bit work item id in Z
2150 (enable_vgpr_workitem_id dimension of work-group for
2151 > 1) wavefront lane.
2152 ========== ========================== ====== ==============================
2153
2154The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2155
21561. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2157 registers.
21582. Work-group Id registers X, Y, Z are set by ADC which supports any
2159 combination including none.
21603. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2161 cannot included with the flat scratch init value which is per queue.
21624. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2163 or (X, Y, Z).
2164
2165Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2166value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2167
2168The global segment can be accessed either using buffer instructions (GFX6 which
2169has V# 64 bit address support), flat instructions (GFX7-9), or global
2170instructions (GFX9).
2171
2172If buffer operations are used then the compiler can generate a V# with the
2173following properties:
2174
2175* base address of 0
2176* no swizzle
2177* ATC: 1 if IOMMU present (such as APU)
2178* ptr64: 1
2179* MTYPE set to support memory coherence that matches the runtime (such as CC for
2180 APU and NC for dGPU).
2181
2182.. _amdgpu-amdhsa-kernel-prolog:
2183
2184Kernel Prolog
2185~~~~~~~~~~~~~
2186
2187.. _amdgpu-amdhsa-m0:
2188
2189M0
2190++
2191
2192GFX6-GFX8
2193 The M0 register must be initialized with a value at least the total LDS size
2194 if the kernel may access LDS via DS or flat operations. Total LDS size is
2195 available in dispatch packet. For M0, it is also possible to use maximum
2196 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2197 GFX7-GFX8).
2198GFX9
2199 The M0 register is not used for range checking LDS accesses and so does not
2200 need to be initialized in the prolog.
2201
2202.. _amdgpu-amdhsa-flat-scratch:
2203
2204Flat Scratch
2205++++++++++++
2206
2207If the kernel may use flat operations to access scratch memory, the prolog code
2208must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2209are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2210Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2211
2212GFX6
2213 Flat scratch is not supported.
2214
2215GFX7-8
2216 1. The low word of Flat Scratch Init is 32 bit byte offset from
2217 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2218 being managed by SPI for the queue executing the kernel dispatch. This is
2219 the same value used in the Scratch Segment Buffer V# base address. The
2220 prolog must add the value of Scratch Wave Offset to get the wave's byte
2221 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2222 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2223 by 8 before moving into FLAT_SCRATCH_LO.
2224 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2225 work-items scratch memory usage. This is directly loaded from the kernel
2226 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2227 DWORD. Having CP load it once avoids loading it at the beginning of every
2228 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2229 SIZE.
2230GFX9
2231 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2232 memory being managed by SPI for the queue executing the kernel dispatch. The
2233 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2234 pair for use as the flat scratch base in flat memory instructions.
2235
2236.. _amdgpu-amdhsa-memory-model:
2237
2238Memory Model
2239~~~~~~~~~~~~
2240
2241This section describes the mapping of LLVM memory model onto AMDGPU machine code
2242(see :ref:`memmodel`). *The implementation is WIP.*
2243
2244.. TODO
2245 Update when implementation complete.
2246
Tony Tyef16a45e2017-06-06 20:31:59 +00002247The AMDGPU backend supports the memory synchronization scopes specified in
2248:ref:`amdgpu-memory-scopes`.
2249
2250The code sequences used to implement the memory model are defined in table
2251:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2252
2253The sequences specify the order of instructions that a single thread must
2254execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2255to other memory instructions executed by the same thread. This allows them to be
2256moved earlier or later which can allow them to be combined with other instances
2257of the same instruction, or hoisted/sunk out of loops to improve
2258performance. Only the instructions related to the memory model are given;
2259additional ``s_waitcnt`` instructions are required to ensure registers are
2260defined before being used. These may be able to be combined with the memory
2261model ``s_waitcnt`` instructions as described above.
2262
Tony Tye6baa6d22017-10-18 22:16:55 +00002263The AMDGPU backend supports the following memory models:
2264
2265 HSA Memory Model [HSA]_
2266 The HSA memory model uses a single happens-before relation for all address
2267 spaces (see :ref:`amdgpu-address-spaces`).
2268 OpenCL Memory Model [OpenCL]_
2269 The OpenCL memory model which has separate happens-before relations for the
2270 global and local address spaces. Only a fence specifying both global and
2271 local address space, and seq_cst instructions join the relationships. Since
2272 the LLVM ``memfence`` instruction does not allow an address space to be
2273 specified the OpenCL fence has to convervatively assume both local and
2274 global address space was specified. However, optimizations can often be
2275 done to eliminate the additional ``s_waitcnt`` instructions when there are
2276 no intervening memory instructions which access the corresponding address
2277 space. The code sequences in the table indicate what can be omitted for the
2278 OpenCL memory. The target triple environment is used to determine if the
2279 source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tyef16a45e2017-06-06 20:31:59 +00002280
2281``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2282operations.
2283
2284``buffer/global/flat_load/store/atomic`` instructions to global memory are
2285termed vector memory operations.
2286
2287For GFX6-GFX9:
2288
2289* Each agent has multiple compute units (CU).
2290* Each CU has multiple SIMDs that execute wavefronts.
2291* The wavefronts for a single work-group are executed in the same CU but may be
2292 executed by different SIMDs.
2293* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2294 executing on it.
2295* All LDS operations of a CU are performed as wavefront wide operations in a
2296 global order and involve no caching. Completion is reported to a wavefront in
2297 execution order.
2298* The LDS memory has multiple request queues shared by the SIMDs of a
2299 CU. Therefore, the LDS operations performed by different waves of a work-group
2300 can be reordered relative to each other, which can result in reordering the
2301 visibility of vector memory operations with respect to LDS operations of other
2302 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002303 ensure synchronization between LDS operations and vector memory operations
Tony Tyef16a45e2017-06-06 20:31:59 +00002304 between waves of a work-group, but not between operations performed by the
2305 same wavefront.
2306* The vector memory operations are performed as wavefront wide operations and
2307 completion is reported to a wavefront in execution order. The exception is
2308 that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2309 vector memory order if they access LDS memory, and out of LDS operation order
2310 if they access global memory.
Tony Tye6baa6d22017-10-18 22:16:55 +00002311* The vector memory operations access a single vector L1 cache shared by all
2312 SIMDs a CU. Therefore, no special action is required for coherence between the
2313 lanes of a single wavefront, or for coherence between wavefronts in the same
2314 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2315 executing in different work-groups as they may be executing on different CUs.
Tony Tyef16a45e2017-06-06 20:31:59 +00002316* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2317 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2318 scalar operations are used in a restricted way so do not impact the memory
2319 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2320* The vector and scalar memory operations use an L2 cache shared by all CUs on
2321 the same agent.
2322* The L2 cache has independent channels to service disjoint ranges of virtual
2323 addresses.
2324* Each CU has a separate request queue per channel. Therefore, the vector and
2325 scalar memory operations performed by waves executing in different work-groups
2326 (which may be executing on different CUs) of an agent can be reordered
2327 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002328 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002329 previous vector memory operation has completed before executing a subsequent
2330 vector memory or LDS operation and so can be used to meet the requirements of
2331 acquire and release.
2332* The L2 cache can be kept coherent with other agents on some targets, or ranges
2333 of virtual addresses can be set up to bypass it to ensure system coherence.
2334
2335Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2336or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2337memory, atomic memory orderings are not meaningful and all accesses are treated
2338as non-atomic.
2339
2340Constant address space uses ``buffer/global_load`` instructions (or equivalent
2341scalar memory instructions). Since the constant address space contents do not
2342change during the execution of a kernel dispatch it is not legal to perform
2343stores, and atomic memory orderings are not meaningful and all access are
2344treated as non-atomic.
2345
2346A memory synchronization scope wider than work-group is not meaningful for the
2347group (LDS) address space and is treated as work-group.
2348
2349The memory model does not support the region address space which is treated as
2350non-atomic.
2351
2352Acquire memory ordering is not meaningful on store atomic instructions and is
2353treated as non-atomic.
2354
2355Release memory ordering is not meaningful on load atomic instructions and is
2356treated a non-atomic.
2357
2358Acquire-release memory ordering is not meaningful on load or store atomic
2359instructions and is treated as acquire and release respectively.
2360
2361AMDGPU backend only uses scalar memory operations to access memory that is
2362proven to not change during the execution of the kernel dispatch. This includes
2363constant address space and global address space for program scope const
2364variables. Therefore the kernel machine code does not have to maintain the
2365scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2366and vector L1 caches are invalidated between kernel dispatches by CP since
2367constant address space data may change between kernel dispatch executions. See
2368:ref:`amdgpu-amdhsa-memory-spaces`.
2369
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002370The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002371case the AMDGPU backend ensures the memory location used to spill is never
2372accessed by vector memory operations at the same time. If scalar writes are used
2373then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2374return since the locations may be used for vector memory instructions by a
2375future wave that uses the same scratch area, or a function call that creates a
2376frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2377as all scalar writes are write-before-read in the same thread.
2378
Tony Tye6baa6d22017-10-18 22:16:55 +00002379Scratch backing memory (which is used for the private address space)
2380is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2381address space is only accessed by a single thread, and is always
2382write-before-read, there is never a need to invalidate these entries from the L1
2383cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2384volatile cache lines.
Tony Tyef16a45e2017-06-06 20:31:59 +00002385
2386On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye6baa6d22017-10-18 22:16:55 +00002387to invalidate the L2 cache. This also causes it to be treated as
2388non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2389(cache coherent) and so the L2 cache will coherent with the CPU and other
2390agents.
Tony Tyef16a45e2017-06-06 20:31:59 +00002391
2392 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2393 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2394
Tony Tye6baa6d22017-10-18 22:16:55 +00002395 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002396 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2397 Ordering Sync Scope Address
2398 Space
Tony Tye6baa6d22017-10-18 22:16:55 +00002399 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002400 **Non-Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002401 -----------------------------------------------------------------------------------
2402 load *none* *none* - global - !volatile & !nontemporal
2403 - generic
2404 - private 1. buffer/global/flat_load
2405 - constant
2406 - volatile & !nontemporal
2407
Tony Tyef16a45e2017-06-06 20:31:59 +00002408 1. buffer/global/flat_load
2409 glc=1
Tony Tye6baa6d22017-10-18 22:16:55 +00002410
2411 - nontemporal
2412
2413 1. buffer/global/flat_load
2414 glc=1 slc=1
2415
Tony Tyef16a45e2017-06-06 20:31:59 +00002416 load *none* *none* - local 1. ds_load
Tony Tye6baa6d22017-10-18 22:16:55 +00002417 store *none* *none* - global - !nontemporal
Tony Tyef16a45e2017-06-06 20:31:59 +00002418 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002419 - private 1. buffer/global/flat_store
2420 - constant
2421 - nontemporal
2422
2423 1. buffer/global/flat_stote
2424 glc=1 slc=1
2425
Tony Tyef16a45e2017-06-06 20:31:59 +00002426 store *none* *none* - local 1. ds_store
2427 **Unordered Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002428 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002429 load atomic unordered *any* *any* *Same as non-atomic*.
2430 store atomic unordered *any* *any* *Same as non-atomic*.
2431 atomicrmw unordered *any* *any* *Same as monotonic
2432 atomic*.
2433 **Monotonic Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002434 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002435 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2436 - wavefront - generic
2437 - workgroup
2438 load atomic monotonic - singlethread - local 1. ds_load
2439 - wavefront
2440 - workgroup
2441 load atomic monotonic - agent - global 1. buffer/global/flat_load
2442 - system - generic glc=1
2443 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2444 - wavefront - generic
2445 - workgroup
2446 - agent
2447 - system
2448 store atomic monotonic - singlethread - local 1. ds_store
2449 - wavefront
2450 - workgroup
2451 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2452 - wavefront - generic
2453 - workgroup
2454 - agent
2455 - system
2456 atomicrmw monotonic - singlethread - local 1. ds_atomic
2457 - wavefront
2458 - workgroup
2459 **Acquire Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002460 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002461 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2462 - wavefront - local
2463 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002464 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2465 load atomic acquire - workgroup - local 1. ds_load
2466 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002467
Tony Tye6baa6d22017-10-18 22:16:55 +00002468 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002469 - Must happen before
2470 any following
2471 global/generic
2472 load/load
2473 atomic/store/store
2474 atomic/atomicrmw.
2475 - Ensures any
2476 following global
2477 data read is no
2478 older than the load
2479 atomic value being
2480 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00002481 load atomic acquire - workgroup - generic 1. flat_load
2482 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002483
Tony Tye6baa6d22017-10-18 22:16:55 +00002484 - If OpenCL, omit.
2485 - Must happen before
2486 any following
2487 global/generic
2488 load/load
2489 atomic/store/store
2490 atomic/atomicrmw.
2491 - Ensures any
2492 following global
2493 data read is no
2494 older than the load
2495 atomic value being
2496 acquired.
2497 load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tyef16a45e2017-06-06 20:31:59 +00002498 - system glc=1
2499 2. s_waitcnt vmcnt(0)
2500
2501 - Must happen before
2502 following
2503 buffer_wbinvl1_vol.
2504 - Ensures the load
2505 has completed
2506 before invalidating
2507 the cache.
2508
2509 3. buffer_wbinvl1_vol
2510
2511 - Must happen before
2512 any following
2513 global/generic
2514 load/load
2515 atomic/atomicrmw.
2516 - Ensures that
2517 following
2518 loads will not see
2519 stale global data.
2520
2521 load atomic acquire - agent - generic 1. flat_load glc=1
2522 - system 2. s_waitcnt vmcnt(0) &
2523 lgkmcnt(0)
2524
2525 - If OpenCL omit
2526 lgkmcnt(0).
2527 - Must happen before
2528 following
2529 buffer_wbinvl1_vol.
2530 - Ensures the flat_load
2531 has completed
2532 before invalidating
2533 the cache.
2534
2535 3. buffer_wbinvl1_vol
2536
2537 - Must happen before
2538 any following
2539 global/generic
2540 load/load
2541 atomic/atomicrmw.
2542 - Ensures that
2543 following loads
2544 will not see stale
2545 global data.
2546
2547 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2548 - wavefront - local
2549 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002550 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2551 atomicrmw acquire - workgroup - local 1. ds_atomic
2552 2. waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002553
Tony Tye6baa6d22017-10-18 22:16:55 +00002554 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002555 - Must happen before
2556 any following
2557 global/generic
2558 load/load
2559 atomic/store/store
2560 atomic/atomicrmw.
2561 - Ensures any
2562 following global
2563 data read is no
2564 older than the
2565 atomicrmw value
2566 being acquired.
2567
Tony Tye6baa6d22017-10-18 22:16:55 +00002568 atomicrmw acquire - workgroup - generic 1. flat_atomic
2569 2. waitcnt lgkmcnt(0)
2570
2571 - If OpenCL, omit.
2572 - Must happen before
2573 any following
2574 global/generic
2575 load/load
2576 atomic/store/store
2577 atomic/atomicrmw.
2578 - Ensures any
2579 following global
2580 data read is no
2581 older than the
2582 atomicrmw value
2583 being acquired.
2584
2585 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002586 - system 2. s_waitcnt vmcnt(0)
2587
2588 - Must happen before
2589 following
2590 buffer_wbinvl1_vol.
2591 - Ensures the
2592 atomicrmw has
2593 completed before
2594 invalidating the
2595 cache.
2596
2597 3. buffer_wbinvl1_vol
2598
2599 - Must happen before
2600 any following
2601 global/generic
2602 load/load
2603 atomic/atomicrmw.
2604 - Ensures that
2605 following loads
2606 will not see stale
2607 global data.
2608
2609 atomicrmw acquire - agent - generic 1. flat_atomic
2610 - system 2. s_waitcnt vmcnt(0) &
2611 lgkmcnt(0)
2612
2613 - If OpenCL, omit
2614 lgkmcnt(0).
2615 - Must happen before
2616 following
2617 buffer_wbinvl1_vol.
2618 - Ensures the
2619 atomicrmw has
2620 completed before
2621 invalidating the
2622 cache.
2623
2624 3. buffer_wbinvl1_vol
2625
2626 - Must happen before
2627 any following
2628 global/generic
2629 load/load
2630 atomic/atomicrmw.
2631 - Ensures that
2632 following loads
2633 will not see stale
2634 global data.
2635
2636 fence acquire - singlethread *none* *none*
2637 - wavefront
2638 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2639
2640 - If OpenCL and
2641 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002642 not generic, omit.
2643 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002644 currently has no
2645 address space on
2646 the fence need to
2647 conservatively
2648 always generate. If
2649 fence had an
2650 address space then
2651 set to address
2652 space of OpenCL
2653 fence flag, or to
2654 generic if both
2655 local and global
2656 flags are
2657 specified.
2658 - Must happen after
2659 any preceding
2660 local/generic load
2661 atomic/atomicrmw
2662 with an equal or
2663 wider sync scope
2664 and memory ordering
2665 stronger than
2666 unordered (this is
2667 termed the
2668 fence-paired-atomic).
2669 - Must happen before
2670 any following
2671 global/generic
2672 load/load
2673 atomic/store/store
2674 atomic/atomicrmw.
2675 - Ensures any
2676 following global
2677 data read is no
2678 older than the
2679 value read by the
2680 fence-paired-atomic.
2681
Tony Tye6baa6d22017-10-18 22:16:55 +00002682 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2683 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002684
2685 - If OpenCL and
2686 address space is
2687 not generic, omit
2688 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00002689 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002690 currently has no
2691 address space on
2692 the fence need to
2693 conservatively
2694 always generate
2695 (see comment for
2696 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002697 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002698 separate s_waitcnt
2699 vmcnt(0) and
2700 s_waitcnt
2701 lgkmcnt(0) to allow
2702 them to be
2703 independently moved
2704 according to the
2705 following rules.
2706 - s_waitcnt vmcnt(0)
2707 must happen after
2708 any preceding
2709 global/generic load
2710 atomic/atomicrmw
2711 with an equal or
2712 wider sync scope
2713 and memory ordering
2714 stronger than
2715 unordered (this is
2716 termed the
2717 fence-paired-atomic).
2718 - s_waitcnt lgkmcnt(0)
2719 must happen after
2720 any preceding
Tony Tye6baa6d22017-10-18 22:16:55 +00002721 local/generic load
Tony Tyef16a45e2017-06-06 20:31:59 +00002722 atomic/atomicrmw
2723 with an equal or
2724 wider sync scope
2725 and memory ordering
2726 stronger than
2727 unordered (this is
2728 termed the
2729 fence-paired-atomic).
2730 - Must happen before
2731 the following
2732 buffer_wbinvl1_vol.
2733 - Ensures that the
2734 fence-paired atomic
2735 has completed
2736 before invalidating
2737 the
2738 cache. Therefore
2739 any following
2740 locations read must
2741 be no older than
2742 the value read by
2743 the
2744 fence-paired-atomic.
2745
2746 2. buffer_wbinvl1_vol
2747
Tony Tye6baa6d22017-10-18 22:16:55 +00002748 - Must happen before any
2749 following global/generic
Tony Tyef16a45e2017-06-06 20:31:59 +00002750 load/load
2751 atomic/store/store
2752 atomic/atomicrmw.
2753 - Ensures that
2754 following loads
2755 will not see stale
2756 global data.
2757
2758 **Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002759 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002760 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2761 - wavefront - local
2762 - generic
2763 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002764
2765 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002766 - Must happen after
2767 any preceding
2768 local/generic
2769 load/store/load
2770 atomic/store
2771 atomic/atomicrmw.
2772 - Must happen before
2773 the following
2774 store.
2775 - Ensures that all
2776 memory operations
2777 to local have
2778 completed before
2779 performing the
2780 store that is being
2781 released.
2782
2783 2. buffer/global/flat_store
2784 store atomic release - workgroup - local 1. ds_store
Tony Tye6baa6d22017-10-18 22:16:55 +00002785 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2786
2787 - If OpenCL, omit.
2788 - Must happen after
2789 any preceding
2790 local/generic
2791 load/store/load
2792 atomic/store
2793 atomic/atomicrmw.
2794 - Must happen before
2795 the following
2796 store.
2797 - Ensures that all
2798 memory operations
2799 to local have
2800 completed before
2801 performing the
2802 store that is being
2803 released.
2804
2805 2. flat_store
2806 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2807 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002808
2809 - If OpenCL, omit
2810 lgkmcnt(0).
2811 - Could be split into
2812 separate s_waitcnt
2813 vmcnt(0) and
2814 s_waitcnt
2815 lgkmcnt(0) to allow
2816 them to be
2817 independently moved
2818 according to the
2819 following rules.
2820 - s_waitcnt vmcnt(0)
2821 must happen after
2822 any preceding
2823 global/generic
2824 load/store/load
2825 atomic/store
2826 atomic/atomicrmw.
2827 - s_waitcnt lgkmcnt(0)
2828 must happen after
2829 any preceding
2830 local/generic
2831 load/store/load
2832 atomic/store
2833 atomic/atomicrmw.
2834 - Must happen before
2835 the following
2836 store.
2837 - Ensures that all
2838 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00002839 to memory have
Tony Tyef16a45e2017-06-06 20:31:59 +00002840 completed before
2841 performing the
2842 store that is being
2843 released.
2844
2845 2. buffer/global/ds/flat_store
2846 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2847 - wavefront - local
2848 - generic
2849 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002850
2851 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002852 - Must happen after
2853 any preceding
2854 local/generic
2855 load/store/load
2856 atomic/store
2857 atomic/atomicrmw.
2858 - Must happen before
2859 the following
2860 atomicrmw.
2861 - Ensures that all
2862 memory operations
2863 to local have
2864 completed before
2865 performing the
2866 atomicrmw that is
2867 being released.
2868
2869 2. buffer/global/flat_atomic
2870 atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye6baa6d22017-10-18 22:16:55 +00002871 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2872
2873 - If OpenCL, omit.
2874 - Must happen after
2875 any preceding
2876 local/generic
2877 load/store/load
2878 atomic/store
2879 atomic/atomicrmw.
2880 - Must happen before
2881 the following
2882 atomicrmw.
2883 - Ensures that all
2884 memory operations
2885 to local have
2886 completed before
2887 performing the
2888 atomicrmw that is
2889 being released.
2890
2891 2. flat_atomic
2892 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
2893 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002894
2895 - If OpenCL, omit
2896 lgkmcnt(0).
2897 - Could be split into
2898 separate s_waitcnt
2899 vmcnt(0) and
2900 s_waitcnt
2901 lgkmcnt(0) to allow
2902 them to be
2903 independently moved
2904 according to the
2905 following rules.
2906 - s_waitcnt vmcnt(0)
2907 must happen after
2908 any preceding
2909 global/generic
2910 load/store/load
2911 atomic/store
2912 atomic/atomicrmw.
2913 - s_waitcnt lgkmcnt(0)
2914 must happen after
2915 any preceding
2916 local/generic
2917 load/store/load
2918 atomic/store
2919 atomic/atomicrmw.
2920 - Must happen before
2921 the following
2922 atomicrmw.
2923 - Ensures that all
2924 memory operations
2925 to global and local
2926 have completed
2927 before performing
2928 the atomicrmw that
2929 is being released.
2930
Tony Tye6baa6d22017-10-18 22:16:55 +00002931 2. buffer/global/ds/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002932 fence release - singlethread *none* *none*
2933 - wavefront
2934 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2935
2936 - If OpenCL and
2937 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002938 not generic, omit.
2939 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002940 currently has no
2941 address space on
2942 the fence need to
2943 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00002944 always generate. If
2945 fence had an
2946 address space then
2947 set to address
2948 space of OpenCL
2949 fence flag, or to
2950 generic if both
2951 local and global
2952 flags are
2953 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00002954 - Must happen after
2955 any preceding
2956 local/generic
2957 load/load
2958 atomic/store/store
2959 atomic/atomicrmw.
2960 - Must happen before
2961 any following store
2962 atomic/atomicrmw
2963 with an equal or
2964 wider sync scope
2965 and memory ordering
2966 stronger than
2967 unordered (this is
2968 termed the
2969 fence-paired-atomic).
2970 - Ensures that all
2971 memory operations
2972 to local have
2973 completed before
2974 performing the
2975 following
2976 fence-paired-atomic.
2977
Tony Tye6baa6d22017-10-18 22:16:55 +00002978 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
2979 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002980
2981 - If OpenCL and
2982 address space is
2983 not generic, omit
2984 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00002985 - If OpenCL and
2986 address space is
2987 local, omit
2988 vmcnt(0).
2989 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002990 currently has no
2991 address space on
2992 the fence need to
2993 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00002994 always generate. If
2995 fence had an
2996 address space then
2997 set to address
2998 space of OpenCL
2999 fence flag, or to
3000 generic if both
3001 local and global
3002 flags are
3003 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003004 - Could be split into
3005 separate s_waitcnt
3006 vmcnt(0) and
3007 s_waitcnt
3008 lgkmcnt(0) to allow
3009 them to be
3010 independently moved
3011 according to the
3012 following rules.
3013 - s_waitcnt vmcnt(0)
3014 must happen after
3015 any preceding
3016 global/generic
3017 load/store/load
3018 atomic/store
3019 atomic/atomicrmw.
3020 - s_waitcnt lgkmcnt(0)
3021 must happen after
3022 any preceding
3023 local/generic
3024 load/store/load
3025 atomic/store
3026 atomic/atomicrmw.
3027 - Must happen before
3028 any following store
3029 atomic/atomicrmw
3030 with an equal or
3031 wider sync scope
3032 and memory ordering
3033 stronger than
3034 unordered (this is
3035 termed the
3036 fence-paired-atomic).
3037 - Ensures that all
3038 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00003039 have
Tony Tyef16a45e2017-06-06 20:31:59 +00003040 completed before
3041 performing the
3042 following
3043 fence-paired-atomic.
3044
3045 **Acquire-Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003046 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003047 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3048 - wavefront - local
3049 - generic
3050 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3051
Tony Tye6baa6d22017-10-18 22:16:55 +00003052 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003053 - Must happen after
3054 any preceding
3055 local/generic
3056 load/store/load
3057 atomic/store
3058 atomic/atomicrmw.
3059 - Must happen before
3060 the following
3061 atomicrmw.
3062 - Ensures that all
3063 memory operations
3064 to local have
3065 completed before
3066 performing the
3067 atomicrmw that is
3068 being released.
3069
Tony Tye6baa6d22017-10-18 22:16:55 +00003070 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003071 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3072 2. s_waitcnt lgkmcnt(0)
3073
Tony Tye6baa6d22017-10-18 22:16:55 +00003074 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003075 - Must happen before
3076 any following
3077 global/generic
3078 load/load
3079 atomic/store/store
3080 atomic/atomicrmw.
3081 - Ensures any
3082 following global
3083 data read is no
3084 older than the load
3085 atomic value being
3086 acquired.
3087
3088 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3089
Tony Tye6baa6d22017-10-18 22:16:55 +00003090 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003091 - Must happen after
3092 any preceding
3093 local/generic
3094 load/store/load
3095 atomic/store
3096 atomic/atomicrmw.
3097 - Must happen before
3098 the following
3099 atomicrmw.
3100 - Ensures that all
3101 memory operations
3102 to local have
3103 completed before
3104 performing the
3105 atomicrmw that is
3106 being released.
3107
3108 2. flat_atomic
3109 3. s_waitcnt lgkmcnt(0)
3110
Tony Tye6baa6d22017-10-18 22:16:55 +00003111 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003112 - Must happen before
3113 any following
3114 global/generic
3115 load/load
3116 atomic/store/store
3117 atomic/atomicrmw.
3118 - Ensures any
3119 following global
3120 data read is no
3121 older than the load
3122 atomic value being
3123 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00003124
3125 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3126 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003127
3128 - If OpenCL, omit
3129 lgkmcnt(0).
3130 - Could be split into
3131 separate s_waitcnt
3132 vmcnt(0) and
3133 s_waitcnt
3134 lgkmcnt(0) to allow
3135 them to be
3136 independently moved
3137 according to the
3138 following rules.
3139 - s_waitcnt vmcnt(0)
3140 must happen after
3141 any preceding
3142 global/generic
3143 load/store/load
3144 atomic/store
3145 atomic/atomicrmw.
3146 - s_waitcnt lgkmcnt(0)
3147 must happen after
3148 any preceding
3149 local/generic
3150 load/store/load
3151 atomic/store
3152 atomic/atomicrmw.
3153 - Must happen before
3154 the following
3155 atomicrmw.
3156 - Ensures that all
3157 memory operations
3158 to global have
3159 completed before
3160 performing the
3161 atomicrmw that is
3162 being released.
3163
Tony Tye6baa6d22017-10-18 22:16:55 +00003164 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003165 3. s_waitcnt vmcnt(0)
3166
3167 - Must happen before
3168 following
3169 buffer_wbinvl1_vol.
3170 - Ensures the
3171 atomicrmw has
3172 completed before
3173 invalidating the
3174 cache.
3175
3176 4. buffer_wbinvl1_vol
3177
3178 - Must happen before
3179 any following
3180 global/generic
3181 load/load
3182 atomic/atomicrmw.
3183 - Ensures that
3184 following loads
3185 will not see stale
3186 global data.
3187
Tony Tye6baa6d22017-10-18 22:16:55 +00003188 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3189 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003190
3191 - If OpenCL, omit
3192 lgkmcnt(0).
3193 - Could be split into
3194 separate s_waitcnt
3195 vmcnt(0) and
3196 s_waitcnt
3197 lgkmcnt(0) to allow
3198 them to be
3199 independently moved
3200 according to the
3201 following rules.
3202 - s_waitcnt vmcnt(0)
3203 must happen after
3204 any preceding
3205 global/generic
3206 load/store/load
3207 atomic/store
3208 atomic/atomicrmw.
3209 - s_waitcnt lgkmcnt(0)
3210 must happen after
3211 any preceding
3212 local/generic
3213 load/store/load
3214 atomic/store
3215 atomic/atomicrmw.
3216 - Must happen before
3217 the following
3218 atomicrmw.
3219 - Ensures that all
3220 memory operations
3221 to global have
3222 completed before
3223 performing the
3224 atomicrmw that is
3225 being released.
3226
3227 2. flat_atomic
3228 3. s_waitcnt vmcnt(0) &
3229 lgkmcnt(0)
3230
3231 - If OpenCL, omit
3232 lgkmcnt(0).
3233 - Must happen before
3234 following
3235 buffer_wbinvl1_vol.
3236 - Ensures the
3237 atomicrmw has
3238 completed before
3239 invalidating the
3240 cache.
3241
3242 4. buffer_wbinvl1_vol
3243
3244 - Must happen before
3245 any following
3246 global/generic
3247 load/load
3248 atomic/atomicrmw.
3249 - Ensures that
3250 following loads
3251 will not see stale
3252 global data.
3253
3254 fence acq_rel - singlethread *none* *none*
3255 - wavefront
3256 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3257
3258 - If OpenCL and
3259 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003260 not generic, omit.
3261 - However,
Tony Tyef16a45e2017-06-06 20:31:59 +00003262 since LLVM
3263 currently has no
3264 address space on
3265 the fence need to
3266 conservatively
3267 always generate
3268 (see comment for
3269 previous fence).
3270 - Must happen after
3271 any preceding
3272 local/generic
3273 load/load
3274 atomic/store/store
3275 atomic/atomicrmw.
3276 - Must happen before
3277 any following
3278 global/generic
3279 load/load
3280 atomic/store/store
3281 atomic/atomicrmw.
3282 - Ensures that all
3283 memory operations
3284 to local have
3285 completed before
3286 performing any
3287 following global
3288 memory operations.
3289 - Ensures that the
3290 preceding
3291 local/generic load
3292 atomic/atomicrmw
3293 with an equal or
3294 wider sync scope
3295 and memory ordering
3296 stronger than
3297 unordered (this is
3298 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003299 acquire-fence-paired-atomic
3300 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003301 before following
3302 global memory
3303 operations. This
3304 satisfies the
3305 requirements of
3306 acquire.
3307 - Ensures that all
3308 previous memory
3309 operations have
3310 completed before a
3311 following
3312 local/generic store
3313 atomic/atomicrmw
3314 with an equal or
3315 wider sync scope
3316 and memory ordering
3317 stronger than
3318 unordered (this is
3319 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003320 release-fence-paired-atomic
3321 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003322 requirements of
3323 release.
3324
Tony Tye6baa6d22017-10-18 22:16:55 +00003325 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3326 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003327
3328 - If OpenCL and
3329 address space is
3330 not generic, omit
3331 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003332 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003333 currently has no
3334 address space on
3335 the fence need to
3336 conservatively
3337 always generate
3338 (see comment for
3339 previous fence).
3340 - Could be split into
3341 separate s_waitcnt
3342 vmcnt(0) and
3343 s_waitcnt
3344 lgkmcnt(0) to allow
3345 them to be
3346 independently moved
3347 according to the
3348 following rules.
3349 - s_waitcnt vmcnt(0)
3350 must happen after
3351 any preceding
3352 global/generic
3353 load/store/load
3354 atomic/store
3355 atomic/atomicrmw.
3356 - s_waitcnt lgkmcnt(0)
3357 must happen after
3358 any preceding
3359 local/generic
3360 load/store/load
3361 atomic/store
3362 atomic/atomicrmw.
3363 - Must happen before
3364 the following
3365 buffer_wbinvl1_vol.
3366 - Ensures that the
3367 preceding
3368 global/local/generic
3369 load
3370 atomic/atomicrmw
3371 with an equal or
3372 wider sync scope
3373 and memory ordering
3374 stronger than
3375 unordered (this is
3376 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003377 acquire-fence-paired-atomic
3378 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003379 before invalidating
3380 the cache. This
3381 satisfies the
3382 requirements of
3383 acquire.
3384 - Ensures that all
3385 previous memory
3386 operations have
3387 completed before a
3388 following
3389 global/local/generic
3390 store
3391 atomic/atomicrmw
3392 with an equal or
3393 wider sync scope
3394 and memory ordering
3395 stronger than
3396 unordered (this is
3397 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003398 release-fence-paired-atomic
3399 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003400 requirements of
3401 release.
3402
3403 2. buffer_wbinvl1_vol
3404
3405 - Must happen before
3406 any following
3407 global/generic
3408 load/load
3409 atomic/store/store
3410 atomic/atomicrmw.
3411 - Ensures that
3412 following loads
3413 will not see stale
3414 global data. This
3415 satisfies the
3416 requirements of
3417 acquire.
3418
3419 **Sequential Consistent Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003420 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003421 load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003422 - wavefront - local load atomic acquire,
3423 - generic except must generated
3424 all instructions even
3425 for OpenCL.*
3426 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3427 - generic
3428 - Must
3429 happen after
3430 preceding
3431 global/generic load
3432 atomic/store
3433 atomic/atomicrmw
3434 with memory
3435 ordering of seq_cst
3436 and with equal or
3437 wider sync scope.
3438 (Note that seq_cst
3439 fences have their
3440 own s_waitcnt
3441 lgkmcnt(0) and so do
3442 not need to be
3443 considered.)
3444 - Ensures any
3445 preceding
3446 sequential
3447 consistent local
3448 memory instructions
3449 have completed
3450 before executing
3451 this sequentially
3452 consistent
3453 instruction. This
3454 prevents reordering
3455 a seq_cst store
3456 followed by a
3457 seq_cst load. (Note
3458 that seq_cst is
3459 stronger than
3460 acquire/release as
3461 the reordering of
3462 load acquire
3463 followed by a store
3464 release is
3465 prevented by the
3466 waitcnt of
3467 the release, but
3468 there is nothing
3469 preventing a store
3470 release followed by
3471 load acquire from
3472 competing out of
3473 order.)
3474
3475 2. *Following
3476 instructions same as
3477 corresponding load
3478 atomic acquire,
3479 except must generated
3480 all instructions even
3481 for OpenCL.*
3482 load atomic seq_cst - workgroup - local *Same as corresponding
3483 load atomic acquire,
3484 except must generated
3485 all instructions even
3486 for OpenCL.*
3487 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3488 - system - generic vmcnt(0)
3489
3490 - Could be split into
3491 separate s_waitcnt
3492 vmcnt(0)
3493 and s_waitcnt
3494 lgkmcnt(0) to allow
3495 them to be
3496 independently moved
3497 according to the
3498 following rules.
3499 - waitcnt lgkmcnt(0)
3500 must happen after
3501 preceding
3502 global/generic load
3503 atomic/store
3504 atomic/atomicrmw
3505 with memory
3506 ordering of seq_cst
3507 and with equal or
3508 wider sync scope.
3509 (Note that seq_cst
3510 fences have their
3511 own s_waitcnt
3512 lgkmcnt(0) and so do
3513 not need to be
3514 considered.)
3515 - waitcnt vmcnt(0)
3516 must happen after
Tony Tyef16a45e2017-06-06 20:31:59 +00003517 preceding
3518 global/generic load
3519 atomic/store
3520 atomic/atomicrmw
3521 with memory
3522 ordering of seq_cst
3523 and with equal or
3524 wider sync scope.
3525 (Note that seq_cst
3526 fences have their
3527 own s_waitcnt
3528 vmcnt(0) and so do
3529 not need to be
3530 considered.)
3531 - Ensures any
3532 preceding
3533 sequential
3534 consistent global
3535 memory instructions
3536 have completed
3537 before executing
3538 this sequentially
3539 consistent
3540 instruction. This
3541 prevents reordering
3542 a seq_cst store
3543 followed by a
Tony Tye6baa6d22017-10-18 22:16:55 +00003544 seq_cst load. (Note
Tony Tyef16a45e2017-06-06 20:31:59 +00003545 that seq_cst is
3546 stronger than
3547 acquire/release as
3548 the reordering of
3549 load acquire
3550 followed by a store
3551 release is
3552 prevented by the
Tony Tye6baa6d22017-10-18 22:16:55 +00003553 waitcnt of
Tony Tyef16a45e2017-06-06 20:31:59 +00003554 the release, but
3555 there is nothing
3556 preventing a store
3557 release followed by
3558 load acquire from
3559 competing out of
3560 order.)
3561
3562 2. *Following
3563 instructions same as
3564 corresponding load
Tony Tye6baa6d22017-10-18 22:16:55 +00003565 atomic acquire,
3566 except must generated
3567 all instructions even
3568 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003569 store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003570 - wavefront - local store atomic release,
3571 - workgroup - generic except must generated
3572 all instructions even
3573 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003574 store atomic seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003575 - system - generic store atomic release,
3576 except must generated
3577 all instructions even
3578 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003579 atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003580 - wavefront - local atomicrmw acq_rel,
3581 - workgroup - generic except must generated
3582 all instructions even
3583 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003584 atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003585 - system - generic atomicrmw acq_rel,
3586 except must generated
3587 all instructions even
3588 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003589 fence seq_cst - singlethread *none* *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003590 - wavefront fence acq_rel,
3591 - workgroup except must generated
3592 - agent all instructions even
3593 - system for OpenCL.*
3594 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00003595
3596The memory order also adds the single thread optimization constrains defined in
3597table
3598:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3599
3600 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3601 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3602
3603 ============ ==============================================================
3604 LLVM Memory Optimization Constraints
3605 Ordering
3606 ============ ==============================================================
3607 unordered *none*
3608 monotonic *none*
3609 acquire - If a load atomic/atomicrmw then no following load/load
3610 atomic/store/ store atomic/atomicrmw/fence instruction can
3611 be moved before the acquire.
3612 - If a fence then same as load atomic, plus no preceding
3613 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003614 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003615 atomic/store/ store atomic/atomicrmw/fence instruction can
3616 be moved after the release.
3617 - If a fence then same as store atomic, plus no following
3618 associated fence-paired-atomic can be moved before the
3619 fence.
3620 acq_rel Same constraints as both acquire and release.
3621 seq_cst - If a load atomic then same constraints as acquire, plus no
3622 preceding sequentially consistent load atomic/store
3623 atomic/atomicrmw/fence instruction can be moved after the
3624 seq_cst.
3625 - If a store atomic then the same constraints as release, plus
3626 no following sequentially consistent load atomic/store
3627 atomic/atomicrmw/fence instruction can be moved before the
3628 seq_cst.
3629 - If an atomicrmw/fence then same constraints as acq_rel.
3630 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003631
Wei Ding16289cf2017-02-21 18:48:01 +00003632Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003633~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003634
Tony Tyef16a45e2017-06-06 20:31:59 +00003635For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3636(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3637the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003638
Tony Tyef16a45e2017-06-06 20:31:59 +00003639 .. table:: AMDGPU Trap Handler for AMDHSA OS
3640 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003641
Tony Tyef16a45e2017-06-06 20:31:59 +00003642 =================== =============== =============== =======================
3643 Usage Code Sequence Trap Handler Description
3644 Inputs
3645 =================== =============== =============== =======================
3646 reserved ``s_trap 0x00`` Reserved by hardware.
3647 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3648 ``queue_ptr`` ``debugtrap``
3649 ``VGPR0``: intrinsic (not
3650 ``arg`` implemented).
3651 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3652 ``queue_ptr`` terminated and its
3653 associated queue put
3654 into the error state.
3655 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3656 ``queue_ptr`` installed handled
3657 same as ``llvm.trap``.
3658 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3659 breakpoints.
3660 debugger ``s_trap 0x08`` Reserved for debugger.
3661 debugger ``s_trap 0xfe`` Reserved for debugger.
3662 debugger ``s_trap 0xff`` Reserved for debugger.
3663 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003664
Tony Tye46d35762017-08-15 20:47:41 +00003665Unspecified OS
3666--------------
3667
3668This section provides code conventions used when the target triple OS is
3669empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003670
3671Trap Handler ABI
3672~~~~~~~~~~~~~~~~
3673
3674For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3675not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3676instructions are handled as follows:
3677
3678 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3679 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3680
3681 =============== =============== ===========================================
3682 Usage Code Sequence Description
3683 =============== =============== ===========================================
3684 llvm.trap s_endpgm Causes wavefront to be terminated.
3685 llvm.debugtrap *none* Compiler warning given that there is no
3686 trap handler installed.
3687 =============== =============== ===========================================
3688
3689Source Languages
3690================
3691
3692.. _amdgpu-opencl:
3693
3694OpenCL
3695------
3696
3697When generating code for the OpenCL language the target triple environment
3698should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3699
3700When the language is OpenCL the following differences occur:
3701
37021. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
37032. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye46d35762017-08-15 20:47:41 +000037043. Additional metadata is generated
3705 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003706
3707.. TODO
3708 Specify what affect this has. Hidden arguments added. Additional metadata
3709 generated.
3710
3711.. _amdgpu-hcc:
3712
3713HCC
3714---
3715
3716When generating code for the OpenCL language the target triple environment
3717should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3718
3719When the language is OpenCL the following differences occur:
3720
37211. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3722
3723.. TODO
3724 Specify what affect this has.
Tom Stellard3ec09e62016-04-06 01:29:19 +00003725
Tom Stellard45bb48e2015-06-13 03:28:10 +00003726Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00003727---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00003728
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003729AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef16a45e2017-06-06 20:31:59 +00003730It supports AMDGCN GFX6-GFX8.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003731
Tony Tyef16a45e2017-06-06 20:31:59 +00003732This section describes general syntax for instructions and operands. For more
3733information about instructions, their semantics and supported combinations of
3734operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00003735[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003736
Tony Tyef16a45e2017-06-06 20:31:59 +00003737An instruction has the following syntax (register operands are normally
3738comma-separated while extra operands are space-separated):
Tom Stellard45bb48e2015-06-13 03:28:10 +00003739
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003740*<opcode> <register_operand0>, ... <extra_operand0> ...*
Tom Stellard45bb48e2015-06-13 03:28:10 +00003741
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003742Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00003743~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00003744
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003745The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003746
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003747* SGPR registers: s0, ... or s[0], ...
3748* VGPR registers: v0, ... or v[0], ...
3749* TTMP registers: ttmp0, ... or ttmp[0], ...
3750* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3751* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3752* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3753* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3754* Register index expressions: v[2*2], s[1-1:2-1]
3755* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00003756
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003757The following extra operands are supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003758
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003759* offset, offset0, offset1
3760* idxen, offen bits
3761* glc, slc, tfe bits
3762* waitcnt: integer or combination of counter values
3763* VOP3 modifiers:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003764
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003765 - abs (\| \|), neg (\-)
Tom Stellard45bb48e2015-06-13 03:28:10 +00003766
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003767* DPP modifiers:
3768
3769 - row_shl, row_shr, row_ror, row_rol
3770 - row_mirror, row_half_mirror, row_bcast
3771 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3772 - row_mask, bank_mask, bound_ctrl
3773
3774* SDWA modifiers:
3775
3776 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3777 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3778 - abs, neg, sext
3779
Tony Tyef16a45e2017-06-06 20:31:59 +00003780Instruction Examples
3781~~~~~~~~~~~~~~~~~~~~
3782
3783DS
3784~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003785
3786.. code-block:: nasm
3787
3788 ds_add_u32 v2, v4 offset:16
3789 ds_write_src2_b64 v2 offset0:4 offset1:8
3790 ds_cmpst_f32 v2, v4, v6
3791 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3792
3793
3794For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3795
Tony Tyef16a45e2017-06-06 20:31:59 +00003796FLAT
3797++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003798
3799.. code-block:: nasm
3800
3801 flat_load_dword v1, v[3:4]
3802 flat_store_dwordx3 v[3:4], v[5:7]
3803 flat_atomic_swap v1, v[3:4], v5 glc
3804 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3805 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3806
3807For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3808
Tony Tyef16a45e2017-06-06 20:31:59 +00003809MUBUF
3810+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003811
3812.. code-block:: nasm
3813
3814 buffer_load_dword v1, off, s[4:7], s1
3815 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3816 buffer_store_format_xy v[1:2], off, s[4:7], s1
3817 buffer_wbinvl1
3818 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3819
3820For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3821
Tony Tyef16a45e2017-06-06 20:31:59 +00003822SMRD/SMEM
3823+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003824
3825.. code-block:: nasm
3826
3827 s_load_dword s1, s[2:3], 0xfc
3828 s_load_dwordx8 s[8:15], s[2:3], s4
3829 s_load_dwordx16 s[88:103], s[2:3], s4
3830 s_dcache_inv_vol
3831 s_memtime s[4:5]
3832
3833For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3834
Tony Tyef16a45e2017-06-06 20:31:59 +00003835SOP1
3836++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003837
3838.. code-block:: nasm
3839
3840 s_mov_b32 s1, s2
3841 s_mov_b64 s[0:1], 0x80000000
3842 s_cmov_b32 s1, 200
3843 s_wqm_b64 s[2:3], s[4:5]
3844 s_bcnt0_i32_b64 s1, s[2:3]
3845 s_swappc_b64 s[2:3], s[4:5]
3846 s_cbranch_join s[4:5]
3847
3848For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3849
Tony Tyef16a45e2017-06-06 20:31:59 +00003850SOP2
3851++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003852
3853.. code-block:: nasm
3854
3855 s_add_u32 s1, s2, s3
3856 s_and_b64 s[2:3], s[4:5], s[6:7]
3857 s_cselect_b32 s1, s2, s3
3858 s_andn2_b32 s2, s4, s6
3859 s_lshr_b64 s[2:3], s[4:5], s6
3860 s_ashr_i32 s2, s4, s6
3861 s_bfm_b64 s[2:3], s4, s6
3862 s_bfe_i64 s[2:3], s[4:5], s6
3863 s_cbranch_g_fork s[4:5], s[6:7]
3864
3865For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3866
Tony Tyef16a45e2017-06-06 20:31:59 +00003867SOPC
3868++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003869
3870.. code-block:: nasm
3871
3872 s_cmp_eq_i32 s1, s2
3873 s_bitcmp1_b32 s1, s2
3874 s_bitcmp0_b64 s[2:3], s4
3875 s_setvskip s3, s5
3876
3877For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3878
Tony Tyef16a45e2017-06-06 20:31:59 +00003879SOPP
3880++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003881
3882.. code-block:: nasm
3883
3884 s_barrier
3885 s_nop 2
3886 s_endpgm
3887 s_waitcnt 0 ; Wait for all counters to be 0
3888 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3889 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3890 s_sethalt 9
3891 s_sleep 10
3892 s_sendmsg 0x1
3893 s_sendmsg sendmsg(MSG_INTERRUPT)
3894 s_trap 1
3895
3896For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3897
3898Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00003899of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00003900range or acceptable values.
3901
Tony Tyef16a45e2017-06-06 20:31:59 +00003902VALU
3903++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00003904
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003905For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3906the assembler will automatically use optimal encoding based on its operands.
3907To force specific encoding, one can add a suffix to the opcode of the instruction:
3908
3909* _e32 for 32-bit VOP1/VOP2/VOPC
3910* _e64 for 64-bit VOP3
3911* _dpp for VOP_DPP
3912* _sdwa for VOP_SDWA
3913
3914VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003915
3916.. code-block:: nasm
3917
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003918 v_mov_b32 v1, v2
3919 v_mov_b32_e32 v1, v2
3920 v_nop
3921 v_cvt_f64_i32_e32 v[1:2], v2
3922 v_floor_f32_e32 v1, v2
3923 v_bfrev_b32_e32 v1, v2
3924 v_add_f32_e32 v1, v2, v3
3925 v_mul_i32_i24_e64 v1, v2, 3
3926 v_mul_i32_i24_e32 v1, -3, v3
3927 v_mul_i32_i24_e32 v1, -100, v3
3928 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3929 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00003930
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003931VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003932
3933.. code-block:: nasm
3934
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003935 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3936 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3937 v_mov_b32 v0, v0 wave_shl:1
3938 v_mov_b32 v0, v0 row_mirror
3939 v_mov_b32 v0, v0 row_bcast:31
3940 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3941 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3942 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00003943
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003944VOP_SDWA examples:
3945
3946.. code-block:: nasm
3947
3948 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
3949 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
3950 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
3951 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
3952 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
3953
3954For full list of supported instructions, refer to "Vector ALU instructions".
3955
3956HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00003957~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003958
3959AMDGPU ABI defines auxiliary data in output code object. In assembly source,
3960one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00003961
3962.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00003963+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00003964
3965*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003966object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00003967
3968.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00003969+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3970
Tom Stellard347ac792015-06-26 21:15:07 +00003971
3972*major*, *minor*, and *stepping* are all integers that describe the instruction
3973set architecture (ISA) version of the assembly program.
3974
3975*vendor* and *arch* are quoted strings. *vendor* should always be equal to
3976"AMD" and *arch* should always be equal to "AMDGPU".
3977
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003978By default, the assembler will derive the ISA version, *vendor*, and *arch*
3979from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00003980
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003981.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00003982+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003983
3984This directives specifies that the symbol with given name is a kernel entry point
3985(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00003986
3987.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00003988++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00003989
3990This directive marks the beginning of a list of key / value pairs that are used
3991to specify the amd_kernel_code_t object that will be emitted by the assembler.
3992The list must be terminated by the *.end_amd_kernel_code_t* directive. For
3993any amd_kernel_code_t values that are unspecified a default value will be
3994used. The default value for all keys is 0, with the following exceptions:
3995
3996- *kernel_code_version_major* defaults to 1.
3997- *machine_kind* defaults to 1.
3998- *machine_version_major*, *machine_version_minor*, and
3999 *machine_version_stepping* are derived from the value of the -mcpu option
4000 that is passed to the assembler.
4001- *kernel_code_entry_byte_offset* defaults to 256.
4002- *wavefront_size* defaults to 6.
4003- *kernarg_segment_alignment*, *group_segment_alignment*, and
Tony Tye6baa6d22017-10-18 22:16:55 +00004004 *private_segment_alignment* default to 4. Note that alignments are specified
Tom Stellardff7416b2015-06-26 21:58:31 +00004005 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4006
4007The *.amd_kernel_code_t* directive must be placed immediately after the
4008function label and before any instructions.
4009
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004010For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4011comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00004012
4013Here is an example of a minimal amd_kernel_code_t specification:
4014
Aaron Ballman887ad0e2016-07-19 17:46:55 +00004015.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00004016
4017 .hsa_code_object_version 1,0
4018 .hsa_code_object_isa
4019
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004020 .hsatext
4021 .globl hello_world
4022 .p2align 8
4023 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00004024
4025 hello_world:
4026
4027 .amd_kernel_code_t
4028 enable_sgpr_kernarg_segment_ptr = 1
4029 is_ptr64 = 1
4030 compute_pgm_rsrc1_vgprs = 0
4031 compute_pgm_rsrc1_sgprs = 0
4032 compute_pgm_rsrc2_user_sgpr = 2
4033 kernarg_segment_byte_size = 8
4034 wavefront_sgpr_count = 2
4035 workitem_vgpr_count = 3
4036 .end_amd_kernel_code_t
4037
4038 s_load_dwordx2 s[0:1], s[0:1] 0x0
4039 v_mov_b32 v0, 3.14159
4040 s_waitcnt lgkmcnt(0)
4041 v_mov_b32 v1, s0
4042 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004043 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00004044 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00004045 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004046 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00004047
4048Additional Documentation
4049========================
4050
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004051.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4052.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4053.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4054.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4055.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4056.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4057.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4058.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004059.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
4060.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
4061.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4062.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4063.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4064.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4065.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00004066.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004067.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4068.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
4069.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__