blob: ecb0c11dbcbb872e8af428e4a48a7b15e65620fd [file] [log] [blame]
Tony Tyef16a45e2017-06-06 20:31:59 +00001=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
Tony Tye07d9f102017-11-10 01:00:54 +000026 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
Tony Tyef16a45e2017-06-06 20:31:59 +000028
Tony Tye07d9f102017-11-10 01:00:54 +000029 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000035
Tony Tye07d9f102017-11-10 01:00:54 +000036 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000038
Tony Tye07d9f102017-11-10 01:00:54 +000039 ============ ==============================================================
40 Vendor Description
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000045
Tony Tye07d9f102017-11-10 01:00:54 +000046 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
Tony Tyef16a45e2017-06-06 20:31:59 +000048
Tony Tye07d9f102017-11-10 01:00:54 +000049 ============== ============================================================
50 OS Description
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
56 runtime.
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
58 runtime.
59 ============== ============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000060
Tony Tye07d9f102017-11-10 01:00:54 +000061 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
Tony Tyef16a45e2017-06-06 20:31:59 +000063
Tony Tye07d9f102017-11-10 01:00:54 +000064 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
67 *<empty>* Defaults to ``opencl``.
68 ``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
69 ``amdgizcl`` Same as ``opencl`` except a different address space mapping is
70 used (see :ref:`amdgpu-address-spaces`).
71 ``amdgiz`` Same as ``opencl`` except a different address space mapping is
72 used (see :ref:`amdgpu-address-spaces`).
73 ``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
74 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000075
76.. _amdgpu-processors:
77
78Processors
79----------
80
81Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
82names from both the *Processor* and *Alternative Processor* can be used.
83
84 .. table:: AMDGPU Processors
Tony Tye07d9f102017-11-10 01:00:54 +000085 :name: amdgpu-processor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000086
Tony Tye07d9f102017-11-10 01:00:54 +000087 =========== =============== ============ ===== ======= ==================
88 Processor Alternative Target dGPU/ ROCm Example
89 Processor Triple APU Support Products
90 Architecture
91 =========== =============== ============ ===== ======= ==================
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000092 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
Tony Tye07d9f102017-11-10 01:00:54 +000093 -------------------------------------------------------------------------
94 ``r600`` ``r600`` dGPU
95 ``r630`` ``r600`` dGPU
96 ``rs880`` ``r600`` dGPU
97 ``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000098 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
Tony Tye07d9f102017-11-10 01:00:54 +000099 -------------------------------------------------------------------------
100 ``rv710`` ``r600`` dGPU
101 ``rv730`` ``r600`` dGPU
102 ``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000103 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
Tony Tye07d9f102017-11-10 01:00:54 +0000104 -------------------------------------------------------------------------
105 ``cedar`` ``r600`` dGPU
106 ``redwood`` ``r600`` dGPU
107 ``sumo`` ``r600`` dGPU
108 ``juniper`` ``r600`` dGPU
109 ``cypress`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000110 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
Tony Tye07d9f102017-11-10 01:00:54 +0000111 -------------------------------------------------------------------------
112 ``barts`` ``r600`` dGPU
113 ``turks`` ``r600`` dGPU
114 ``caicos`` ``r600`` dGPU
115 ``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000116 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
Tony Tye07d9f102017-11-10 01:00:54 +0000117 -------------------------------------------------------------------------
118 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
119 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU
120 - ``verde``
121 - ``oland``
122 - ``hainan``
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000123 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
Tony Tye07d9f102017-11-10 01:00:54 +0000124 -------------------------------------------------------------------------
125 ``gfx700`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
126 - Radeon HD 8770
127 - R7 260
128 - R7 260X
129 \ - ``kaveri`` ``amdgcn`` APU - A6-7000
130 - A6 Pro-7050B
131 - A8-7100
132 - A8 Pro-7150B
133 - A10-7300
134 - A10 Pro-7350B
135 - FX-7500
136 - A8-7200P
137 - A10-7400P
138 - FX-7600P
139 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
140 - FirePro W9100
141 - FirePro S9150
142 - FirePro S9170
143 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
144 - Radeon R9 290x
145 - Radeon R390
146 - Radeon R390x
147 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
148 - ``mullins`` - E1-2200
149 - E1-2500
150 - E2-3000
151 - E2-3800
152 - A4-5000
153 - A4-5100
154 - A6-5200
155 - A4 Pro-3340B
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000156 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
Tony Tye07d9f102017-11-10 01:00:54 +0000157 -------------------------------------------------------------------------
158 ``gfx800`` - ``iceland`` ``amdgcn`` dGPU - FirePro S7150
159 - FirePro S7100
160 - FirePro W7100
161 - Radeon R285
162 - Radeon R9 380
163 - Radeon R9 385
164 - Mobile FirePro
165 M7170
166 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - A6-8500P
167 - Pro A6-8500B
168 - A8-8600P
169 - Pro A8-8600B
170 - FX-8800P
171 - Pro A12-8800B
172 \ ``amdgcn`` APU ROCm - A10-8700P
173 - Pro A10-8700B
174 - A10-8780P
175 \ ``amdgcn`` APU - A10-9600P
176 - A10-9630P
177 - A12-9700P
178 - A12-9730P
179 - FX-9800P
180 - FX-9830P
181 \ ``amdgcn`` APU - E2-9010
182 - A6-9210
183 - A9-9410
184 ``gfx802`` - ``tonga`` ``amdgcn`` dGPU ROCm Same as gfx800
185 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU ROCm - Radeon R9 Nano
186 - Radeon R9 Fury
187 - Radeon R9 FuryX
188 - Radeon Pro Duo
189 - FirePro S9300x2
190 - Radeon Instinct MI8
191 \ - ``polaris10`` ``amdgcn`` dGPU ROCm - Radeon RX 470
192 - Radeon RX 480
193 - Radeon Instinct MI6
194 \ - ``polaris11`` ``amdgcn`` dGPU ROCm - Radeon RX 460
195 ``gfx810`` - ``stoney`` ``amdgcn`` APU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000196 **GCN GFX9** [AMD-GCN-GFX9]_
Tony Tye07d9f102017-11-10 01:00:54 +0000197 -------------------------------------------------------------------------
198 ``gfx900`` ``amdgcn`` dGPU ROCm - Radeon Vega
199 Frontier Edition
200 - Radeon RX Vega 56
201 - Radeon RX Vega 64
202 - Radeon RX Vega 64
203 Liquid
204 - Radeon Instinct MI25
205 ``gfx902`` ``amdgcn`` APU *TBA*
Tony Tyef16a45e2017-06-06 20:31:59 +0000206
Tony Tye07d9f102017-11-10 01:00:54 +0000207 .. TODO
208 Add product
209 names.
210 =========== =============== ============ ===== ======= ==================
211
212.. _amdgpu-target-features:
213
214Target Features
215---------------
216
217Target features control how code is generated to support certain
218features. Not all target features are supported by all processors. The
219runtime must ensure that the features supported by the device used to
220execute the code match the features enabled when generating the
221code. A mismatch of features may result in incorrect execution, or a
222reduction in performance.
223
224Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
225target features.
226
227For example:
228
229``-mxnack``
230 Enable the *XNACK* feature.
231``-mno-xnack``
232 Disable the *XNACK* feature.
233
234 .. table:: AMDGPU Target Features
235 :name: amdgpu-target-feature-table
236
237 ============== ======== ==================================================
238 Target Feature Default Description
239 ============== ======== ==================================================
240 -m[no-]xnack disabled Enable/disable generating code that has
241 memory clauses that are compatible with
242 having XNACK replay enabled.
243
244 This is used for demand paging and page
245 migration. If XNACK replay is enabled in
246 the device, then if a page fault occurs
247 the code may execute incorrectly if the
248 XNACK feature is not enabled. Executing
249 code that has the feature enabled on a
250 device that does not have XNACK replay
251 enabled will execute correctly, but may
252 be less performant than code with the
253 feature disabled.
254
255 This feature is supported by the
Tony Tye35077502017-11-11 00:50:32 +0000256 ``amdgcn`` architecture for GFX8-GFX9.
Tony Tye07d9f102017-11-10 01:00:54 +0000257 ============== ======== ==================================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000258
259.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000260
261Address Spaces
262--------------
263
Tony Tyef16a45e2017-06-06 20:31:59 +0000264The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000265
Tony Tyef16a45e2017-06-06 20:31:59 +0000266The memory space names used in the table, aside from the region memory space, is
267from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000268
Tony Tyef16a45e2017-06-06 20:31:59 +0000269LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000270
Tony Tyef16a45e2017-06-06 20:31:59 +0000271 .. table:: Address Space Mapping
272 :name: amdgpu-address-space-mapping-table
273
274 ================== ================= ================= ================= =================
275 LLVM Address Space Memory Space
276 ------------------ -----------------------------------------------------------------------
277 \ Current Default amdgiz/amdgizcl hcc Future Default
278 ================== ================= ================= ================= =================
279 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
280 1 Global Global Global Global
281 2 Constant Constant Constant Region (GDS)
282 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
283 4 Generic (Flat) Region (GDS) Region (GDS) Constant
284 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
285 ================== ================= ================= ================= =================
286
287Current Default
288 This is the current default address space mapping used for all languages
289 except hcc. This will shortly be deprecated.
290
291amdgiz/amdgizcl
292 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
293 is specified as the target triple environment value.
294
295hcc
296 This is the current address space mapping used when ``hcc`` is specified as
297 the target triple environment value.This will shortly be deprecated.
298
299Future Default
300 This will shortly be the only address space mapping for all languages using
301 AMDGPU backend.
302
303.. _amdgpu-memory-scopes:
304
305Memory Scopes
306-------------
307
308This section provides LLVM memory synchronization scopes supported by the AMDGPU
309backend memory model when the target triple OS is ``amdhsa`` (see
310:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
311
312The memory model supported is based on the HSA memory model [HSA]_ which is
313based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
314relation is transitive over the synchonizes-with relation independent of scope,
315and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye07d9f102017-11-10 01:00:54 +0000316table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000317
318This is different to the OpenCL [OpenCL]_ memory model which does not have scope
319inclusion and requires the memory scopes to exactly match. However, this
320is conservatively correct for OpenCL.
321
Tony Tye07d9f102017-11-10 01:00:54 +0000322 .. table:: AMDHSA LLVM Sync Scopes
323 :name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tyef16a45e2017-06-06 20:31:59 +0000324
325 ================ ==========================================================
326 LLVM Sync Scope Description
327 ================ ==========================================================
328 *none* The default: ``system``.
329
330 Synchronizes with, and participates in modification and
331 seq_cst total orderings with, other operations (except
332 image operations) for all address spaces (except private,
333 or generic that accesses private) provided the other
334 operation's sync scope is:
335
336 - ``system``.
337 - ``agent`` and executed by a thread on the same agent.
338 - ``workgroup`` and executed by a thread in the same
339 workgroup.
340 - ``wavefront`` and executed by a thread in the same
341 wavefront.
342
343 ``agent`` Synchronizes with, and participates in modification and
344 seq_cst total orderings with, other operations (except
345 image operations) for all address spaces (except private,
346 or generic that accesses private) provided the other
347 operation's sync scope is:
348
349 - ``system`` or ``agent`` and executed by a thread on the
350 same agent.
351 - ``workgroup`` and executed by a thread in the same
352 workgroup.
353 - ``wavefront`` and executed by a thread in the same
354 wavefront.
355
356 ``workgroup`` Synchronizes with, and participates in modification and
357 seq_cst total orderings with, other operations (except
358 image operations) for all address spaces (except private,
359 or generic that accesses private) provided the other
360 operation's sync scope is:
361
362 - ``system``, ``agent`` or ``workgroup`` and executed by a
363 thread in the same workgroup.
364 - ``wavefront`` and executed by a thread in the same
365 wavefront.
366
367 ``wavefront`` Synchronizes with, and participates in modification and
368 seq_cst total orderings with, other operations (except
369 image operations) for all address spaces (except private,
370 or generic that accesses private) provided the other
371 operation's sync scope is:
372
373 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
374 and executed by a thread in the same wavefront.
375
376 ``singlethread`` Only synchronizes with, and participates in modification
377 and seq_cst total orderings with, other operations (except
378 image operations) running in the same thread for all
379 address spaces (for example, in signal handlers).
380 ================ ==========================================================
381
382AMDGPU Intrinsics
383-----------------
384
385The AMDGPU backend implements the following intrinsics.
386
387*This section is WIP.*
388
389.. TODO
390 List AMDGPU intrinsics
391
392Code Object
393===========
394
395The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
396can be linked by ``lld`` to produce a standard ELF shared code object which can
397be loaded and executed on an AMDGPU target.
398
399Header
400------
401
402The AMDGPU backend uses the following ELF header:
403
404 .. table:: AMDGPU ELF Header
405 :name: amdgpu-elf-header-table
406
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000407 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000408 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000409 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000410 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
411 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye07d9f102017-11-10 01:00:54 +0000412 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
413 - ``ELFOSABI_AMDGPU_HSA``
414 - ``ELFOSABI_AMDGPU_PAL``
415 - ``ELFOSABI_AMDGPU_MESA3D``
416 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
417 - ``ELFABIVERSION_AMDGPU_PAL``
418 - ``ELFABIVERSION_AMDGPU_MESA3D``
419 ``e_type`` - ``ET_REL``
420 - ``ET_DYN``
Tony Tyef16a45e2017-06-06 20:31:59 +0000421 ``e_machine`` ``EM_AMDGPU``
422 ``e_entry`` 0
Tony Tye07d9f102017-11-10 01:00:54 +0000423 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000424 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000425
426..
427
428 .. table:: AMDGPU ELF Header Enumeration Values
429 :name: amdgpu-elf-header-enumeration-values-table
430
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000431 =============================== =====
432 Name Value
433 =============================== =====
434 ``EM_AMDGPU`` 224
Tony Tye07d9f102017-11-10 01:00:54 +0000435 ``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000436 ``ELFOSABI_AMDGPU_HSA`` 64
437 ``ELFOSABI_AMDGPU_PAL`` 65
438 ``ELFOSABI_AMDGPU_MESA3D`` 66
439 ``ELFABIVERSION_AMDGPU_HSA`` 1
440 ``ELFABIVERSION_AMDGPU_PAL`` 0
441 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
442 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000443
444``e_ident[EI_CLASS]``
Tony Tye07d9f102017-11-10 01:00:54 +0000445 The ELF class is:
446
447 * ``ELFCLASS32`` for ``r600`` architecture.
448
449 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
450 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000451
452``e_ident[EI_DATA]``
Tony Tye07d9f102017-11-10 01:00:54 +0000453 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tyef16a45e2017-06-06 20:31:59 +0000454
455``e_ident[EI_OSABI]``
Tony Tye07d9f102017-11-10 01:00:54 +0000456 One of the following AMD GPU architecture specific OS ABIs
457 (see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000458
Tony Tye07d9f102017-11-10 01:00:54 +0000459 * ``ELFOSABI_NONE`` for *unknown* OS.
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000460
Tony Tye07d9f102017-11-10 01:00:54 +0000461 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tyef16a45e2017-06-06 20:31:59 +0000462
Tony Tye07d9f102017-11-10 01:00:54 +0000463 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
464
465 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000466
Tony Tyef16a45e2017-06-06 20:31:59 +0000467``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000468 The ABI version of the AMD GPU architecture specific OS ABI to which the code
469 object conforms:
470
471 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
472 runtime ABI.
473
474 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
475 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000476
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000477 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye07d9f102017-11-10 01:00:54 +0000478 3D runtime ABI.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000479
Tony Tyef16a45e2017-06-06 20:31:59 +0000480``e_type``
481 Can be one of the following values:
482
483
484 ``ET_REL``
485 The type produced by the AMD GPU backend compiler as it is relocatable code
486 object.
487
488 ``ET_DYN``
489 The type produced by the linker as it is a shared code object.
490
491 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
492
493``e_machine``
Tony Tye07d9f102017-11-10 01:00:54 +0000494 The value ``EM_AMDGPU`` is used for the machine for all processors supported
495 by the ``r600`` and ``amdgcn`` architectures (see
496 :ref:`amdgpu-processor-table`). The specific processor is specified in the
497 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
498 :ref:`amdgpu-elf-header-e_flags-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000499
500``e_entry``
501 The entry point is 0 as the entry points for individual kernels must be
502 selected in order to invoke them through AQL packets.
503
504``e_flags``
Tony Tye07d9f102017-11-10 01:00:54 +0000505 The AMDGPU backend uses the following ELF header flags:
506
507 .. table:: AMDGPU ELF Header ``e_flags``
508 :name: amdgpu-elf-header-e_flags-table
509
510 ================================= ========== =============================
511 Name Value Description
512 ================================= ========== =============================
513 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
514 -------------------------------------------- -----------------------------
515 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
516 mask for
517 ``EF_AMDGPU_MACH_xxx`` values
518 defined in
519 :ref:`amdgpu-ef-amdgpu-mach-table`.
520 ================================= ========== =============================
521
522 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
523 :name: amdgpu-ef-amdgpu-mach-table
524
525 ================================= ========== =============================
526 Name Value Description (see
527 :ref:`amdgpu-processor-table`)
528 ================================= ========== =============================
529 ``EF_AMDGPU_MACH_NONE`` 0 *not specified*
530 ``EF_AMDGPU_MACH_R600_R600`` 1 ``r600``
531 ``EF_AMDGPU_MACH_R600_R630`` 2 ``r630``
532 ``EF_AMDGPU_MACH_R600_RS880`` 3 ``rs880``
533 ``EF_AMDGPU_MACH_R600_RV670`` 4 ``rv670``
534 ``EF_AMDGPU_MACH_R600_RV710`` 5 ``rv710``
535 ``EF_AMDGPU_MACH_R600_RV730`` 6 ``rv730``
536 ``EF_AMDGPU_MACH_R600_RV770`` 7 ``rv770``
537 ``EF_AMDGPU_MACH_R600_CEDAR`` 8 ``cedar``
538 ``EF_AMDGPU_MACH_R600_REDWOOD`` 9 ``redwood``
539 ``EF_AMDGPU_MACH_R600_SUMO`` 10 ``sumo``
540 ``EF_AMDGPU_MACH_R600_JUNIPER`` 11 ``juniper``
541 ``EF_AMDGPU_MACH_R600_CYPRESS`` 12 ``cypress``
542 ``EF_AMDGPU_MACH_R600_BARTS`` 13 ``barts``
543 ``EF_AMDGPU_MACH_R600_TURKS`` 14 ``turks``
544 ``EF_AMDGPU_MACH_R600_CAICOS`` 15 ``caicos``
545 ``EF_AMDGPU_MACH_R600_CAYMAN`` 16 ``cayman``
546 *reserved* 17-31 Reserved for ``r600``
547 architecture processors.
548 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 32 ``gfx600``
549 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 33 ``gfx601``
550 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 34 ``gfx700``
551 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 35 ``gfx701``
552 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 36 ``gfx702``
553 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 37 ``gfx703``
554 ``EF_AMDGPU_MACH_AMDGCN_GFX800`` 38 ``gfx800``
555 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 39 ``gfx801``
556 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 40 ``gfx802``
557 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 41 ``gfx803``
558 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 42 ``gfx810``
559 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 43 ``gfx900``
560 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 44 ``gfx902``
561 ================================= ========== =============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000562
563Sections
564--------
565
566An AMDGPU target ELF code object has the standard ELF sections which include:
567
568 .. table:: AMDGPU ELF Sections
569 :name: amdgpu-elf-sections-table
570
571 ================== ================ =================================
572 Name Type Attributes
573 ================== ================ =================================
574 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
575 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
576 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
577 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
578 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
579 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
580 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
581 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
582 ``.note`` ``SHT_NOTE`` *none*
583 ``.rela``\ *name* ``SHT_RELA`` *none*
584 ``.rela.dyn`` ``SHT_RELA`` *none*
585 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
586 ``.shstrtab`` ``SHT_STRTAB`` *none*
587 ``.strtab`` ``SHT_STRTAB`` *none*
588 ``.symtab`` ``SHT_SYMTAB`` *none*
589 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
590 ================== ================ =================================
591
592These sections have their standard meanings (see [ELF]_) and are only generated
593if needed.
594
595``.debug``\ *\**
596 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
597 DWARF produced by the AMDGPU backend.
598
Tony Tye46d35762017-08-15 20:47:41 +0000599``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000600 The standard sections used by a dynamic loader.
601
602``.note``
603 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
604 backend.
605
606``.rela``\ *name*, ``.rela.dyn``
607 For relocatable code objects, *name* is the name of the section that the
608 relocation records apply. For example, ``.rela.text`` is the section name for
609 relocation records associated with the ``.text`` section.
610
611 For linked shared code objects, ``.rela.dyn`` contains all the relocation
612 records from each of the relocatable code object's ``.rela``\ *name* sections.
613
614 See :ref:`amdgpu-relocation-records` for the relocation records supported by
615 the AMDGPU backend.
616
617``.text``
618 The executable machine code for the kernels and functions they call. Generated
619 as position independent code. See :ref:`amdgpu-code-conventions` for
620 information on conventions used in the isa generation.
621
622.. _amdgpu-note-records:
623
624Note Records
625------------
626
Tony Tye07d9f102017-11-10 01:00:54 +0000627As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
628be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
629aligned. In addition, minimal zero byte padding must be generated to ensure the
630``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
631``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tyef16a45e2017-06-06 20:31:59 +0000632
633The AMDGPU backend code object uses the following ELF note records in the
634``.note`` section. The *Description* column specifies the layout of the note
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000635record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tyef16a45e2017-06-06 20:31:59 +0000636variable size strings have a corresponding ``*_size`` field that specifies the
637number of bytes, including the terminating null character, in the string. The
638string(s) come immediately after the preceding fields.
639
640Additional note records can be present.
641
642 .. table:: AMDGPU ELF Note Records
643 :name: amdgpu-elf-note-records-table
644
Tony Tye46d35762017-08-15 20:47:41 +0000645 ===== ============================== ======================================
646 Name Type Description
647 ===== ============================== ======================================
648 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye46d35762017-08-15 20:47:41 +0000649 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000650
651..
652
653 .. table:: AMDGPU ELF Note Record Enumeration Values
654 :name: amdgpu-elf-note-record-enumeration-values-table
655
Tony Tye46d35762017-08-15 20:47:41 +0000656 ============================== =====
657 Name Value
658 ============================== =====
659 *reserved* 0-9
660 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye07d9f102017-11-10 01:00:54 +0000661 *reserved* 11
Tony Tye46d35762017-08-15 20:47:41 +0000662 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000663
Tony Tye46d35762017-08-15 20:47:41 +0000664``NT_AMD_AMDGPU_HSA_METADATA``
665 Specifies extensible metadata associated with the code objects executed on HSA
666 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
667 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
668 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
669 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000670
Tony Tye46d35762017-08-15 20:47:41 +0000671.. _amdgpu-symbols:
672
673Symbols
674-------
675
676Symbols include the following:
677
678 .. table:: AMDGPU ELF Symbols
679 :name: amdgpu-elf-symbols-table
680
681 ===================== ============== ============= ==================
682 Name Type Section Description
683 ===================== ============== ============= ==================
684 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
685 - ``.rodata``
686 - ``.bss``
687 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
688 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
689 ===================== ============== ============= ==================
690
691Global variable
692 Global variables both used and defined by the compilation unit.
693
694 If the symbol is defined in the compilation unit then it is allocated in the
695 appropriate section according to if it has initialized data or is readonly.
696
697 If the symbol is external then its section is ``STN_UNDEF`` and the loader
698 will resolve relocations using the definition provided by another code object
699 or explicitly defined by the runtime.
700
701 All global symbols, whether defined in the compilation unit or external, are
702 accessed by the machine code indirectly through a GOT table entry. This
703 allows them to be preemptable. The GOT table is only supported when the target
704 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000705
706 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000707 Add description of linked shared object symbols. Seems undefined symbols
708 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000709
Tony Tye46d35762017-08-15 20:47:41 +0000710Kernel descriptor
711 Every HSA kernel has an associated kernel descriptor. It is the address of the
712 kernel descriptor that is used in the AQL dispatch packet used to invoke the
713 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
714 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
715
716Kernel entry point
717 Every HSA kernel also has a symbol for its machine code entry point.
718
719.. _amdgpu-relocation-records:
720
721Relocation Records
722------------------
723
724AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
725relocatable fields are:
726
727``word32``
728 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
729 alignment. These values use the same byte order as other word values in the
730 AMD GPU architecture.
731
732``word64``
733 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
734 alignment. These values use the same byte order as other word values in the
735 AMD GPU architecture.
736
737Following notations are used for specifying relocation calculations:
738
739**A**
740 Represents the addend used to compute the value of the relocatable field.
741
742**G**
743 Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000744 entry's symbol will reside during execution.
Tony Tye46d35762017-08-15 20:47:41 +0000745
746**GOT**
747 Represents the address of the global offset table.
748
749**P**
750 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
751 of the storage unit being relocated (computed using ``r_offset``).
752
753**S**
754 Represents the value of the symbol whose index resides in the relocation
Tony Tyed2884302017-10-16 20:44:29 +0000755 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
756
757**B**
758 Represents the base address of a loaded executable or shared object which is
759 the difference between the ELF address and the actual load address. Relocations
760 using this are only valid in executable or shared objects.
Tony Tye46d35762017-08-15 20:47:41 +0000761
762The following relocation types are supported:
763
764 .. table:: AMDGPU ELF Relocation Records
765 :name: amdgpu-elf-relocation-records-table
766
767 ========================== ===== ========== ==============================
768 Relocation Type Value Field Calculation
769 ========================== ===== ========== ==============================
770 ``R_AMDGPU_NONE`` 0 *none* *none*
771 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
772 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
773 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
774 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
775 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
776 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
777 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
778 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
779 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
780 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
781 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
Tony Tyed2884302017-10-16 20:44:29 +0000782 *reserved* 12
783 ``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
Tony Tye46d35762017-08-15 20:47:41 +0000784 ========================== ===== ========== ==============================
785
786.. _amdgpu-dwarf:
787
788DWARF
789-----
790
791Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
792information that maps the code object executable code and data to the source
793language constructs. It can be used by tools such as debuggers and profilers.
794
795Address Space Mapping
796~~~~~~~~~~~~~~~~~~~~~
797
798The following address space mapping is used:
799
800 .. table:: AMDGPU DWARF Address Space Mapping
801 :name: amdgpu-dwarf-address-space-mapping-table
802
803 =================== =================
804 DWARF Address Space Memory Space
805 =================== =================
806 1 Private (Scratch)
807 2 Local (group/LDS)
808 *omitted* Global
809 *omitted* Constant
810 *omitted* Generic (Flat)
811 *not supported* Region (GDS)
812 =================== =================
813
814See :ref:`amdgpu-address-spaces` for information on the memory space terminology
815used in the table.
816
817An ``address_class`` attribute is generated on pointer type DIEs to specify the
818DWARF address space of the value of the pointer when it is in the *private* or
819*local* address space. Otherwise the attribute is omitted.
820
821An ``XDEREF`` operation is generated in location list expressions for variables
822that are allocated in the *private* and *local* address space. Otherwise no
823``XDREF`` is omitted.
824
825Register Mapping
826~~~~~~~~~~~~~~~~
827
828*This section is WIP.*
829
830.. TODO
831 Define DWARF register enumeration.
832
833 If want to present a wavefront state then should expose vector registers as
834 64 wide (rather than per work-item view that LLVM uses). Either as separate
835 registers, or a 64x4 byte single register. In either case use a new LANE op
836 (akin to XDREF) to select the current lane usage in a location
837 expression. This would also allow scalar register spilling to vector register
838 lanes to be expressed (currently no debug information is being generated for
839 spilling). If choose a wide single register approach then use LANE in
840 conjunction with PIECE operation to select the dword part of the register for
841 the current lane. If the separate register approach then use LANE to select
842 the register.
843
844Source Text
845~~~~~~~~~~~
846
847*This section is WIP.*
848
849.. TODO
850 DWARF extension to include runtime generated source text.
851
852.. _amdgpu-code-conventions:
853
854Code Conventions
855================
856
857This section provides code conventions used for each supported target triple OS
858(see :ref:`amdgpu-target-triples`).
859
860AMDHSA
861------
862
863This section provides code conventions used when the target triple OS is
864``amdhsa`` (see :ref:`amdgpu-target-triples`).
865
866.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000867
868Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000869~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000870
Tony Tye46d35762017-08-15 20:47:41 +0000871The code object metadata specifies extensible metadata associated with the code
872objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
873[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
874(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
875``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
876information necessary to support the ROCM kernel queries. For example, the
877segment sizes needed in a dispatch packet. In addition, a high level language
878runtime may require other information to be included. For example, the AMD
879OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000880
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000881The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000882:doc:`YamlIO`).
883
Tony Tye46d35762017-08-15 20:47:41 +0000884.. TODO
885 Is the string null terminated? It probably should not if YAML allows it to
886 contain null characters, otherwise it should be.
887
Tony Tyef16a45e2017-06-06 20:31:59 +0000888The metadata is represented as a single YAML document comprised of the mapping
889defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
890referenced tables.
891
892For boolean values, the string values of ``false`` and ``true`` are used for
893false and true respectively.
894
895Additional information can be added to the mappings. To avoid conflicts, any
896non-AMD key names should be prefixed by "*vendor-name*.".
897
898 .. table:: AMDHSA Code Object Metadata Mapping
899 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
900
901 ========== ============== ========= =======================================
902 String Key Value Type Required? Description
903 ========== ============== ========= =======================================
904 "Version" sequence of Required - The first integer is the major
905 2 integers version. Currently 1.
906 - The second integer is the minor
907 version. Currently 0.
908 "Printf" sequence of Each string is encoded information
909 strings about a printf function call. The
910 encoded information is organized as
911 fields separated by colon (':'):
912
913 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
914
915 where:
916
917 ``ID``
918 A 32 bit integer as a unique id for
919 each printf function call
920
921 ``N``
922 A 32 bit integer equal to the number
923 of arguments of printf function call
924 minus 1
925
926 ``S[i]`` (where i = 0, 1, ... , N-1)
927 32 bit integers for the size in bytes
928 of the i-th FormatString argument of
929 the printf function call
930
931 FormatString
932 The format string passed to the
933 printf function call.
934 "Kernels" sequence of Required Sequence of the mappings for each
935 mapping kernel in the code object. See
936 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
937 for the definition of the mapping.
938 ========== ============== ========= =======================================
939
940..
941
942 .. table:: AMDHSA Code Object Kernel Metadata Mapping
943 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
944
945 ================= ============== ========= ================================
946 String Key Value Type Required? Description
947 ================= ============== ========= ================================
948 "Name" string Required Source name of the kernel.
949 "SymbolName" string Required Name of the kernel
950 descriptor ELF symbol.
951 "Language" string Source language of the kernel.
952 Values include:
953
954 - "OpenCL C"
955 - "OpenCL C++"
956 - "HCC"
957 - "OpenMP"
958
959 "LanguageVersion" sequence of - The first integer is the major
960 2 integers version.
961 - The second integer is the
962 minor version.
963 "Attrs" mapping Mapping of kernel attributes.
964 See
965 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
966 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +0000967 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +0000968 mapping kernel arguments. See
969 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
970 for the definition of the mapping.
971 "CodeProps" mapping Mapping of properties related to
972 the kernel code. See
973 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
974 for the mapping definition.
Tony Tyef16a45e2017-06-06 20:31:59 +0000975 ================= ============== ========= ================================
976
977..
978
979 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
980 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
981
982 =================== ============== ========= ==============================
983 String Key Value Type Required? Description
984 =================== ============== ========= ==============================
985 "ReqdWorkGroupSize" sequence of The dispatch work-group size
986 3 integers X, Y, Z must correspond to the
987 specified values.
988
989 Corresponds to the OpenCL
990 ``reqd_work_group_size``
991 attribute.
992 "WorkGroupSizeHint" sequence of The dispatch work-group size
993 3 integers X, Y, Z is likely to be the
994 specified values.
995
996 Corresponds to the OpenCL
997 ``work_group_size_hint``
998 attribute.
999 "VecTypeHint" string The name of a scalar or vector
1000 type.
1001
1002 Corresponds to the OpenCL
1003 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +00001004
1005 "RuntimeHandle" string The external symbol name
1006 associated with a kernel.
1007 OpenCL runtime allocates a
1008 global buffer for the symbol
1009 and saves the kernel's address
1010 to it, which is used for
1011 device side enqueueing. Only
1012 available for device side
1013 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +00001014 =================== ============== ========= ==============================
1015
1016..
1017
1018 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1019 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1020
1021 ================= ============== ========= ================================
1022 String Key Value Type Required? Description
1023 ================= ============== ========= ================================
1024 "Name" string Kernel argument name.
1025 "TypeName" string Kernel argument type name.
1026 "Size" integer Required Kernel argument size in bytes.
1027 "Align" integer Required Kernel argument alignment in
1028 bytes. Must be a power of two.
1029 "ValueKind" string Required Kernel argument kind that
1030 specifies how to set up the
1031 corresponding argument.
1032 Values include:
1033
1034 "ByValue"
1035 The argument is copied
1036 directly into the kernarg.
1037
1038 "GlobalBuffer"
1039 A global address space pointer
1040 to the buffer data is passed
1041 in the kernarg.
1042
1043 "DynamicSharedPointer"
1044 A group address space pointer
1045 to dynamically allocated LDS
1046 is passed in the kernarg.
1047
1048 "Sampler"
1049 A global address space
1050 pointer to a S# is passed in
1051 the kernarg.
1052
1053 "Image"
1054 A global address space
1055 pointer to a T# is passed in
1056 the kernarg.
1057
1058 "Pipe"
1059 A global address space pointer
1060 to an OpenCL pipe is passed in
1061 the kernarg.
1062
1063 "Queue"
1064 A global address space pointer
1065 to an OpenCL device enqueue
1066 queue is passed in the
1067 kernarg.
1068
1069 "HiddenGlobalOffsetX"
1070 The OpenCL grid dispatch
1071 global offset for the X
1072 dimension is passed in the
1073 kernarg.
1074
1075 "HiddenGlobalOffsetY"
1076 The OpenCL grid dispatch
1077 global offset for the Y
1078 dimension is passed in the
1079 kernarg.
1080
1081 "HiddenGlobalOffsetZ"
1082 The OpenCL grid dispatch
1083 global offset for the Z
1084 dimension is passed in the
1085 kernarg.
1086
1087 "HiddenNone"
1088 An argument that is not used
1089 by the kernel. Space needs to
1090 be left for it, but it does
1091 not need to be set up.
1092
1093 "HiddenPrintfBuffer"
1094 A global address space pointer
1095 to the runtime printf buffer
1096 is passed in kernarg.
1097
1098 "HiddenDefaultQueue"
1099 A global address space pointer
1100 to the OpenCL device enqueue
1101 queue that should be used by
1102 the kernel by default is
1103 passed in the kernarg.
1104
1105 "HiddenCompletionAction"
Yaxun Liuc928f2a2017-10-30 14:30:28 +00001106 A global address space pointer
1107 to help link enqueued kernels into
1108 the ancestor tree for determining
1109 when the parent kernel has finished.
Tony Tyef16a45e2017-06-06 20:31:59 +00001110
1111 "ValueType" string Required Kernel argument value type. Only
1112 present if "ValueKind" is
1113 "ByValue". For vector data
1114 types, the value is for the
1115 element type. Values include:
1116
1117 - "Struct"
1118 - "I8"
1119 - "U8"
1120 - "I16"
1121 - "U16"
1122 - "F16"
1123 - "I32"
1124 - "U32"
1125 - "F32"
1126 - "I64"
1127 - "U64"
1128 - "F64"
1129
1130 .. TODO
1131 How can it be determined if a
1132 vector type, and what size
1133 vector?
1134 "PointeeAlign" integer Alignment in bytes of pointee
1135 type for pointer type kernel
1136 argument. Must be a power
1137 of 2. Only present if
1138 "ValueKind" is
1139 "DynamicSharedPointer".
1140 "AddrSpaceQual" string Kernel argument address space
1141 qualifier. Only present if
1142 "ValueKind" is "GlobalBuffer" or
1143 "DynamicSharedPointer". Values
1144 are:
1145
1146 - "Private"
1147 - "Global"
1148 - "Constant"
1149 - "Local"
1150 - "Generic"
1151 - "Region"
1152
1153 .. TODO
1154 Is GlobalBuffer only Global
1155 or Constant? Is
1156 DynamicSharedPointer always
1157 Local? Can HCC allow Generic?
1158 How can Private or Region
1159 ever happen?
1160 "AccQual" string Kernel argument access
1161 qualifier. Only present if
1162 "ValueKind" is "Image" or
1163 "Pipe". Values
1164 are:
1165
1166 - "ReadOnly"
1167 - "WriteOnly"
1168 - "ReadWrite"
1169
1170 .. TODO
1171 Does this apply to
1172 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001173 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001174 performed by the kernel on the
1175 kernel argument. Only present if
1176 "ValueKind" is "GlobalBuffer",
1177 "Image", or "Pipe". This may be
1178 more restrictive than indicated
1179 by "AccQual" to reflect what the
1180 kernel actual does. If not
1181 present then the runtime must
1182 assume what is implied by
1183 "AccQual" and "IsConst". Values
1184 are:
1185
1186 - "ReadOnly"
1187 - "WriteOnly"
1188 - "ReadWrite"
1189
1190 "IsConst" boolean Indicates if the kernel argument
1191 is const qualified. Only present
1192 if "ValueKind" is
1193 "GlobalBuffer".
1194
1195 "IsRestrict" boolean Indicates if the kernel argument
1196 is restrict qualified. Only
1197 present if "ValueKind" is
1198 "GlobalBuffer".
1199
1200 "IsVolatile" boolean Indicates if the kernel argument
1201 is volatile qualified. Only
1202 present if "ValueKind" is
1203 "GlobalBuffer".
1204
1205 "IsPipe" boolean Indicates if the kernel argument
1206 is pipe qualified. Only present
1207 if "ValueKind" is "Pipe".
1208
1209 .. TODO
1210 Can GlobalBuffer be pipe
1211 qualified?
1212 ================= ============== ========= ================================
1213
1214..
1215
1216 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1217 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1218
1219 ============================ ============== ========= =====================
1220 String Key Value Type Required? Description
1221 ============================ ============== ========= =====================
1222 "KernargSegmentSize" integer Required The size in bytes of
1223 the kernarg segment
1224 that holds the values
1225 of the arguments to
1226 the kernel.
1227 "GroupSegmentFixedSize" integer Required The amount of group
1228 segment memory
1229 required by a
1230 work-group in
1231 bytes. This does not
1232 include any
1233 dynamically allocated
1234 group segment memory
1235 that may be added
1236 when the kernel is
1237 dispatched.
1238 "PrivateSegmentFixedSize" integer Required The amount of fixed
1239 private address space
1240 memory required for a
1241 work-item in
Tony Tye07d9f102017-11-10 01:00:54 +00001242 bytes. If the kernel
1243 uses a dynamic call
1244 stack then additional
Tony Tyef16a45e2017-06-06 20:31:59 +00001245 space must be added
1246 to this value for the
1247 call stack.
1248 "KernargSegmentAlign" integer Required The maximum byte
1249 alignment of
1250 arguments in the
1251 kernarg segment. Must
1252 be a power of 2.
1253 "WavefrontSize" integer Required Wavefront size. Must
1254 be a power of 2.
Tony Tye07d9f102017-11-10 01:00:54 +00001255 "NumSGPRs" integer Required Number of scalar
Tony Tyef16a45e2017-06-06 20:31:59 +00001256 registers used by a
1257 wavefront for
1258 GFX6-GFX9. This
1259 includes the special
1260 SGPRs for VCC, Flat
1261 Scratch (GFX7-GFX9)
1262 and XNACK (for
1263 GFX8-GFX9). It does
1264 not include the 16
1265 SGPR added if a trap
1266 handler is
1267 enabled. It is not
1268 rounded up to the
1269 allocation
1270 granularity.
Tony Tye07d9f102017-11-10 01:00:54 +00001271 "NumVGPRs" integer Required Number of vector
Tony Tyef16a45e2017-06-06 20:31:59 +00001272 registers used by
1273 each work-item for
1274 GFX6-GFX9
Tony Tye07d9f102017-11-10 01:00:54 +00001275 "MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tyef16a45e2017-06-06 20:31:59 +00001276 work-group size
1277 supported by the
1278 kernel in work-items.
Tony Tye07d9f102017-11-10 01:00:54 +00001279 Must be >=1 and
1280 consistent with any
1281 non-0 values in
1282 FixedWorkGroupSize.
1283 "FixedWorkGroupSize" sequence of Corresponds to the
1284 3 integers dispatch work-group
1285 size X, Y, Z. If
1286 omitted, defaults to
1287 0, 0, 0. If an
1288 element is non-0 then
1289 the kernel must only
1290 be launched with a
1291 matching corresponding
1292 work-group size.
Tony Tyef16a45e2017-06-06 20:31:59 +00001293 "IsXNACKEnabled" boolean Indicates if the
1294 generated machine
1295 code is capable of
Tony Tye07d9f102017-11-10 01:00:54 +00001296 supporting XNACK. See
1297 :ref:`amdgpu-target-features`.
Konstantin Zhuravlyov06ae4ec2017-11-28 17:51:08 +00001298 "NumSpilledSGPRs" integer Number of stores from
1299 a scalar register to
1300 a register allocator
1301 created spill
1302 location.
1303 "NumSpilledVGPRs" integer Number of stores from
1304 a vector register to
1305 a register allocator
1306 created spill
1307 location.
Tony Tyef16a45e2017-06-06 20:31:59 +00001308 ============================ ============== ========= =====================
1309
1310..
1311
Tony Tyef16a45e2017-06-06 20:31:59 +00001312Kernel Dispatch
1313~~~~~~~~~~~~~~~
1314
1315The HSA architected queuing language (AQL) defines a user space memory interface
1316that can be used to control the dispatch of kernels, in an agent independent
1317way. An agent can have zero or more AQL queues created for it using the ROCm
1318runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1319*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1320mechanics and packet layouts.
1321
1322The packet processor of a kernel agent is responsible for detecting and
1323dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1324packet processor is implemented by the hardware command processor (CP),
1325asynchronous dispatch controller (ADC) and shader processor input controller
1326(SPI).
1327
1328The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1329mode driver to initialize and register the AQL queue with CP.
1330
1331To dispatch a kernel the following actions are performed. This can occur in the
1332CPU host program, or from an HSA kernel executing on a GPU.
1333
13341. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1335 executed is obtained.
13362. A pointer to the kernel descriptor (see
1337 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1338 obtained. It must be for a kernel that is contained in a code object that that
1339 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1340 associated.
13413. Space is allocated for the kernel arguments using the ROCm runtime allocator
1342 for a memory region with the kernarg property for the kernel agent that will
1343 execute the kernel. It must be at least 16 byte aligned.
13444. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00001345 allocation. The layout is defined in the *HSA Programmer's Language Reference*
Tony Tyef16a45e2017-06-06 20:31:59 +00001346 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1347 memory in the same way constant memory is accessed. (Note that the HSA
1348 specification allows an implementation to copy the kernel argument contents to
1349 another location that is accessed by the kernel.)
13505. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1351 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1352 packet. The packet must be set up, and the final write must use an atomic
1353 store release to set the packet kind to ensure the packet contents are
1354 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1355 notify the kernel agent that the AQL queue has been updated. These rules, and
1356 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1357 System Architecture Specification* [HSA]_.
13586. A kernel dispatch packet includes information about the actual dispatch,
1359 such as grid and work-group size, together with information from the code
1360 object about the kernel, such as segment sizes. The ROCm runtime queries on
1361 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001362 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000013637. CP executes micro-code and is responsible for detecting and setting up the
1364 GPU to execute the wavefronts of a kernel dispatch.
13658. CP ensures that when the a wavefront starts executing the kernel machine
1366 code, the scalar general purpose registers (SGPR) and vector general purpose
1367 registers (VGPR) are set up as required by the machine code. The required
1368 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1369 register state is defined in
1370 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13719. The prolog of the kernel machine code (see
1372 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1373 before continuing executing the machine code that corresponds to the kernel.
137410. When the kernel dispatch has completed execution, CP signals the completion
1375 signal specified in the kernel dispatch packet if not 0.
1376
1377.. _amdgpu-amdhsa-memory-spaces:
1378
1379Memory Spaces
1380~~~~~~~~~~~~~
1381
1382The memory space properties are:
1383
1384 .. table:: AMDHSA Memory Spaces
1385 :name: amdgpu-amdhsa-memory-spaces-table
1386
1387 ================= =========== ======== ======= ==================
1388 Memory Space Name HSA Segment Hardware Address NULL Value
1389 Name Name Size
1390 ================= =========== ======== ======= ==================
1391 Private private scratch 32 0x00000000
1392 Local group LDS 32 0xFFFFFFFF
1393 Global global global 64 0x0000000000000000
1394 Constant constant *same as 64 0x0000000000000000
1395 global*
1396 Generic flat flat 64 0x0000000000000000
1397 Region N/A GDS 32 *not implemented
1398 for AMDHSA*
1399 ================= =========== ======== ======= ==================
1400
1401The global and constant memory spaces both use global virtual addresses, which
1402are the same virtual address space used by the CPU. However, some virtual
1403addresses may only be accessible to the CPU, some only accessible by the GPU,
1404and some by both.
1405
1406Using the constant memory space indicates that the data will not change during
1407the execution of the kernel. This allows scalar read instructions to be
1408used. The vector and scalar L1 caches are invalidated of volatile data before
1409each kernel dispatch execution to allow constant memory to change values between
1410kernel dispatches.
1411
1412The local memory space uses the hardware Local Data Store (LDS) which is
1413automatically allocated when the hardware creates work-groups of wavefronts, and
1414freed when all the wavefronts of a work-group have terminated. The data store
1415(DS) instructions can be used to access it.
1416
1417The private memory space uses the hardware scratch memory support. If the kernel
1418uses scratch, then the hardware allocates memory that is accessed using
1419wavefront lane dword (4 byte) interleaving. The mapping used from private
1420address to physical address is:
1421
1422 ``wavefront-scratch-base +
1423 (private-address * wavefront-size * 4) +
1424 (wavefront-lane-id * 4)``
1425
1426There are different ways that the wavefront scratch base address is determined
1427by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1428memory can be accessed in an interleaved manner using buffer instruction with
1429the scratch buffer descriptor and per wave scratch offset, by the scratch
1430instructions, or by flat instructions. If each lane of a wavefront accesses the
1431same private address, the interleaving results in adjacent dwords being accessed
1432and hence requires fewer cache lines to be fetched. Multi-dword access is not
1433supported except by flat and scratch instructions in GFX9.
1434
1435The generic address space uses the hardware flat address support available in
1436GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1437local appertures), that are outside the range of addressible global memory, to
1438map from a flat address to a private or local address.
1439
1440FLAT instructions can take a flat address and access global, private (scratch)
1441and group (LDS) memory depending in if the address is within one of the
1442apperture ranges. Flat access to scratch requires hardware aperture setup and
1443setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1444access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1445(see :ref:`amdgpu-amdhsa-m0`).
1446
1447To convert between a segment address and a flat address the base address of the
1448appertures address can be used. For GFX7-GFX8 these are available in the
1449:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1450Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1451GFX9 the appature base addresses are directly available as inline constant
1452registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1453address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1454which makes it easier to convert from flat to segment or segment to flat.
1455
Tony Tye46d35762017-08-15 20:47:41 +00001456Image and Samplers
1457~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001458
1459Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1460hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1461HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1462enumeration values for the queries that are not trivially deducible from the S#
1463representation.
1464
1465HSA Signals
1466~~~~~~~~~~~
1467
Tony Tye46d35762017-08-15 20:47:41 +00001468HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1469structure allocated in memory accessible from both the CPU and GPU. The
1470structure is defined by the ROCm runtime and subject to change between releases
1471(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001472
1473.. _amdgpu-amdhsa-hsa-aql-queue:
1474
1475HSA AQL Queue
1476~~~~~~~~~~~~~
1477
Tony Tye46d35762017-08-15 20:47:41 +00001478The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001479between releases (see [AMD-ROCm-github]_). For some processors it contains
1480fields needed to implement certain language features such as the flat address
1481aperture bases. It also contains fields used by CP such as managing the
1482allocation of scratch memory.
1483
1484.. _amdgpu-amdhsa-kernel-descriptor:
1485
1486Kernel Descriptor
1487~~~~~~~~~~~~~~~~~
1488
1489A kernel descriptor consists of the information needed by CP to initiate the
1490execution of a kernel, including the entry point address of the machine code
1491that implements the kernel.
1492
1493Kernel Descriptor for GFX6-GFX9
1494+++++++++++++++++++++++++++++++
1495
1496CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1497
1498 .. table:: Kernel Descriptor for GFX6-GFX9
1499 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1500
Tony Tye6baa6d22017-10-18 22:16:55 +00001501 ======= ======= =============================== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001502 Bits Size Field Name Description
Tony Tye6baa6d22017-10-18 22:16:55 +00001503 ======= ======= =============================== ============================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001504 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tyef16a45e2017-06-06 20:31:59 +00001505 address space memory
1506 required for a work-group
1507 in bytes. This does not
1508 include any dynamically
1509 allocated local address
1510 space memory that may be
1511 added when the kernel is
1512 dispatched.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001513 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tyef16a45e2017-06-06 20:31:59 +00001514 private address space
1515 memory required for a
1516 work-item in bytes. If
1517 is_dynamic_callstack is 1
1518 then additional space must
1519 be added to this value for
1520 the call stack.
Tony Tye07d9f102017-11-10 01:00:54 +00001521 127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001522 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tyef16a45e2017-06-06 20:31:59 +00001523 negative) from base
1524 address of kernel
1525 descriptor to kernel's
1526 entry point instruction
1527 which must be 256 byte
1528 aligned.
Tony Tye07d9f102017-11-10 01:00:54 +00001529 223:192 4 bytes MaxFlatWorkGroupSize Maximum flat work-group
1530 size supported by the
1531 kernel in work-items. If
1532 an exact work-group size
1533 is required then must be
1534 omitted or 0 and
1535 ReqdWorkGroupSize* must
1536 be set to non-0.
1537 239:224 2 bytes ReqdWorkGroupSizeX If present and non-0 then
1538 the kernel
1539 must be executed with the
1540 specified work-group size
1541 for X.
1542 255:240 2 bytes ReqdWorkGroupSizeY If present and non-0 then
1543 the kernel
1544 must be executed with the
1545 specified work-group size
1546 for Y.
1547 271:256 2 bytes ReqdWorkGroupSizeZ If present and non-0 then
1548 the kernel
1549 must be executed with the
1550 specified work-group size
1551 for Z.
1552 383:271 14 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001553 bytes
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001554 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001555 program settings used by
1556 CP to set up
1557 ``COMPUTE_PGM_RSRC1``
1558 configuration
1559 register. See
Tony Tye6baa6d22017-10-18 22:16:55 +00001560 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001561 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001562 program settings used by
1563 CP to set up
1564 ``COMPUTE_PGM_RSRC2``
1565 configuration
1566 register. See
1567 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001568 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1569 SGPR user data registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001570 (see
1571 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1572
1573 The total number of SGPR
1574 user data registers
1575 requested must not exceed
1576 16 and match value in
1577 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1578 Any requests beyond 16
1579 will be ignored.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001580 449 1 bit EnableSGPRDispatchPtr *see above*
1581 450 1 bit EnableSGPRQueuePtr *see above*
1582 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1583 452 1 bit EnableSGPRDispatchID *see above*
1584 453 1 bit EnableSGPRFlatScratchInit *see above*
1585 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1586 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1587 should always be 0.
1588 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1589 should always be 0.
1590 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1591 should always be 0.
Tony Tye07d9f102017-11-10 01:00:54 +00001592 462:458 5 bits Reserved, must be 0.
1593 463 1 bit IsXNACKEnabled Indicates if the generated
1594 machine code is capable of
1595 supporting XNACK.
Tony Tye6baa6d22017-10-18 22:16:55 +00001596 511:464 6 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001597 bytes
1598 512 **Total size 64 bytes.**
Tony Tye6baa6d22017-10-18 22:16:55 +00001599 ======= ====================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001600
1601..
1602
1603 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001604 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tyef16a45e2017-06-06 20:31:59 +00001605
Tony Tye3b340612017-06-07 00:46:08 +00001606 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001607 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001608 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001609 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001610 used by each work-item,
1611 granularity is device
1612 specific:
1613
Tony Tye07d9f102017-11-10 01:00:54 +00001614 GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001615 - max_vgpr 1..256
1616 - roundup((max_vgpg + 1)
1617 / 4) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001618
1619 Used by CP to set up
1620 ``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001621 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001622 used by a wavefront,
1623 granularity is device
1624 specific:
1625
Tony Tye07d9f102017-11-10 01:00:54 +00001626 GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001627 - max_sgpr 1..112
1628 - roundup((max_sgpg + 1)
1629 / 8) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001630 GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001631 - max_sgpr 1..112
1632 - roundup((max_sgpg + 1)
1633 / 16) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001634
1635 Includes the special SGPRs
1636 for VCC, Flat Scratch (for
1637 GFX7 onwards) and XNACK
1638 (for GFX8 onwards). It does
1639 not include the 16 SGPR
1640 added if a trap handler is
1641 enabled.
1642
1643 Used by CP to set up
1644 ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001645 11:10 2 bits PRIORITY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001646
1647 Start executing wavefront
1648 at the specified priority.
1649
1650 CP is responsible for
1651 filling in
1652 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001653 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001654 with specified rounding
1655 mode for single (32
1656 bit) floating point
1657 precision floating point
1658 operations.
1659
1660 Floating point rounding
1661 mode values are defined in
1662 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1663
1664 Used by CP to set up
1665 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001666 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001667 with specified rounding
1668 denorm mode for half/double (16
1669 and 64 bit) floating point
1670 precision floating point
1671 operations.
1672
1673 Floating point rounding
1674 mode values are defined in
1675 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1676
1677 Used by CP to set up
1678 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001679 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001680 with specified denorm mode
1681 for single (32
1682 bit) floating point
1683 precision floating point
1684 operations.
1685
1686 Floating point denorm mode
1687 values are defined in
1688 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1689
1690 Used by CP to set up
1691 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001692 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001693 with specified denorm mode
1694 for half/double (16
1695 and 64 bit) floating point
1696 precision floating point
1697 operations.
1698
1699 Floating point denorm mode
1700 values are defined in
1701 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1702
1703 Used by CP to set up
1704 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001705 20 1 bit PRIV Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001706
1707 Start executing wavefront
1708 in privilege trap handler
1709 mode.
1710
1711 CP is responsible for
1712 filling in
1713 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001714 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001715 with DX10 clamp mode
1716 enabled. Used by the vector
Tony Tye6baa6d22017-10-18 22:16:55 +00001717 ALU to force DX10 style
Tony Tyef16a45e2017-06-06 20:31:59 +00001718 treatment of NaN's (when
1719 set, clamp NaN to zero,
1720 otherwise pass NaN
1721 through).
1722
1723 Used by CP to set up
1724 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001725 22 1 bit DEBUG_MODE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001726
1727 Start executing wavefront
1728 in single step mode.
1729
1730 CP is responsible for
1731 filling in
1732 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001733 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001734 with IEEE mode
1735 enabled. Floating point
1736 opcodes that support
1737 exception flag gathering
1738 will quiet and propagate
1739 signaling-NaN inputs per
1740 IEEE 754-2008. Min_dx10 and
1741 max_dx10 become IEEE
1742 754-2008 compliant due to
1743 signaling-NaN propagation
1744 and quieting.
1745
1746 Used by CP to set up
1747 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001748 24 1 bit BULKY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001749
1750 Only one work-group allowed
1751 to execute on a compute
1752 unit.
1753
1754 CP is responsible for
1755 filling in
1756 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001757 25 1 bit CDBG_USER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001758
1759 Flag that can be used to
1760 control debugging code.
1761
1762 CP is responsible for
1763 filling in
1764 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye07d9f102017-11-10 01:00:54 +00001765 26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001766 Reserved, must be 0.
1767 GFX9
1768 Wavefront starts execution
1769 with specified fp16 overflow
1770 mode.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001771
Tony Tye6baa6d22017-10-18 22:16:55 +00001772 - If 0, fp16 overflow generates
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001773 +/-INF values.
Tony Tye6baa6d22017-10-18 22:16:55 +00001774 - If 1, fp16 overflow that is the
1775 result of an +/-INF input value
1776 or divide by 0 produces a +/-INF,
1777 otherwise clamps computed
1778 overflow to +/-MAX_FP16 as
1779 appropriate.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001780
1781 Used by CP to set up
1782 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye6baa6d22017-10-18 22:16:55 +00001783 31:27 5 bits Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001784 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001785 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001786
1787..
1788
1789 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1790 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1791
Tony Tye3b340612017-06-07 00:46:08 +00001792 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001793 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001794 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001795 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1796 _WAVE_OFFSET SGPR wave scratch offset
Tony Tyef16a45e2017-06-06 20:31:59 +00001797 system register (see
1798 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1799
1800 Used by CP to set up
1801 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001802 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tyef16a45e2017-06-06 20:31:59 +00001803 user data registers
1804 requested. This number must
1805 match the number of user
1806 data registers enabled.
1807
1808 Used by CP to set up
1809 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001810 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tyef16a45e2017-06-06 20:31:59 +00001811 TRAP instruction which
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00001812 requires a trap handler to
Tony Tyef16a45e2017-06-06 20:31:59 +00001813 be enabled.
1814
1815 CP sets
1816 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1817 if the runtime has
1818 installed a trap handler
1819 regardless of the setting
1820 of this field.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001821 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001822 system SGPR register for
1823 the work-group id in the X
1824 dimension (see
1825 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1826
1827 Used by CP to set up
1828 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001829 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001830 system SGPR register for
1831 the work-group id in the Y
1832 dimension (see
1833 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1834
1835 Used by CP to set up
1836 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001837 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001838 system SGPR register for
1839 the work-group id in the Z
1840 dimension (see
1841 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1842
1843 Used by CP to set up
1844 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001845 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001846 system SGPR register for
1847 work-group information (see
1848 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1849
1850 Used by CP to set up
1851 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001852 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001853 VGPR system registers used
1854 for the work-item ID.
1855 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1856 defines the values.
1857
1858 Used by CP to set up
1859 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001860 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001861
1862 Wavefront starts execution
1863 with address watch
1864 exceptions enabled which
1865 are generated when L1 has
1866 witnessed a thread access
1867 an *address of
1868 interest*.
1869
1870 CP is responsible for
1871 filling in the address
1872 watch bit in
1873 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1874 according to what the
1875 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001876 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001877
1878 Wavefront starts execution
1879 with memory violation
1880 exceptions exceptions
1881 enabled which are generated
1882 when a memory violation has
1883 occurred for this wave from
1884 L1 or LDS
1885 (write-to-read-only-memory,
1886 mis-aligned atomic, LDS
1887 address out of range,
1888 illegal address, etc.).
1889
1890 CP sets the memory
1891 violation bit in
1892 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1893 according to what the
1894 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001895 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001896
1897 CP uses the rounded value
1898 from the dispatch packet,
1899 not this value, as the
1900 dispatch may contain
1901 dynamically allocated group
1902 segment memory. CP writes
1903 directly to
1904 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1905
1906 Amount of group segment
1907 (LDS) to allocate for each
1908 work-group. Granularity is
1909 device specific:
1910
1911 GFX6:
1912 roundup(lds-size / (64 * 4))
1913 GFX7-GFX9:
1914 roundup(lds-size / (128 * 4))
1915
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001916 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1917 _INVALID_OPERATION with specified exceptions
Tony Tyef16a45e2017-06-06 20:31:59 +00001918 enabled.
1919
1920 Used by CP to set up
1921 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1922 (set from bits 0..6).
1923
1924 IEEE 754 FP Invalid
1925 Operation
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001926 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1927 _SOURCE input operands is a
Tony Tyef16a45e2017-06-06 20:31:59 +00001928 denormal number
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001929 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1930 _DIVISION_BY_ZERO Zero
1931 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1932 _OVERFLOW
1933 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1934 _UNDERFLOW
1935 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1936 _INEXACT
1937 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1938 _ZERO (rcp_iflag_f32 instruction
Tony Tyef16a45e2017-06-06 20:31:59 +00001939 only)
Tony Tye6baa6d22017-10-18 22:16:55 +00001940 31 1 bit Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001941 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001942 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001943
1944..
1945
1946 .. table:: Floating Point Rounding Mode Enumeration Values
1947 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1948
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001949 ====================================== ===== ==============================
1950 Enumeration Name Value Description
1951 ====================================== ===== ==============================
1952 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1953 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1954 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1955 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1956 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001957
1958..
1959
1960 .. table:: Floating Point Denorm Mode Enumeration Values
1961 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1962
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001963 ====================================== ===== ==============================
1964 Enumeration Name Value Description
1965 ====================================== ===== ==============================
1966 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1967 Denorms
1968 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1969 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1970 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1971 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001972
1973..
1974
1975 .. table:: System VGPR Work-Item ID Enumeration Values
1976 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1977
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001978 ======================================== ===== ============================
1979 Enumeration Name Value Description
1980 ======================================== ===== ============================
1981 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1982 ID.
1983 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1984 dimensions ID.
1985 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1986 dimensions ID.
1987 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1988 ======================================== ===== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001989
1990.. _amdgpu-amdhsa-initial-kernel-execution-state:
1991
1992Initial Kernel Execution State
1993~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1994
1995This section defines the register state that will be set up by the packet
1996processor prior to the start of execution of every wavefront. This is limited by
1997the constraints of the hardware controllers of CP/ADC/SPI.
1998
1999The order of the SGPR registers is defined, but the compiler can specify which
2000ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2001fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2002for enabled registers are dense starting at SGPR0: the first enabled register is
2003SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2004an SGPR number.
2005
2006The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
2007all waves of the grid. It is possible to specify more than 16 User SGPRs using
2008the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2009initialized. These are then immediately followed by the System SGPRs that are
2010set up by ADC/SPI and can have different values for each wave of the grid
2011dispatch.
2012
2013SGPR register initial state is defined in
2014:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2015
2016 .. table:: SGPR Register Set Up Order
2017 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2018
2019 ========== ========================== ====== ==============================
2020 SGPR Order Name Number Description
2021 (kernel descriptor enable of
2022 field) SGPRs
2023 ========== ========================== ====== ==============================
2024 First Private Segment Buffer 4 V# that can be used, together
2025 (enable_sgpr_private with Scratch Wave Offset as an
2026 _segment_buffer) offset, to access the private
2027 memory space using a segment
2028 address.
2029
2030 CP uses the value provided by
2031 the runtime.
2032 then Dispatch Ptr 2 64 bit address of AQL dispatch
2033 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2034 actually executing.
2035 then Queue Ptr 2 64 bit address of amd_queue_t
2036 (enable_sgpr_queue_ptr) object for AQL queue on which
2037 the dispatch packet was
2038 queued.
2039 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2040 (enable_sgpr_kernarg segment. This is directly
2041 _segment_ptr) copied from the
2042 kernarg_address in the kernel
2043 dispatch packet.
2044
2045 Having CP load it once avoids
2046 loading it at the beginning of
2047 every wavefront.
2048 then Dispatch Id 2 64 bit Dispatch ID of the
2049 (enable_sgpr_dispatch_id) dispatch packet being
2050 executed.
2051 then Flat Scratch Init 2 This is 2 SGPRs:
2052 (enable_sgpr_flat_scratch
2053 _init) GFX6
2054 Not supported.
2055 GFX7-GFX8
2056 The first SGPR is a 32 bit
2057 byte offset from
2058 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2059 to per SPI base of memory
2060 for scratch for the queue
2061 executing the kernel
2062 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00002063 from the runtime. (The
2064 Scratch Segment Buffer base
2065 address is
2066 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2067 plus this offset.) The value
2068 of Scratch Wave Offset must
2069 be added to this offset by
2070 the kernel machine code,
2071 right shifted by 8, and
2072 moved to the FLAT_SCRATCH_HI
2073 SGPR register.
2074 FLAT_SCRATCH_HI corresponds
2075 to SGPRn-4 on GFX7, and
2076 SGPRn-6 on GFX8 (where SGPRn
2077 is the highest numbered SGPR
2078 allocated to the wave).
2079 FLAT_SCRATCH_HI is
2080 multiplied by 256 (as it is
2081 in units of 256 bytes) and
2082 added to
2083 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2084 to calculate the per wave
2085 FLAT SCRATCH BASE in flat
2086 memory instructions that
2087 access the scratch
2088 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002089
2090 The second SGPR is 32 bit
2091 byte size of a single
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002092 work-item's scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002093 usage. CP obtains this from
2094 the runtime, and it is
2095 always a multiple of DWORD.
2096 CP checks that the value in
2097 the kernel dispatch packet
2098 Private Segment Byte Size is
2099 not larger, and requests the
2100 runtime to increase the
2101 queue's scratch size if
2102 necessary. The kernel code
2103 must move it to
2104 FLAT_SCRATCH_LO which is
2105 SGPRn-3 on GFX7 and SGPRn-5
2106 on GFX8. FLAT_SCRATCH_LO is
2107 used as the FLAT SCRATCH
2108 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002109 instructions. Having CP load
2110 it once avoids loading it at
2111 the beginning of every
Tony Tyef59d0712017-11-10 20:51:43 +00002112 wavefront.
2113 GFX9
2114 This is the
Tony Tye46d35762017-08-15 20:47:41 +00002115 64 bit base address of the
2116 per SPI scratch backing
2117 memory managed by SPI for
2118 the queue executing the
2119 kernel dispatch. CP obtains
2120 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002121 divides it if there are
2122 multiple Shader Arrays each
2123 with its own SPI). The value
2124 of Scratch Wave Offset must
2125 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002126 machine code and the result
2127 moved to the FLAT_SCRATCH
2128 SGPR which is SGPRn-6 and
2129 SGPRn-5. It is used as the
2130 FLAT SCRATCH BASE in flat
Tony Tyef59d0712017-11-10 20:51:43 +00002131 memory instructions.
2132 then Private Segment Size 1 The 32 bit byte size of a
2133 (enable_sgpr_private single
2134 work-item's
2135 scratch_segment_size) memory
2136 allocation. This is the
2137 value from the kernel
2138 dispatch packet Private
2139 Segment Byte Size rounded up
2140 by CP to a multiple of
2141 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002142
2143 Having CP load it once avoids
2144 loading it at the beginning of
2145 every wavefront.
2146
2147 This is not used for
2148 GFX7-GFX8 since it is the same
2149 value as the second SGPR of
2150 Flat Scratch Init. However, it
2151 may be needed for GFX9 which
2152 changes the meaning of the
2153 Flat Scratch Init value.
2154 then Grid Work-Group Count X 1 32 bit count of the number of
2155 (enable_sgpr_grid work-groups in the X dimension
2156 _workgroup_count_X) for the grid being
2157 executed. Computed from the
2158 fields in the kernel dispatch
2159 packet as ((grid_size.x +
2160 workgroup_size.x - 1) /
2161 workgroup_size.x).
2162 then Grid Work-Group Count Y 1 32 bit count of the number of
2163 (enable_sgpr_grid work-groups in the Y dimension
2164 _workgroup_count_Y && for the grid being
2165 less than 16 previous executed. Computed from the
2166 SGPRs) fields in the kernel dispatch
2167 packet as ((grid_size.y +
2168 workgroup_size.y - 1) /
2169 workgroupSize.y).
2170
2171 Only initialized if <16
2172 previous SGPRs initialized.
2173 then Grid Work-Group Count Z 1 32 bit count of the number of
2174 (enable_sgpr_grid work-groups in the Z dimension
2175 _workgroup_count_Z && for the grid being
2176 less than 16 previous executed. Computed from the
2177 SGPRs) fields in the kernel dispatch
2178 packet as ((grid_size.z +
2179 workgroup_size.z - 1) /
2180 workgroupSize.z).
2181
2182 Only initialized if <16
2183 previous SGPRs initialized.
2184 then Work-Group Id X 1 32 bit work-group id in X
2185 (enable_sgpr_workgroup_id dimension of grid for
2186 _X) wavefront.
2187 then Work-Group Id Y 1 32 bit work-group id in Y
2188 (enable_sgpr_workgroup_id dimension of grid for
2189 _Y) wavefront.
2190 then Work-Group Id Z 1 32 bit work-group id in Z
2191 (enable_sgpr_workgroup_id dimension of grid for
2192 _Z) wavefront.
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002193 then Work-Group Info 1 {first_wave, 14'b0000,
Tony Tyef16a45e2017-06-06 20:31:59 +00002194 (enable_sgpr_workgroup ordered_append_term[10:0],
2195 _info) threadgroup_size_in_waves[5:0]}
2196 then Scratch Wave Offset 1 32 bit byte offset from base
2197 (enable_sgpr_private of scratch base of queue
2198 _segment_wave_offset) executing the kernel
2199 dispatch. Must be used as an
2200 offset with Private
2201 segment address when using
2202 Scratch Segment Buffer. It
2203 must be used to set up FLAT
2204 SCRATCH for flat addressing
2205 (see
2206 :ref:`amdgpu-amdhsa-flat-scratch`).
2207 ========== ========================== ====== ==============================
2208
2209The order of the VGPR registers is defined, but the compiler can specify which
2210ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2211fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2212for enabled registers are dense starting at VGPR0: the first enabled register is
2213VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2214VGPR number.
2215
2216VGPR register initial state is defined in
2217:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2218
2219 .. table:: VGPR Register Set Up Order
2220 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2221
2222 ========== ========================== ====== ==============================
2223 VGPR Order Name Number Description
2224 (kernel descriptor enable of
2225 field) VGPRs
2226 ========== ========================== ====== ==============================
2227 First Work-Item Id X 1 32 bit work item id in X
2228 (Always initialized) dimension of work-group for
2229 wavefront lane.
2230 then Work-Item Id Y 1 32 bit work item id in Y
2231 (enable_vgpr_workitem_id dimension of work-group for
2232 > 0) wavefront lane.
2233 then Work-Item Id Z 1 32 bit work item id in Z
2234 (enable_vgpr_workitem_id dimension of work-group for
2235 > 1) wavefront lane.
2236 ========== ========================== ====== ==============================
2237
2238The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2239
22401. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2241 registers.
22422. Work-group Id registers X, Y, Z are set by ADC which supports any
2243 combination including none.
22443. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2245 cannot included with the flat scratch init value which is per queue.
22464. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2247 or (X, Y, Z).
2248
2249Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2250value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2251
2252The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye07d9f102017-11-10 01:00:54 +00002253has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tyef16a45e2017-06-06 20:31:59 +00002254instructions (GFX9).
2255
2256If buffer operations are used then the compiler can generate a V# with the
2257following properties:
2258
2259* base address of 0
2260* no swizzle
2261* ATC: 1 if IOMMU present (such as APU)
2262* ptr64: 1
2263* MTYPE set to support memory coherence that matches the runtime (such as CC for
2264 APU and NC for dGPU).
2265
2266.. _amdgpu-amdhsa-kernel-prolog:
2267
2268Kernel Prolog
2269~~~~~~~~~~~~~
2270
2271.. _amdgpu-amdhsa-m0:
2272
2273M0
2274++
2275
2276GFX6-GFX8
2277 The M0 register must be initialized with a value at least the total LDS size
2278 if the kernel may access LDS via DS or flat operations. Total LDS size is
2279 available in dispatch packet. For M0, it is also possible to use maximum
2280 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2281 GFX7-GFX8).
2282GFX9
2283 The M0 register is not used for range checking LDS accesses and so does not
2284 need to be initialized in the prolog.
2285
2286.. _amdgpu-amdhsa-flat-scratch:
2287
2288Flat Scratch
2289++++++++++++
2290
2291If the kernel may use flat operations to access scratch memory, the prolog code
2292must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2293are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2294Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2295
2296GFX6
2297 Flat scratch is not supported.
2298
Tony Tye07d9f102017-11-10 01:00:54 +00002299GFX7-GFX8
Tony Tyef16a45e2017-06-06 20:31:59 +00002300 1. The low word of Flat Scratch Init is 32 bit byte offset from
2301 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2302 being managed by SPI for the queue executing the kernel dispatch. This is
2303 the same value used in the Scratch Segment Buffer V# base address. The
2304 prolog must add the value of Scratch Wave Offset to get the wave's byte
2305 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2306 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2307 by 8 before moving into FLAT_SCRATCH_LO.
2308 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2309 work-items scratch memory usage. This is directly loaded from the kernel
2310 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2311 DWORD. Having CP load it once avoids loading it at the beginning of every
2312 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2313 SIZE.
Tony Tyef59d0712017-11-10 20:51:43 +00002314
Tony Tyef16a45e2017-06-06 20:31:59 +00002315GFX9
2316 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2317 memory being managed by SPI for the queue executing the kernel dispatch. The
2318 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2319 pair for use as the flat scratch base in flat memory instructions.
2320
2321.. _amdgpu-amdhsa-memory-model:
2322
2323Memory Model
2324~~~~~~~~~~~~
2325
2326This section describes the mapping of LLVM memory model onto AMDGPU machine code
2327(see :ref:`memmodel`). *The implementation is WIP.*
2328
2329.. TODO
2330 Update when implementation complete.
2331
Tony Tyef16a45e2017-06-06 20:31:59 +00002332The AMDGPU backend supports the memory synchronization scopes specified in
2333:ref:`amdgpu-memory-scopes`.
2334
2335The code sequences used to implement the memory model are defined in table
2336:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2337
2338The sequences specify the order of instructions that a single thread must
2339execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2340to other memory instructions executed by the same thread. This allows them to be
2341moved earlier or later which can allow them to be combined with other instances
2342of the same instruction, or hoisted/sunk out of loops to improve
2343performance. Only the instructions related to the memory model are given;
2344additional ``s_waitcnt`` instructions are required to ensure registers are
2345defined before being used. These may be able to be combined with the memory
2346model ``s_waitcnt`` instructions as described above.
2347
Tony Tye6baa6d22017-10-18 22:16:55 +00002348The AMDGPU backend supports the following memory models:
2349
2350 HSA Memory Model [HSA]_
2351 The HSA memory model uses a single happens-before relation for all address
2352 spaces (see :ref:`amdgpu-address-spaces`).
2353 OpenCL Memory Model [OpenCL]_
2354 The OpenCL memory model which has separate happens-before relations for the
2355 global and local address spaces. Only a fence specifying both global and
2356 local address space, and seq_cst instructions join the relationships. Since
2357 the LLVM ``memfence`` instruction does not allow an address space to be
2358 specified the OpenCL fence has to convervatively assume both local and
2359 global address space was specified. However, optimizations can often be
2360 done to eliminate the additional ``s_waitcnt`` instructions when there are
2361 no intervening memory instructions which access the corresponding address
2362 space. The code sequences in the table indicate what can be omitted for the
2363 OpenCL memory. The target triple environment is used to determine if the
2364 source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tyef16a45e2017-06-06 20:31:59 +00002365
2366``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2367operations.
2368
2369``buffer/global/flat_load/store/atomic`` instructions to global memory are
2370termed vector memory operations.
2371
2372For GFX6-GFX9:
2373
2374* Each agent has multiple compute units (CU).
2375* Each CU has multiple SIMDs that execute wavefronts.
2376* The wavefronts for a single work-group are executed in the same CU but may be
2377 executed by different SIMDs.
2378* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2379 executing on it.
2380* All LDS operations of a CU are performed as wavefront wide operations in a
2381 global order and involve no caching. Completion is reported to a wavefront in
2382 execution order.
2383* The LDS memory has multiple request queues shared by the SIMDs of a
2384 CU. Therefore, the LDS operations performed by different waves of a work-group
2385 can be reordered relative to each other, which can result in reordering the
2386 visibility of vector memory operations with respect to LDS operations of other
2387 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002388 ensure synchronization between LDS operations and vector memory operations
Tony Tyef16a45e2017-06-06 20:31:59 +00002389 between waves of a work-group, but not between operations performed by the
2390 same wavefront.
2391* The vector memory operations are performed as wavefront wide operations and
2392 completion is reported to a wavefront in execution order. The exception is
Tony Tye07d9f102017-11-10 01:00:54 +00002393 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tyef16a45e2017-06-06 20:31:59 +00002394 vector memory order if they access LDS memory, and out of LDS operation order
2395 if they access global memory.
Tony Tye6baa6d22017-10-18 22:16:55 +00002396* The vector memory operations access a single vector L1 cache shared by all
2397 SIMDs a CU. Therefore, no special action is required for coherence between the
2398 lanes of a single wavefront, or for coherence between wavefronts in the same
2399 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2400 executing in different work-groups as they may be executing on different CUs.
Tony Tyef16a45e2017-06-06 20:31:59 +00002401* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2402 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2403 scalar operations are used in a restricted way so do not impact the memory
2404 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2405* The vector and scalar memory operations use an L2 cache shared by all CUs on
2406 the same agent.
2407* The L2 cache has independent channels to service disjoint ranges of virtual
2408 addresses.
2409* Each CU has a separate request queue per channel. Therefore, the vector and
2410 scalar memory operations performed by waves executing in different work-groups
2411 (which may be executing on different CUs) of an agent can be reordered
2412 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002413 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002414 previous vector memory operation has completed before executing a subsequent
2415 vector memory or LDS operation and so can be used to meet the requirements of
2416 acquire and release.
2417* The L2 cache can be kept coherent with other agents on some targets, or ranges
2418 of virtual addresses can be set up to bypass it to ensure system coherence.
2419
Tony Tye07d9f102017-11-10 01:00:54 +00002420Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tyef16a45e2017-06-06 20:31:59 +00002421or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2422memory, atomic memory orderings are not meaningful and all accesses are treated
2423as non-atomic.
2424
2425Constant address space uses ``buffer/global_load`` instructions (or equivalent
2426scalar memory instructions). Since the constant address space contents do not
2427change during the execution of a kernel dispatch it is not legal to perform
2428stores, and atomic memory orderings are not meaningful and all access are
2429treated as non-atomic.
2430
2431A memory synchronization scope wider than work-group is not meaningful for the
2432group (LDS) address space and is treated as work-group.
2433
2434The memory model does not support the region address space which is treated as
2435non-atomic.
2436
2437Acquire memory ordering is not meaningful on store atomic instructions and is
2438treated as non-atomic.
2439
2440Release memory ordering is not meaningful on load atomic instructions and is
2441treated a non-atomic.
2442
2443Acquire-release memory ordering is not meaningful on load or store atomic
2444instructions and is treated as acquire and release respectively.
2445
2446AMDGPU backend only uses scalar memory operations to access memory that is
2447proven to not change during the execution of the kernel dispatch. This includes
2448constant address space and global address space for program scope const
2449variables. Therefore the kernel machine code does not have to maintain the
2450scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2451and vector L1 caches are invalidated between kernel dispatches by CP since
2452constant address space data may change between kernel dispatch executions. See
2453:ref:`amdgpu-amdhsa-memory-spaces`.
2454
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002455The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002456case the AMDGPU backend ensures the memory location used to spill is never
2457accessed by vector memory operations at the same time. If scalar writes are used
2458then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2459return since the locations may be used for vector memory instructions by a
2460future wave that uses the same scratch area, or a function call that creates a
2461frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2462as all scalar writes are write-before-read in the same thread.
2463
Tony Tye6baa6d22017-10-18 22:16:55 +00002464Scratch backing memory (which is used for the private address space)
2465is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2466address space is only accessed by a single thread, and is always
2467write-before-read, there is never a need to invalidate these entries from the L1
2468cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2469volatile cache lines.
Tony Tyef16a45e2017-06-06 20:31:59 +00002470
2471On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye6baa6d22017-10-18 22:16:55 +00002472to invalidate the L2 cache. This also causes it to be treated as
2473non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2474(cache coherent) and so the L2 cache will coherent with the CPU and other
2475agents.
Tony Tyef16a45e2017-06-06 20:31:59 +00002476
2477 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2478 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2479
Tony Tye6baa6d22017-10-18 22:16:55 +00002480 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002481 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2482 Ordering Sync Scope Address
2483 Space
Tony Tye6baa6d22017-10-18 22:16:55 +00002484 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002485 **Non-Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002486 -----------------------------------------------------------------------------------
2487 load *none* *none* - global - !volatile & !nontemporal
2488 - generic
2489 - private 1. buffer/global/flat_load
2490 - constant
2491 - volatile & !nontemporal
2492
Tony Tyef16a45e2017-06-06 20:31:59 +00002493 1. buffer/global/flat_load
2494 glc=1
Tony Tye6baa6d22017-10-18 22:16:55 +00002495
2496 - nontemporal
2497
2498 1. buffer/global/flat_load
2499 glc=1 slc=1
2500
Tony Tyef16a45e2017-06-06 20:31:59 +00002501 load *none* *none* - local 1. ds_load
Tony Tye6baa6d22017-10-18 22:16:55 +00002502 store *none* *none* - global - !nontemporal
Tony Tyef16a45e2017-06-06 20:31:59 +00002503 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002504 - private 1. buffer/global/flat_store
2505 - constant
2506 - nontemporal
2507
2508 1. buffer/global/flat_stote
2509 glc=1 slc=1
2510
Tony Tyef16a45e2017-06-06 20:31:59 +00002511 store *none* *none* - local 1. ds_store
2512 **Unordered Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002513 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002514 load atomic unordered *any* *any* *Same as non-atomic*.
2515 store atomic unordered *any* *any* *Same as non-atomic*.
2516 atomicrmw unordered *any* *any* *Same as monotonic
2517 atomic*.
2518 **Monotonic Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002519 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002520 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2521 - wavefront - generic
2522 - workgroup
2523 load atomic monotonic - singlethread - local 1. ds_load
2524 - wavefront
2525 - workgroup
2526 load atomic monotonic - agent - global 1. buffer/global/flat_load
2527 - system - generic glc=1
2528 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2529 - wavefront - generic
2530 - workgroup
2531 - agent
2532 - system
2533 store atomic monotonic - singlethread - local 1. ds_store
2534 - wavefront
2535 - workgroup
2536 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2537 - wavefront - generic
2538 - workgroup
2539 - agent
2540 - system
2541 atomicrmw monotonic - singlethread - local 1. ds_atomic
2542 - wavefront
2543 - workgroup
2544 **Acquire Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002545 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002546 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2547 - wavefront - local
2548 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002549 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2550 load atomic acquire - workgroup - local 1. ds_load
2551 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002552
Tony Tye6baa6d22017-10-18 22:16:55 +00002553 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002554 - Must happen before
2555 any following
2556 global/generic
2557 load/load
2558 atomic/store/store
2559 atomic/atomicrmw.
2560 - Ensures any
2561 following global
2562 data read is no
2563 older than the load
2564 atomic value being
2565 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00002566 load atomic acquire - workgroup - generic 1. flat_load
2567 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002568
Tony Tye6baa6d22017-10-18 22:16:55 +00002569 - If OpenCL, omit.
2570 - Must happen before
2571 any following
2572 global/generic
2573 load/load
2574 atomic/store/store
2575 atomic/atomicrmw.
2576 - Ensures any
2577 following global
2578 data read is no
2579 older than the load
2580 atomic value being
2581 acquired.
2582 load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tyef16a45e2017-06-06 20:31:59 +00002583 - system glc=1
2584 2. s_waitcnt vmcnt(0)
2585
2586 - Must happen before
2587 following
2588 buffer_wbinvl1_vol.
2589 - Ensures the load
2590 has completed
2591 before invalidating
2592 the cache.
2593
2594 3. buffer_wbinvl1_vol
2595
2596 - Must happen before
2597 any following
2598 global/generic
2599 load/load
2600 atomic/atomicrmw.
2601 - Ensures that
2602 following
2603 loads will not see
2604 stale global data.
2605
2606 load atomic acquire - agent - generic 1. flat_load glc=1
2607 - system 2. s_waitcnt vmcnt(0) &
2608 lgkmcnt(0)
2609
2610 - If OpenCL omit
2611 lgkmcnt(0).
2612 - Must happen before
2613 following
2614 buffer_wbinvl1_vol.
2615 - Ensures the flat_load
2616 has completed
2617 before invalidating
2618 the cache.
2619
2620 3. buffer_wbinvl1_vol
2621
2622 - Must happen before
2623 any following
2624 global/generic
2625 load/load
2626 atomic/atomicrmw.
2627 - Ensures that
2628 following loads
2629 will not see stale
2630 global data.
2631
2632 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2633 - wavefront - local
2634 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002635 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2636 atomicrmw acquire - workgroup - local 1. ds_atomic
2637 2. waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002638
Tony Tye6baa6d22017-10-18 22:16:55 +00002639 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002640 - Must happen before
2641 any following
2642 global/generic
2643 load/load
2644 atomic/store/store
2645 atomic/atomicrmw.
2646 - Ensures any
2647 following global
2648 data read is no
2649 older than the
2650 atomicrmw value
2651 being acquired.
2652
Tony Tye6baa6d22017-10-18 22:16:55 +00002653 atomicrmw acquire - workgroup - generic 1. flat_atomic
2654 2. waitcnt lgkmcnt(0)
2655
2656 - If OpenCL, omit.
2657 - Must happen before
2658 any following
2659 global/generic
2660 load/load
2661 atomic/store/store
2662 atomic/atomicrmw.
2663 - Ensures any
2664 following global
2665 data read is no
2666 older than the
2667 atomicrmw value
2668 being acquired.
2669
2670 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002671 - system 2. s_waitcnt vmcnt(0)
2672
2673 - Must happen before
2674 following
2675 buffer_wbinvl1_vol.
2676 - Ensures the
2677 atomicrmw has
2678 completed before
2679 invalidating the
2680 cache.
2681
2682 3. buffer_wbinvl1_vol
2683
2684 - Must happen before
2685 any following
2686 global/generic
2687 load/load
2688 atomic/atomicrmw.
2689 - Ensures that
2690 following loads
2691 will not see stale
2692 global data.
2693
2694 atomicrmw acquire - agent - generic 1. flat_atomic
2695 - system 2. s_waitcnt vmcnt(0) &
2696 lgkmcnt(0)
2697
2698 - If OpenCL, omit
2699 lgkmcnt(0).
2700 - Must happen before
2701 following
2702 buffer_wbinvl1_vol.
2703 - Ensures the
2704 atomicrmw has
2705 completed before
2706 invalidating the
2707 cache.
2708
2709 3. buffer_wbinvl1_vol
2710
2711 - Must happen before
2712 any following
2713 global/generic
2714 load/load
2715 atomic/atomicrmw.
2716 - Ensures that
2717 following loads
2718 will not see stale
2719 global data.
2720
2721 fence acquire - singlethread *none* *none*
2722 - wavefront
2723 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2724
2725 - If OpenCL and
2726 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002727 not generic, omit.
2728 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002729 currently has no
2730 address space on
2731 the fence need to
2732 conservatively
2733 always generate. If
2734 fence had an
2735 address space then
2736 set to address
2737 space of OpenCL
2738 fence flag, or to
2739 generic if both
2740 local and global
2741 flags are
2742 specified.
2743 - Must happen after
2744 any preceding
2745 local/generic load
2746 atomic/atomicrmw
2747 with an equal or
2748 wider sync scope
2749 and memory ordering
2750 stronger than
2751 unordered (this is
2752 termed the
2753 fence-paired-atomic).
2754 - Must happen before
2755 any following
2756 global/generic
2757 load/load
2758 atomic/store/store
2759 atomic/atomicrmw.
2760 - Ensures any
2761 following global
2762 data read is no
2763 older than the
2764 value read by the
2765 fence-paired-atomic.
2766
Tony Tye6baa6d22017-10-18 22:16:55 +00002767 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2768 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002769
2770 - If OpenCL and
2771 address space is
2772 not generic, omit
2773 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00002774 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002775 currently has no
2776 address space on
2777 the fence need to
2778 conservatively
2779 always generate
2780 (see comment for
2781 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002782 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002783 separate s_waitcnt
2784 vmcnt(0) and
2785 s_waitcnt
2786 lgkmcnt(0) to allow
2787 them to be
2788 independently moved
2789 according to the
2790 following rules.
2791 - s_waitcnt vmcnt(0)
2792 must happen after
2793 any preceding
2794 global/generic load
2795 atomic/atomicrmw
2796 with an equal or
2797 wider sync scope
2798 and memory ordering
2799 stronger than
2800 unordered (this is
2801 termed the
2802 fence-paired-atomic).
2803 - s_waitcnt lgkmcnt(0)
2804 must happen after
2805 any preceding
Tony Tye6baa6d22017-10-18 22:16:55 +00002806 local/generic load
Tony Tyef16a45e2017-06-06 20:31:59 +00002807 atomic/atomicrmw
2808 with an equal or
2809 wider sync scope
2810 and memory ordering
2811 stronger than
2812 unordered (this is
2813 termed the
2814 fence-paired-atomic).
2815 - Must happen before
2816 the following
2817 buffer_wbinvl1_vol.
2818 - Ensures that the
2819 fence-paired atomic
2820 has completed
2821 before invalidating
2822 the
2823 cache. Therefore
2824 any following
2825 locations read must
2826 be no older than
2827 the value read by
2828 the
2829 fence-paired-atomic.
2830
2831 2. buffer_wbinvl1_vol
2832
Tony Tye6baa6d22017-10-18 22:16:55 +00002833 - Must happen before any
2834 following global/generic
Tony Tyef16a45e2017-06-06 20:31:59 +00002835 load/load
2836 atomic/store/store
2837 atomic/atomicrmw.
2838 - Ensures that
2839 following loads
2840 will not see stale
2841 global data.
2842
2843 **Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002844 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002845 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2846 - wavefront - local
2847 - generic
2848 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002849
2850 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002851 - Must happen after
2852 any preceding
2853 local/generic
2854 load/store/load
2855 atomic/store
2856 atomic/atomicrmw.
2857 - Must happen before
2858 the following
2859 store.
2860 - Ensures that all
2861 memory operations
2862 to local have
2863 completed before
2864 performing the
2865 store that is being
2866 released.
2867
2868 2. buffer/global/flat_store
2869 store atomic release - workgroup - local 1. ds_store
Tony Tye6baa6d22017-10-18 22:16:55 +00002870 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2871
2872 - If OpenCL, omit.
2873 - Must happen after
2874 any preceding
2875 local/generic
2876 load/store/load
2877 atomic/store
2878 atomic/atomicrmw.
2879 - Must happen before
2880 the following
2881 store.
2882 - Ensures that all
2883 memory operations
2884 to local have
2885 completed before
2886 performing the
2887 store that is being
2888 released.
2889
2890 2. flat_store
2891 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2892 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002893
2894 - If OpenCL, omit
2895 lgkmcnt(0).
2896 - Could be split into
2897 separate s_waitcnt
2898 vmcnt(0) and
2899 s_waitcnt
2900 lgkmcnt(0) to allow
2901 them to be
2902 independently moved
2903 according to the
2904 following rules.
2905 - s_waitcnt vmcnt(0)
2906 must happen after
2907 any preceding
2908 global/generic
2909 load/store/load
2910 atomic/store
2911 atomic/atomicrmw.
2912 - s_waitcnt lgkmcnt(0)
2913 must happen after
2914 any preceding
2915 local/generic
2916 load/store/load
2917 atomic/store
2918 atomic/atomicrmw.
2919 - Must happen before
2920 the following
2921 store.
2922 - Ensures that all
2923 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00002924 to memory have
Tony Tyef16a45e2017-06-06 20:31:59 +00002925 completed before
2926 performing the
2927 store that is being
2928 released.
2929
2930 2. buffer/global/ds/flat_store
2931 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2932 - wavefront - local
2933 - generic
2934 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002935
2936 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002937 - Must happen after
2938 any preceding
2939 local/generic
2940 load/store/load
2941 atomic/store
2942 atomic/atomicrmw.
2943 - Must happen before
2944 the following
2945 atomicrmw.
2946 - Ensures that all
2947 memory operations
2948 to local have
2949 completed before
2950 performing the
2951 atomicrmw that is
2952 being released.
2953
2954 2. buffer/global/flat_atomic
2955 atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye6baa6d22017-10-18 22:16:55 +00002956 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2957
2958 - If OpenCL, omit.
2959 - Must happen after
2960 any preceding
2961 local/generic
2962 load/store/load
2963 atomic/store
2964 atomic/atomicrmw.
2965 - Must happen before
2966 the following
2967 atomicrmw.
2968 - Ensures that all
2969 memory operations
2970 to local have
2971 completed before
2972 performing the
2973 atomicrmw that is
2974 being released.
2975
2976 2. flat_atomic
2977 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
2978 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002979
2980 - If OpenCL, omit
2981 lgkmcnt(0).
2982 - Could be split into
2983 separate s_waitcnt
2984 vmcnt(0) and
2985 s_waitcnt
2986 lgkmcnt(0) to allow
2987 them to be
2988 independently moved
2989 according to the
2990 following rules.
2991 - s_waitcnt vmcnt(0)
2992 must happen after
2993 any preceding
2994 global/generic
2995 load/store/load
2996 atomic/store
2997 atomic/atomicrmw.
2998 - s_waitcnt lgkmcnt(0)
2999 must happen after
3000 any preceding
3001 local/generic
3002 load/store/load
3003 atomic/store
3004 atomic/atomicrmw.
3005 - Must happen before
3006 the following
3007 atomicrmw.
3008 - Ensures that all
3009 memory operations
3010 to global and local
3011 have completed
3012 before performing
3013 the atomicrmw that
3014 is being released.
3015
Tony Tye6baa6d22017-10-18 22:16:55 +00003016 2. buffer/global/ds/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003017 fence release - singlethread *none* *none*
3018 - wavefront
3019 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3020
3021 - If OpenCL and
3022 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003023 not generic, omit.
3024 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003025 currently has no
3026 address space on
3027 the fence need to
3028 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003029 always generate. If
3030 fence had an
3031 address space then
3032 set to address
3033 space of OpenCL
3034 fence flag, or to
3035 generic if both
3036 local and global
3037 flags are
3038 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003039 - Must happen after
3040 any preceding
3041 local/generic
3042 load/load
3043 atomic/store/store
3044 atomic/atomicrmw.
3045 - Must happen before
3046 any following store
3047 atomic/atomicrmw
3048 with an equal or
3049 wider sync scope
3050 and memory ordering
3051 stronger than
3052 unordered (this is
3053 termed the
3054 fence-paired-atomic).
3055 - Ensures that all
3056 memory operations
3057 to local have
3058 completed before
3059 performing the
3060 following
3061 fence-paired-atomic.
3062
Tony Tye6baa6d22017-10-18 22:16:55 +00003063 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3064 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003065
3066 - If OpenCL and
3067 address space is
3068 not generic, omit
3069 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003070 - If OpenCL and
3071 address space is
3072 local, omit
3073 vmcnt(0).
3074 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003075 currently has no
3076 address space on
3077 the fence need to
3078 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003079 always generate. If
3080 fence had an
3081 address space then
3082 set to address
3083 space of OpenCL
3084 fence flag, or to
3085 generic if both
3086 local and global
3087 flags are
3088 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003089 - Could be split into
3090 separate s_waitcnt
3091 vmcnt(0) and
3092 s_waitcnt
3093 lgkmcnt(0) to allow
3094 them to be
3095 independently moved
3096 according to the
3097 following rules.
3098 - s_waitcnt vmcnt(0)
3099 must happen after
3100 any preceding
3101 global/generic
3102 load/store/load
3103 atomic/store
3104 atomic/atomicrmw.
3105 - s_waitcnt lgkmcnt(0)
3106 must happen after
3107 any preceding
3108 local/generic
3109 load/store/load
3110 atomic/store
3111 atomic/atomicrmw.
3112 - Must happen before
3113 any following store
3114 atomic/atomicrmw
3115 with an equal or
3116 wider sync scope
3117 and memory ordering
3118 stronger than
3119 unordered (this is
3120 termed the
3121 fence-paired-atomic).
3122 - Ensures that all
3123 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00003124 have
Tony Tyef16a45e2017-06-06 20:31:59 +00003125 completed before
3126 performing the
3127 following
3128 fence-paired-atomic.
3129
3130 **Acquire-Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003131 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003132 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3133 - wavefront - local
3134 - generic
3135 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3136
Tony Tye6baa6d22017-10-18 22:16:55 +00003137 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003138 - Must happen after
3139 any preceding
3140 local/generic
3141 load/store/load
3142 atomic/store
3143 atomic/atomicrmw.
3144 - Must happen before
3145 the following
3146 atomicrmw.
3147 - Ensures that all
3148 memory operations
3149 to local have
3150 completed before
3151 performing the
3152 atomicrmw that is
3153 being released.
3154
Tony Tye6baa6d22017-10-18 22:16:55 +00003155 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003156 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3157 2. s_waitcnt lgkmcnt(0)
3158
Tony Tye6baa6d22017-10-18 22:16:55 +00003159 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003160 - Must happen before
3161 any following
3162 global/generic
3163 load/load
3164 atomic/store/store
3165 atomic/atomicrmw.
3166 - Ensures any
3167 following global
3168 data read is no
3169 older than the load
3170 atomic value being
3171 acquired.
3172
3173 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3174
Tony Tye6baa6d22017-10-18 22:16:55 +00003175 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003176 - Must happen after
3177 any preceding
3178 local/generic
3179 load/store/load
3180 atomic/store
3181 atomic/atomicrmw.
3182 - Must happen before
3183 the following
3184 atomicrmw.
3185 - Ensures that all
3186 memory operations
3187 to local have
3188 completed before
3189 performing the
3190 atomicrmw that is
3191 being released.
3192
3193 2. flat_atomic
3194 3. s_waitcnt lgkmcnt(0)
3195
Tony Tye6baa6d22017-10-18 22:16:55 +00003196 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003197 - Must happen before
3198 any following
3199 global/generic
3200 load/load
3201 atomic/store/store
3202 atomic/atomicrmw.
3203 - Ensures any
3204 following global
3205 data read is no
3206 older than the load
3207 atomic value being
3208 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00003209
3210 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3211 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003212
3213 - If OpenCL, omit
3214 lgkmcnt(0).
3215 - Could be split into
3216 separate s_waitcnt
3217 vmcnt(0) and
3218 s_waitcnt
3219 lgkmcnt(0) to allow
3220 them to be
3221 independently moved
3222 according to the
3223 following rules.
3224 - s_waitcnt vmcnt(0)
3225 must happen after
3226 any preceding
3227 global/generic
3228 load/store/load
3229 atomic/store
3230 atomic/atomicrmw.
3231 - s_waitcnt lgkmcnt(0)
3232 must happen after
3233 any preceding
3234 local/generic
3235 load/store/load
3236 atomic/store
3237 atomic/atomicrmw.
3238 - Must happen before
3239 the following
3240 atomicrmw.
3241 - Ensures that all
3242 memory operations
3243 to global have
3244 completed before
3245 performing the
3246 atomicrmw that is
3247 being released.
3248
Tony Tye6baa6d22017-10-18 22:16:55 +00003249 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003250 3. s_waitcnt vmcnt(0)
3251
3252 - Must happen before
3253 following
3254 buffer_wbinvl1_vol.
3255 - Ensures the
3256 atomicrmw has
3257 completed before
3258 invalidating the
3259 cache.
3260
3261 4. buffer_wbinvl1_vol
3262
3263 - Must happen before
3264 any following
3265 global/generic
3266 load/load
3267 atomic/atomicrmw.
3268 - Ensures that
3269 following loads
3270 will not see stale
3271 global data.
3272
Tony Tye6baa6d22017-10-18 22:16:55 +00003273 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3274 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003275
3276 - If OpenCL, omit
3277 lgkmcnt(0).
3278 - Could be split into
3279 separate s_waitcnt
3280 vmcnt(0) and
3281 s_waitcnt
3282 lgkmcnt(0) to allow
3283 them to be
3284 independently moved
3285 according to the
3286 following rules.
3287 - s_waitcnt vmcnt(0)
3288 must happen after
3289 any preceding
3290 global/generic
3291 load/store/load
3292 atomic/store
3293 atomic/atomicrmw.
3294 - s_waitcnt lgkmcnt(0)
3295 must happen after
3296 any preceding
3297 local/generic
3298 load/store/load
3299 atomic/store
3300 atomic/atomicrmw.
3301 - Must happen before
3302 the following
3303 atomicrmw.
3304 - Ensures that all
3305 memory operations
3306 to global have
3307 completed before
3308 performing the
3309 atomicrmw that is
3310 being released.
3311
3312 2. flat_atomic
3313 3. s_waitcnt vmcnt(0) &
3314 lgkmcnt(0)
3315
3316 - If OpenCL, omit
3317 lgkmcnt(0).
3318 - Must happen before
3319 following
3320 buffer_wbinvl1_vol.
3321 - Ensures the
3322 atomicrmw has
3323 completed before
3324 invalidating the
3325 cache.
3326
3327 4. buffer_wbinvl1_vol
3328
3329 - Must happen before
3330 any following
3331 global/generic
3332 load/load
3333 atomic/atomicrmw.
3334 - Ensures that
3335 following loads
3336 will not see stale
3337 global data.
3338
3339 fence acq_rel - singlethread *none* *none*
3340 - wavefront
3341 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3342
3343 - If OpenCL and
3344 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003345 not generic, omit.
3346 - However,
Tony Tyef16a45e2017-06-06 20:31:59 +00003347 since LLVM
3348 currently has no
3349 address space on
3350 the fence need to
3351 conservatively
3352 always generate
3353 (see comment for
3354 previous fence).
3355 - Must happen after
3356 any preceding
3357 local/generic
3358 load/load
3359 atomic/store/store
3360 atomic/atomicrmw.
3361 - Must happen before
3362 any following
3363 global/generic
3364 load/load
3365 atomic/store/store
3366 atomic/atomicrmw.
3367 - Ensures that all
3368 memory operations
3369 to local have
3370 completed before
3371 performing any
3372 following global
3373 memory operations.
3374 - Ensures that the
3375 preceding
3376 local/generic load
3377 atomic/atomicrmw
3378 with an equal or
3379 wider sync scope
3380 and memory ordering
3381 stronger than
3382 unordered (this is
3383 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003384 acquire-fence-paired-atomic
3385 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003386 before following
3387 global memory
3388 operations. This
3389 satisfies the
3390 requirements of
3391 acquire.
3392 - Ensures that all
3393 previous memory
3394 operations have
3395 completed before a
3396 following
3397 local/generic store
3398 atomic/atomicrmw
3399 with an equal or
3400 wider sync scope
3401 and memory ordering
3402 stronger than
3403 unordered (this is
3404 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003405 release-fence-paired-atomic
3406 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003407 requirements of
3408 release.
3409
Tony Tye6baa6d22017-10-18 22:16:55 +00003410 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3411 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003412
3413 - If OpenCL and
3414 address space is
3415 not generic, omit
3416 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003417 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003418 currently has no
3419 address space on
3420 the fence need to
3421 conservatively
3422 always generate
3423 (see comment for
3424 previous fence).
3425 - Could be split into
3426 separate s_waitcnt
3427 vmcnt(0) and
3428 s_waitcnt
3429 lgkmcnt(0) to allow
3430 them to be
3431 independently moved
3432 according to the
3433 following rules.
3434 - s_waitcnt vmcnt(0)
3435 must happen after
3436 any preceding
3437 global/generic
3438 load/store/load
3439 atomic/store
3440 atomic/atomicrmw.
3441 - s_waitcnt lgkmcnt(0)
3442 must happen after
3443 any preceding
3444 local/generic
3445 load/store/load
3446 atomic/store
3447 atomic/atomicrmw.
3448 - Must happen before
3449 the following
3450 buffer_wbinvl1_vol.
3451 - Ensures that the
3452 preceding
3453 global/local/generic
3454 load
3455 atomic/atomicrmw
3456 with an equal or
3457 wider sync scope
3458 and memory ordering
3459 stronger than
3460 unordered (this is
3461 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003462 acquire-fence-paired-atomic
3463 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003464 before invalidating
3465 the cache. This
3466 satisfies the
3467 requirements of
3468 acquire.
3469 - Ensures that all
3470 previous memory
3471 operations have
3472 completed before a
3473 following
3474 global/local/generic
3475 store
3476 atomic/atomicrmw
3477 with an equal or
3478 wider sync scope
3479 and memory ordering
3480 stronger than
3481 unordered (this is
3482 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003483 release-fence-paired-atomic
3484 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003485 requirements of
3486 release.
3487
3488 2. buffer_wbinvl1_vol
3489
3490 - Must happen before
3491 any following
3492 global/generic
3493 load/load
3494 atomic/store/store
3495 atomic/atomicrmw.
3496 - Ensures that
3497 following loads
3498 will not see stale
3499 global data. This
3500 satisfies the
3501 requirements of
3502 acquire.
3503
3504 **Sequential Consistent Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003505 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003506 load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003507 - wavefront - local load atomic acquire,
3508 - generic except must generated
3509 all instructions even
3510 for OpenCL.*
3511 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3512 - generic
3513 - Must
3514 happen after
3515 preceding
3516 global/generic load
3517 atomic/store
3518 atomic/atomicrmw
3519 with memory
3520 ordering of seq_cst
3521 and with equal or
3522 wider sync scope.
3523 (Note that seq_cst
3524 fences have their
3525 own s_waitcnt
3526 lgkmcnt(0) and so do
3527 not need to be
3528 considered.)
3529 - Ensures any
3530 preceding
3531 sequential
3532 consistent local
3533 memory instructions
3534 have completed
3535 before executing
3536 this sequentially
3537 consistent
3538 instruction. This
3539 prevents reordering
3540 a seq_cst store
3541 followed by a
3542 seq_cst load. (Note
3543 that seq_cst is
3544 stronger than
3545 acquire/release as
3546 the reordering of
3547 load acquire
3548 followed by a store
3549 release is
3550 prevented by the
3551 waitcnt of
3552 the release, but
3553 there is nothing
3554 preventing a store
3555 release followed by
3556 load acquire from
3557 competing out of
3558 order.)
3559
3560 2. *Following
3561 instructions same as
3562 corresponding load
3563 atomic acquire,
3564 except must generated
3565 all instructions even
3566 for OpenCL.*
3567 load atomic seq_cst - workgroup - local *Same as corresponding
3568 load atomic acquire,
3569 except must generated
3570 all instructions even
3571 for OpenCL.*
3572 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3573 - system - generic vmcnt(0)
3574
3575 - Could be split into
3576 separate s_waitcnt
3577 vmcnt(0)
3578 and s_waitcnt
3579 lgkmcnt(0) to allow
3580 them to be
3581 independently moved
3582 according to the
3583 following rules.
3584 - waitcnt lgkmcnt(0)
3585 must happen after
3586 preceding
3587 global/generic load
3588 atomic/store
3589 atomic/atomicrmw
3590 with memory
3591 ordering of seq_cst
3592 and with equal or
3593 wider sync scope.
3594 (Note that seq_cst
3595 fences have their
3596 own s_waitcnt
3597 lgkmcnt(0) and so do
3598 not need to be
3599 considered.)
3600 - waitcnt vmcnt(0)
3601 must happen after
Tony Tyef16a45e2017-06-06 20:31:59 +00003602 preceding
3603 global/generic load
3604 atomic/store
3605 atomic/atomicrmw
3606 with memory
3607 ordering of seq_cst
3608 and with equal or
3609 wider sync scope.
3610 (Note that seq_cst
3611 fences have their
3612 own s_waitcnt
3613 vmcnt(0) and so do
3614 not need to be
3615 considered.)
3616 - Ensures any
3617 preceding
3618 sequential
3619 consistent global
3620 memory instructions
3621 have completed
3622 before executing
3623 this sequentially
3624 consistent
3625 instruction. This
3626 prevents reordering
3627 a seq_cst store
3628 followed by a
Tony Tye6baa6d22017-10-18 22:16:55 +00003629 seq_cst load. (Note
Tony Tyef16a45e2017-06-06 20:31:59 +00003630 that seq_cst is
3631 stronger than
3632 acquire/release as
3633 the reordering of
3634 load acquire
3635 followed by a store
3636 release is
3637 prevented by the
Tony Tye6baa6d22017-10-18 22:16:55 +00003638 waitcnt of
Tony Tyef16a45e2017-06-06 20:31:59 +00003639 the release, but
3640 there is nothing
3641 preventing a store
3642 release followed by
3643 load acquire from
3644 competing out of
3645 order.)
3646
3647 2. *Following
3648 instructions same as
3649 corresponding load
Tony Tye6baa6d22017-10-18 22:16:55 +00003650 atomic acquire,
3651 except must generated
3652 all instructions even
3653 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003654 store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003655 - wavefront - local store atomic release,
3656 - workgroup - generic except must generated
3657 all instructions even
3658 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003659 store atomic seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003660 - system - generic store atomic release,
3661 except must generated
3662 all instructions even
3663 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003664 atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003665 - wavefront - local atomicrmw acq_rel,
3666 - workgroup - generic except must generated
3667 all instructions even
3668 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003669 atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003670 - system - generic atomicrmw acq_rel,
3671 except must generated
3672 all instructions even
3673 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003674 fence seq_cst - singlethread *none* *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003675 - wavefront fence acq_rel,
3676 - workgroup except must generated
3677 - agent all instructions even
3678 - system for OpenCL.*
3679 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00003680
3681The memory order also adds the single thread optimization constrains defined in
3682table
3683:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3684
3685 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3686 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3687
3688 ============ ==============================================================
3689 LLVM Memory Optimization Constraints
3690 Ordering
3691 ============ ==============================================================
3692 unordered *none*
3693 monotonic *none*
3694 acquire - If a load atomic/atomicrmw then no following load/load
3695 atomic/store/ store atomic/atomicrmw/fence instruction can
3696 be moved before the acquire.
3697 - If a fence then same as load atomic, plus no preceding
3698 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003699 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003700 atomic/store/ store atomic/atomicrmw/fence instruction can
3701 be moved after the release.
3702 - If a fence then same as store atomic, plus no following
3703 associated fence-paired-atomic can be moved before the
3704 fence.
3705 acq_rel Same constraints as both acquire and release.
3706 seq_cst - If a load atomic then same constraints as acquire, plus no
3707 preceding sequentially consistent load atomic/store
3708 atomic/atomicrmw/fence instruction can be moved after the
3709 seq_cst.
3710 - If a store atomic then the same constraints as release, plus
3711 no following sequentially consistent load atomic/store
3712 atomic/atomicrmw/fence instruction can be moved before the
3713 seq_cst.
3714 - If an atomicrmw/fence then same constraints as acq_rel.
3715 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003716
Wei Ding16289cf2017-02-21 18:48:01 +00003717Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003718~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003719
Tony Tyef16a45e2017-06-06 20:31:59 +00003720For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3721(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3722the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003723
Tony Tyef16a45e2017-06-06 20:31:59 +00003724 .. table:: AMDGPU Trap Handler for AMDHSA OS
3725 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003726
Tony Tyef16a45e2017-06-06 20:31:59 +00003727 =================== =============== =============== =======================
3728 Usage Code Sequence Trap Handler Description
3729 Inputs
3730 =================== =============== =============== =======================
3731 reserved ``s_trap 0x00`` Reserved by hardware.
3732 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3733 ``queue_ptr`` ``debugtrap``
3734 ``VGPR0``: intrinsic (not
3735 ``arg`` implemented).
3736 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3737 ``queue_ptr`` terminated and its
3738 associated queue put
3739 into the error state.
3740 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3741 ``queue_ptr`` installed handled
3742 same as ``llvm.trap``.
3743 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3744 breakpoints.
3745 debugger ``s_trap 0x08`` Reserved for debugger.
3746 debugger ``s_trap 0xfe`` Reserved for debugger.
3747 debugger ``s_trap 0xff`` Reserved for debugger.
3748 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003749
Tony Tye46d35762017-08-15 20:47:41 +00003750Unspecified OS
3751--------------
3752
3753This section provides code conventions used when the target triple OS is
3754empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003755
3756Trap Handler ABI
3757~~~~~~~~~~~~~~~~
3758
3759For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3760not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3761instructions are handled as follows:
3762
3763 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3764 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3765
3766 =============== =============== ===========================================
3767 Usage Code Sequence Description
3768 =============== =============== ===========================================
3769 llvm.trap s_endpgm Causes wavefront to be terminated.
3770 llvm.debugtrap *none* Compiler warning given that there is no
3771 trap handler installed.
3772 =============== =============== ===========================================
3773
3774Source Languages
3775================
3776
3777.. _amdgpu-opencl:
3778
3779OpenCL
3780------
3781
3782When generating code for the OpenCL language the target triple environment
3783should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3784
3785When the language is OpenCL the following differences occur:
3786
37871. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
37882. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye46d35762017-08-15 20:47:41 +000037893. Additional metadata is generated
3790 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003791
3792.. TODO
3793 Specify what affect this has. Hidden arguments added. Additional metadata
3794 generated.
3795
3796.. _amdgpu-hcc:
3797
3798HCC
3799---
3800
3801When generating code for the OpenCL language the target triple environment
3802should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3803
3804When the language is OpenCL the following differences occur:
3805
38061. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3807
3808.. TODO
3809 Specify what affect this has.
Tom Stellard3ec09e62016-04-06 01:29:19 +00003810
Tom Stellard45bb48e2015-06-13 03:28:10 +00003811Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00003812---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00003813
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003814AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef59d0712017-11-10 20:51:43 +00003815It supports AMDGCN GFX6-GFX9.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003816
Tony Tyef16a45e2017-06-06 20:31:59 +00003817This section describes general syntax for instructions and operands. For more
3818information about instructions, their semantics and supported combinations of
3819operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00003820[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003821
Tony Tyef16a45e2017-06-06 20:31:59 +00003822An instruction has the following syntax (register operands are normally
3823comma-separated while extra operands are space-separated):
Tom Stellard45bb48e2015-06-13 03:28:10 +00003824
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003825*<opcode> <register_operand0>, ... <extra_operand0> ...*
Tom Stellard45bb48e2015-06-13 03:28:10 +00003826
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003827Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00003828~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00003829
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003830The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003831
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003832* SGPR registers: s0, ... or s[0], ...
3833* VGPR registers: v0, ... or v[0], ...
3834* TTMP registers: ttmp0, ... or ttmp[0], ...
3835* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3836* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3837* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3838* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3839* Register index expressions: v[2*2], s[1-1:2-1]
3840* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00003841
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003842The following extra operands are supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003843
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003844* offset, offset0, offset1
3845* idxen, offen bits
3846* glc, slc, tfe bits
3847* waitcnt: integer or combination of counter values
3848* VOP3 modifiers:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003849
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003850 - abs (\| \|), neg (\-)
Tom Stellard45bb48e2015-06-13 03:28:10 +00003851
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003852* DPP modifiers:
3853
3854 - row_shl, row_shr, row_ror, row_rol
3855 - row_mirror, row_half_mirror, row_bcast
3856 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3857 - row_mask, bank_mask, bound_ctrl
3858
3859* SDWA modifiers:
3860
3861 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3862 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3863 - abs, neg, sext
3864
Tony Tyef16a45e2017-06-06 20:31:59 +00003865Instruction Examples
3866~~~~~~~~~~~~~~~~~~~~
3867
3868DS
3869~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003870
3871.. code-block:: nasm
3872
3873 ds_add_u32 v2, v4 offset:16
3874 ds_write_src2_b64 v2 offset0:4 offset1:8
3875 ds_cmpst_f32 v2, v4, v6
3876 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3877
3878
3879For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3880
Tony Tyef16a45e2017-06-06 20:31:59 +00003881FLAT
3882++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003883
3884.. code-block:: nasm
3885
3886 flat_load_dword v1, v[3:4]
3887 flat_store_dwordx3 v[3:4], v[5:7]
3888 flat_atomic_swap v1, v[3:4], v5 glc
3889 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3890 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3891
3892For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3893
Tony Tyef16a45e2017-06-06 20:31:59 +00003894MUBUF
3895+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003896
3897.. code-block:: nasm
3898
3899 buffer_load_dword v1, off, s[4:7], s1
3900 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3901 buffer_store_format_xy v[1:2], off, s[4:7], s1
3902 buffer_wbinvl1
3903 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3904
3905For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3906
Tony Tyef16a45e2017-06-06 20:31:59 +00003907SMRD/SMEM
3908+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003909
3910.. code-block:: nasm
3911
3912 s_load_dword s1, s[2:3], 0xfc
3913 s_load_dwordx8 s[8:15], s[2:3], s4
3914 s_load_dwordx16 s[88:103], s[2:3], s4
3915 s_dcache_inv_vol
3916 s_memtime s[4:5]
3917
3918For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3919
Tony Tyef16a45e2017-06-06 20:31:59 +00003920SOP1
3921++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003922
3923.. code-block:: nasm
3924
3925 s_mov_b32 s1, s2
3926 s_mov_b64 s[0:1], 0x80000000
3927 s_cmov_b32 s1, 200
3928 s_wqm_b64 s[2:3], s[4:5]
3929 s_bcnt0_i32_b64 s1, s[2:3]
3930 s_swappc_b64 s[2:3], s[4:5]
3931 s_cbranch_join s[4:5]
3932
3933For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3934
Tony Tyef16a45e2017-06-06 20:31:59 +00003935SOP2
3936++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003937
3938.. code-block:: nasm
3939
3940 s_add_u32 s1, s2, s3
3941 s_and_b64 s[2:3], s[4:5], s[6:7]
3942 s_cselect_b32 s1, s2, s3
3943 s_andn2_b32 s2, s4, s6
3944 s_lshr_b64 s[2:3], s[4:5], s6
3945 s_ashr_i32 s2, s4, s6
3946 s_bfm_b64 s[2:3], s4, s6
3947 s_bfe_i64 s[2:3], s[4:5], s6
3948 s_cbranch_g_fork s[4:5], s[6:7]
3949
3950For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3951
Tony Tyef16a45e2017-06-06 20:31:59 +00003952SOPC
3953++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003954
3955.. code-block:: nasm
3956
3957 s_cmp_eq_i32 s1, s2
3958 s_bitcmp1_b32 s1, s2
3959 s_bitcmp0_b64 s[2:3], s4
3960 s_setvskip s3, s5
3961
3962For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3963
Tony Tyef16a45e2017-06-06 20:31:59 +00003964SOPP
3965++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003966
3967.. code-block:: nasm
3968
3969 s_barrier
3970 s_nop 2
3971 s_endpgm
3972 s_waitcnt 0 ; Wait for all counters to be 0
3973 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3974 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3975 s_sethalt 9
3976 s_sleep 10
3977 s_sendmsg 0x1
3978 s_sendmsg sendmsg(MSG_INTERRUPT)
3979 s_trap 1
3980
3981For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3982
3983Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00003984of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00003985range or acceptable values.
3986
Tony Tyef16a45e2017-06-06 20:31:59 +00003987VALU
3988++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00003989
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003990For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3991the assembler will automatically use optimal encoding based on its operands.
3992To force specific encoding, one can add a suffix to the opcode of the instruction:
3993
3994* _e32 for 32-bit VOP1/VOP2/VOPC
3995* _e64 for 64-bit VOP3
3996* _dpp for VOP_DPP
3997* _sdwa for VOP_SDWA
3998
3999VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004000
4001.. code-block:: nasm
4002
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004003 v_mov_b32 v1, v2
4004 v_mov_b32_e32 v1, v2
4005 v_nop
4006 v_cvt_f64_i32_e32 v[1:2], v2
4007 v_floor_f32_e32 v1, v2
4008 v_bfrev_b32_e32 v1, v2
4009 v_add_f32_e32 v1, v2, v3
4010 v_mul_i32_i24_e64 v1, v2, 3
4011 v_mul_i32_i24_e32 v1, -3, v3
4012 v_mul_i32_i24_e32 v1, -100, v3
4013 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4014 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00004015
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004016VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004017
4018.. code-block:: nasm
4019
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004020 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4021 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4022 v_mov_b32 v0, v0 wave_shl:1
4023 v_mov_b32 v0, v0 row_mirror
4024 v_mov_b32 v0, v0 row_bcast:31
4025 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4026 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4027 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00004028
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004029VOP_SDWA examples:
4030
4031.. code-block:: nasm
4032
4033 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4034 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4035 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4036 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4037 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4038
4039For full list of supported instructions, refer to "Vector ALU instructions".
4040
4041HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00004042~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004043
4044AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4045one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00004046
4047.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00004048+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00004049
4050*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004051object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004052
4053.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00004054+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4055
Tom Stellard347ac792015-06-26 21:15:07 +00004056
4057*major*, *minor*, and *stepping* are all integers that describe the instruction
4058set architecture (ISA) version of the assembly program.
4059
4060*vendor* and *arch* are quoted strings. *vendor* should always be equal to
4061"AMD" and *arch* should always be equal to "AMDGPU".
4062
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004063By default, the assembler will derive the ISA version, *vendor*, and *arch*
4064from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004065
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004066.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00004067+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004068
4069This directives specifies that the symbol with given name is a kernel entry point
4070(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00004071
4072.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00004073++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00004074
4075This directive marks the beginning of a list of key / value pairs that are used
4076to specify the amd_kernel_code_t object that will be emitted by the assembler.
4077The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4078any amd_kernel_code_t values that are unspecified a default value will be
4079used. The default value for all keys is 0, with the following exceptions:
4080
4081- *kernel_code_version_major* defaults to 1.
4082- *machine_kind* defaults to 1.
4083- *machine_version_major*, *machine_version_minor*, and
4084 *machine_version_stepping* are derived from the value of the -mcpu option
4085 that is passed to the assembler.
4086- *kernel_code_entry_byte_offset* defaults to 256.
4087- *wavefront_size* defaults to 6.
4088- *kernarg_segment_alignment*, *group_segment_alignment*, and
Tony Tye6baa6d22017-10-18 22:16:55 +00004089 *private_segment_alignment* default to 4. Note that alignments are specified
Tom Stellardff7416b2015-06-26 21:58:31 +00004090 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4091
4092The *.amd_kernel_code_t* directive must be placed immediately after the
4093function label and before any instructions.
4094
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004095For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4096comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00004097
4098Here is an example of a minimal amd_kernel_code_t specification:
4099
Aaron Ballman887ad0e2016-07-19 17:46:55 +00004100.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00004101
4102 .hsa_code_object_version 1,0
4103 .hsa_code_object_isa
4104
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004105 .hsatext
4106 .globl hello_world
4107 .p2align 8
4108 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00004109
4110 hello_world:
4111
4112 .amd_kernel_code_t
4113 enable_sgpr_kernarg_segment_ptr = 1
4114 is_ptr64 = 1
4115 compute_pgm_rsrc1_vgprs = 0
4116 compute_pgm_rsrc1_sgprs = 0
4117 compute_pgm_rsrc2_user_sgpr = 2
4118 kernarg_segment_byte_size = 8
4119 wavefront_sgpr_count = 2
4120 workitem_vgpr_count = 3
4121 .end_amd_kernel_code_t
4122
4123 s_load_dwordx2 s[0:1], s[0:1] 0x0
4124 v_mov_b32 v0, 3.14159
4125 s_waitcnt lgkmcnt(0)
4126 v_mov_b32 v1, s0
4127 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004128 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00004129 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00004130 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004131 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00004132
4133Additional Documentation
4134========================
4135
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004136.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4137.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4138.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4139.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4140.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4141.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4142.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4143.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004144.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
4145.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
4146.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4147.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4148.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4149.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4150.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00004151.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004152.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4153.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
4154.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__