blob: 1cf30304dfc841227be768f83a1546b93f156feb [file] [log] [blame]
Tony Tyef16a45e2017-06-06 20:31:59 +00001=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
Tony Tye07d9f102017-11-10 01:00:54 +000026 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
Tony Tyef16a45e2017-06-06 20:31:59 +000028
Tony Tye07d9f102017-11-10 01:00:54 +000029 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000035
Tony Tye07d9f102017-11-10 01:00:54 +000036 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000038
Tony Tye07d9f102017-11-10 01:00:54 +000039 ============ ==============================================================
40 Vendor Description
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000045
Tony Tye07d9f102017-11-10 01:00:54 +000046 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
Tony Tyef16a45e2017-06-06 20:31:59 +000048
Tony Tye07d9f102017-11-10 01:00:54 +000049 ============== ============================================================
50 OS Description
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
56 runtime.
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
58 runtime.
59 ============== ============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000060
Tony Tye07d9f102017-11-10 01:00:54 +000061 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
Tony Tyef16a45e2017-06-06 20:31:59 +000063
Tony Tye07d9f102017-11-10 01:00:54 +000064 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
67 *<empty>* Defaults to ``opencl``.
68 ``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
69 ``amdgizcl`` Same as ``opencl`` except a different address space mapping is
70 used (see :ref:`amdgpu-address-spaces`).
71 ``amdgiz`` Same as ``opencl`` except a different address space mapping is
72 used (see :ref:`amdgpu-address-spaces`).
73 ``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
74 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000075
76.. _amdgpu-processors:
77
78Processors
79----------
80
81Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
82names from both the *Processor* and *Alternative Processor* can be used.
83
84 .. table:: AMDGPU Processors
Tony Tye07d9f102017-11-10 01:00:54 +000085 :name: amdgpu-processor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000086
Tony Tye07d9f102017-11-10 01:00:54 +000087 =========== =============== ============ ===== ======= ==================
88 Processor Alternative Target dGPU/ ROCm Example
89 Processor Triple APU Support Products
90 Architecture
91 =========== =============== ============ ===== ======= ==================
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000092 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
Tony Tye07d9f102017-11-10 01:00:54 +000093 -------------------------------------------------------------------------
94 ``r600`` ``r600`` dGPU
95 ``r630`` ``r600`` dGPU
96 ``rs880`` ``r600`` dGPU
97 ``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000098 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
Tony Tye07d9f102017-11-10 01:00:54 +000099 -------------------------------------------------------------------------
100 ``rv710`` ``r600`` dGPU
101 ``rv730`` ``r600`` dGPU
102 ``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000103 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
Tony Tye07d9f102017-11-10 01:00:54 +0000104 -------------------------------------------------------------------------
105 ``cedar`` ``r600`` dGPU
106 ``redwood`` ``r600`` dGPU
107 ``sumo`` ``r600`` dGPU
108 ``juniper`` ``r600`` dGPU
109 ``cypress`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000110 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
Tony Tye07d9f102017-11-10 01:00:54 +0000111 -------------------------------------------------------------------------
112 ``barts`` ``r600`` dGPU
113 ``turks`` ``r600`` dGPU
114 ``caicos`` ``r600`` dGPU
115 ``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000116 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
Tony Tye07d9f102017-11-10 01:00:54 +0000117 -------------------------------------------------------------------------
118 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
119 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU
120 - ``verde``
121 - ``oland``
122 - ``hainan``
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000123 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
Tony Tye07d9f102017-11-10 01:00:54 +0000124 -------------------------------------------------------------------------
125 ``gfx700`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
126 - Radeon HD 8770
127 - R7 260
128 - R7 260X
129 \ - ``kaveri`` ``amdgcn`` APU - A6-7000
130 - A6 Pro-7050B
131 - A8-7100
132 - A8 Pro-7150B
133 - A10-7300
134 - A10 Pro-7350B
135 - FX-7500
136 - A8-7200P
137 - A10-7400P
138 - FX-7600P
139 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
140 - FirePro W9100
141 - FirePro S9150
142 - FirePro S9170
143 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
144 - Radeon R9 290x
145 - Radeon R390
146 - Radeon R390x
147 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
148 - ``mullins`` - E1-2200
149 - E1-2500
150 - E2-3000
151 - E2-3800
152 - A4-5000
153 - A4-5100
154 - A6-5200
155 - A4 Pro-3340B
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000156 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
Tony Tye07d9f102017-11-10 01:00:54 +0000157 -------------------------------------------------------------------------
158 ``gfx800`` - ``iceland`` ``amdgcn`` dGPU - FirePro S7150
159 - FirePro S7100
160 - FirePro W7100
161 - Radeon R285
162 - Radeon R9 380
163 - Radeon R9 385
164 - Mobile FirePro
165 M7170
166 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - A6-8500P
167 - Pro A6-8500B
168 - A8-8600P
169 - Pro A8-8600B
170 - FX-8800P
171 - Pro A12-8800B
172 \ ``amdgcn`` APU ROCm - A10-8700P
173 - Pro A10-8700B
174 - A10-8780P
175 \ ``amdgcn`` APU - A10-9600P
176 - A10-9630P
177 - A12-9700P
178 - A12-9730P
179 - FX-9800P
180 - FX-9830P
181 \ ``amdgcn`` APU - E2-9010
182 - A6-9210
183 - A9-9410
184 ``gfx802`` - ``tonga`` ``amdgcn`` dGPU ROCm Same as gfx800
185 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU ROCm - Radeon R9 Nano
186 - Radeon R9 Fury
187 - Radeon R9 FuryX
188 - Radeon Pro Duo
189 - FirePro S9300x2
190 - Radeon Instinct MI8
191 \ - ``polaris10`` ``amdgcn`` dGPU ROCm - Radeon RX 470
192 - Radeon RX 480
193 - Radeon Instinct MI6
194 \ - ``polaris11`` ``amdgcn`` dGPU ROCm - Radeon RX 460
195 ``gfx810`` - ``stoney`` ``amdgcn`` APU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000196 **GCN GFX9** [AMD-GCN-GFX9]_
Tony Tye07d9f102017-11-10 01:00:54 +0000197 -------------------------------------------------------------------------
198 ``gfx900`` ``amdgcn`` dGPU ROCm - Radeon Vega
199 Frontier Edition
200 - Radeon RX Vega 56
201 - Radeon RX Vega 64
202 - Radeon RX Vega 64
203 Liquid
204 - Radeon Instinct MI25
205 ``gfx902`` ``amdgcn`` APU *TBA*
Tony Tyef16a45e2017-06-06 20:31:59 +0000206
Tony Tye07d9f102017-11-10 01:00:54 +0000207 .. TODO
208 Add product
209 names.
210 =========== =============== ============ ===== ======= ==================
211
212.. _amdgpu-target-features:
213
214Target Features
215---------------
216
217Target features control how code is generated to support certain
218features. Not all target features are supported by all processors. The
219runtime must ensure that the features supported by the device used to
220execute the code match the features enabled when generating the
221code. A mismatch of features may result in incorrect execution, or a
222reduction in performance.
223
224Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
225target features.
226
227For example:
228
229``-mxnack``
230 Enable the *XNACK* feature.
231``-mno-xnack``
232 Disable the *XNACK* feature.
233
234 .. table:: AMDGPU Target Features
235 :name: amdgpu-target-feature-table
236
237 ============== ======== ==================================================
238 Target Feature Default Description
239 ============== ======== ==================================================
240 -m[no-]xnack disabled Enable/disable generating code that has
241 memory clauses that are compatible with
242 having XNACK replay enabled.
243
244 This is used for demand paging and page
245 migration. If XNACK replay is enabled in
246 the device, then if a page fault occurs
247 the code may execute incorrectly if the
248 XNACK feature is not enabled. Executing
249 code that has the feature enabled on a
250 device that does not have XNACK replay
251 enabled will execute correctly, but may
252 be less performant than code with the
253 feature disabled.
254
255 This feature is supported by the
Tony Tye35077502017-11-11 00:50:32 +0000256 ``amdgcn`` architecture for GFX8-GFX9.
Tony Tye07d9f102017-11-10 01:00:54 +0000257 ============== ======== ==================================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000258
259.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000260
261Address Spaces
262--------------
263
Tony Tyef16a45e2017-06-06 20:31:59 +0000264The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000265
Tony Tyef16a45e2017-06-06 20:31:59 +0000266The memory space names used in the table, aside from the region memory space, is
267from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000268
Tony Tyef16a45e2017-06-06 20:31:59 +0000269LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000270
Tony Tyef16a45e2017-06-06 20:31:59 +0000271 .. table:: Address Space Mapping
272 :name: amdgpu-address-space-mapping-table
273
274 ================== ================= ================= ================= =================
275 LLVM Address Space Memory Space
276 ------------------ -----------------------------------------------------------------------
277 \ Current Default amdgiz/amdgizcl hcc Future Default
278 ================== ================= ================= ================= =================
279 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
280 1 Global Global Global Global
281 2 Constant Constant Constant Region (GDS)
282 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
283 4 Generic (Flat) Region (GDS) Region (GDS) Constant
284 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
285 ================== ================= ================= ================= =================
286
287Current Default
288 This is the current default address space mapping used for all languages
289 except hcc. This will shortly be deprecated.
290
291amdgiz/amdgizcl
292 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
293 is specified as the target triple environment value.
294
295hcc
296 This is the current address space mapping used when ``hcc`` is specified as
297 the target triple environment value.This will shortly be deprecated.
298
299Future Default
300 This will shortly be the only address space mapping for all languages using
301 AMDGPU backend.
302
303.. _amdgpu-memory-scopes:
304
305Memory Scopes
306-------------
307
308This section provides LLVM memory synchronization scopes supported by the AMDGPU
309backend memory model when the target triple OS is ``amdhsa`` (see
310:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
311
312The memory model supported is based on the HSA memory model [HSA]_ which is
313based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
314relation is transitive over the synchonizes-with relation independent of scope,
315and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye07d9f102017-11-10 01:00:54 +0000316table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000317
318This is different to the OpenCL [OpenCL]_ memory model which does not have scope
319inclusion and requires the memory scopes to exactly match. However, this
320is conservatively correct for OpenCL.
321
Tony Tye07d9f102017-11-10 01:00:54 +0000322 .. table:: AMDHSA LLVM Sync Scopes
323 :name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tyef16a45e2017-06-06 20:31:59 +0000324
325 ================ ==========================================================
326 LLVM Sync Scope Description
327 ================ ==========================================================
328 *none* The default: ``system``.
329
330 Synchronizes with, and participates in modification and
331 seq_cst total orderings with, other operations (except
332 image operations) for all address spaces (except private,
333 or generic that accesses private) provided the other
334 operation's sync scope is:
335
336 - ``system``.
337 - ``agent`` and executed by a thread on the same agent.
338 - ``workgroup`` and executed by a thread in the same
339 workgroup.
340 - ``wavefront`` and executed by a thread in the same
341 wavefront.
342
343 ``agent`` Synchronizes with, and participates in modification and
344 seq_cst total orderings with, other operations (except
345 image operations) for all address spaces (except private,
346 or generic that accesses private) provided the other
347 operation's sync scope is:
348
349 - ``system`` or ``agent`` and executed by a thread on the
350 same agent.
351 - ``workgroup`` and executed by a thread in the same
352 workgroup.
353 - ``wavefront`` and executed by a thread in the same
354 wavefront.
355
356 ``workgroup`` Synchronizes with, and participates in modification and
357 seq_cst total orderings with, other operations (except
358 image operations) for all address spaces (except private,
359 or generic that accesses private) provided the other
360 operation's sync scope is:
361
362 - ``system``, ``agent`` or ``workgroup`` and executed by a
363 thread in the same workgroup.
364 - ``wavefront`` and executed by a thread in the same
365 wavefront.
366
367 ``wavefront`` Synchronizes with, and participates in modification and
368 seq_cst total orderings with, other operations (except
369 image operations) for all address spaces (except private,
370 or generic that accesses private) provided the other
371 operation's sync scope is:
372
373 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
374 and executed by a thread in the same wavefront.
375
376 ``singlethread`` Only synchronizes with, and participates in modification
377 and seq_cst total orderings with, other operations (except
378 image operations) running in the same thread for all
379 address spaces (for example, in signal handlers).
380 ================ ==========================================================
381
382AMDGPU Intrinsics
383-----------------
384
385The AMDGPU backend implements the following intrinsics.
386
387*This section is WIP.*
388
389.. TODO
390 List AMDGPU intrinsics
391
392Code Object
393===========
394
395The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
396can be linked by ``lld`` to produce a standard ELF shared code object which can
397be loaded and executed on an AMDGPU target.
398
399Header
400------
401
402The AMDGPU backend uses the following ELF header:
403
404 .. table:: AMDGPU ELF Header
405 :name: amdgpu-elf-header-table
406
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000407 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000408 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000409 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000410 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
411 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye07d9f102017-11-10 01:00:54 +0000412 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
413 - ``ELFOSABI_AMDGPU_HSA``
414 - ``ELFOSABI_AMDGPU_PAL``
415 - ``ELFOSABI_AMDGPU_MESA3D``
416 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
417 - ``ELFABIVERSION_AMDGPU_PAL``
418 - ``ELFABIVERSION_AMDGPU_MESA3D``
419 ``e_type`` - ``ET_REL``
420 - ``ET_DYN``
Tony Tyef16a45e2017-06-06 20:31:59 +0000421 ``e_machine`` ``EM_AMDGPU``
422 ``e_entry`` 0
Tony Tye07d9f102017-11-10 01:00:54 +0000423 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000424 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000425
426..
427
428 .. table:: AMDGPU ELF Header Enumeration Values
429 :name: amdgpu-elf-header-enumeration-values-table
430
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000431 =============================== =====
432 Name Value
433 =============================== =====
434 ``EM_AMDGPU`` 224
Tony Tye07d9f102017-11-10 01:00:54 +0000435 ``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000436 ``ELFOSABI_AMDGPU_HSA`` 64
437 ``ELFOSABI_AMDGPU_PAL`` 65
438 ``ELFOSABI_AMDGPU_MESA3D`` 66
439 ``ELFABIVERSION_AMDGPU_HSA`` 1
440 ``ELFABIVERSION_AMDGPU_PAL`` 0
441 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
442 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000443
444``e_ident[EI_CLASS]``
Tony Tye07d9f102017-11-10 01:00:54 +0000445 The ELF class is:
446
447 * ``ELFCLASS32`` for ``r600`` architecture.
448
449 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
450 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000451
452``e_ident[EI_DATA]``
Tony Tye07d9f102017-11-10 01:00:54 +0000453 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tyef16a45e2017-06-06 20:31:59 +0000454
455``e_ident[EI_OSABI]``
Tony Tye07d9f102017-11-10 01:00:54 +0000456 One of the following AMD GPU architecture specific OS ABIs
457 (see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000458
Tony Tye07d9f102017-11-10 01:00:54 +0000459 * ``ELFOSABI_NONE`` for *unknown* OS.
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000460
Tony Tye07d9f102017-11-10 01:00:54 +0000461 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tyef16a45e2017-06-06 20:31:59 +0000462
Tony Tye07d9f102017-11-10 01:00:54 +0000463 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
464
465 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000466
Tony Tyef16a45e2017-06-06 20:31:59 +0000467``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000468 The ABI version of the AMD GPU architecture specific OS ABI to which the code
469 object conforms:
470
471 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
472 runtime ABI.
473
474 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
475 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000476
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000477 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye07d9f102017-11-10 01:00:54 +0000478 3D runtime ABI.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000479
Tony Tyef16a45e2017-06-06 20:31:59 +0000480``e_type``
481 Can be one of the following values:
482
483
484 ``ET_REL``
485 The type produced by the AMD GPU backend compiler as it is relocatable code
486 object.
487
488 ``ET_DYN``
489 The type produced by the linker as it is a shared code object.
490
491 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
492
493``e_machine``
Tony Tye07d9f102017-11-10 01:00:54 +0000494 The value ``EM_AMDGPU`` is used for the machine for all processors supported
495 by the ``r600`` and ``amdgcn`` architectures (see
496 :ref:`amdgpu-processor-table`). The specific processor is specified in the
497 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
498 :ref:`amdgpu-elf-header-e_flags-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000499
500``e_entry``
501 The entry point is 0 as the entry points for individual kernels must be
502 selected in order to invoke them through AQL packets.
503
504``e_flags``
Tony Tye07d9f102017-11-10 01:00:54 +0000505 The AMDGPU backend uses the following ELF header flags:
506
507 .. table:: AMDGPU ELF Header ``e_flags``
508 :name: amdgpu-elf-header-e_flags-table
509
510 ================================= ========== =============================
511 Name Value Description
512 ================================= ========== =============================
513 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
514 -------------------------------------------- -----------------------------
515 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
516 mask for
517 ``EF_AMDGPU_MACH_xxx`` values
518 defined in
519 :ref:`amdgpu-ef-amdgpu-mach-table`.
520 ================================= ========== =============================
521
522 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
523 :name: amdgpu-ef-amdgpu-mach-table
524
525 ================================= ========== =============================
526 Name Value Description (see
527 :ref:`amdgpu-processor-table`)
528 ================================= ========== =============================
529 ``EF_AMDGPU_MACH_NONE`` 0 *not specified*
530 ``EF_AMDGPU_MACH_R600_R600`` 1 ``r600``
531 ``EF_AMDGPU_MACH_R600_R630`` 2 ``r630``
532 ``EF_AMDGPU_MACH_R600_RS880`` 3 ``rs880``
533 ``EF_AMDGPU_MACH_R600_RV670`` 4 ``rv670``
534 ``EF_AMDGPU_MACH_R600_RV710`` 5 ``rv710``
535 ``EF_AMDGPU_MACH_R600_RV730`` 6 ``rv730``
536 ``EF_AMDGPU_MACH_R600_RV770`` 7 ``rv770``
537 ``EF_AMDGPU_MACH_R600_CEDAR`` 8 ``cedar``
538 ``EF_AMDGPU_MACH_R600_REDWOOD`` 9 ``redwood``
539 ``EF_AMDGPU_MACH_R600_SUMO`` 10 ``sumo``
540 ``EF_AMDGPU_MACH_R600_JUNIPER`` 11 ``juniper``
541 ``EF_AMDGPU_MACH_R600_CYPRESS`` 12 ``cypress``
542 ``EF_AMDGPU_MACH_R600_BARTS`` 13 ``barts``
543 ``EF_AMDGPU_MACH_R600_TURKS`` 14 ``turks``
544 ``EF_AMDGPU_MACH_R600_CAICOS`` 15 ``caicos``
545 ``EF_AMDGPU_MACH_R600_CAYMAN`` 16 ``cayman``
546 *reserved* 17-31 Reserved for ``r600``
547 architecture processors.
548 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 32 ``gfx600``
549 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 33 ``gfx601``
550 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 34 ``gfx700``
551 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 35 ``gfx701``
552 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 36 ``gfx702``
553 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 37 ``gfx703``
554 ``EF_AMDGPU_MACH_AMDGCN_GFX800`` 38 ``gfx800``
555 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 39 ``gfx801``
556 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 40 ``gfx802``
557 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 41 ``gfx803``
558 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 42 ``gfx810``
559 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 43 ``gfx900``
560 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 44 ``gfx902``
561 ================================= ========== =============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000562
563Sections
564--------
565
566An AMDGPU target ELF code object has the standard ELF sections which include:
567
568 .. table:: AMDGPU ELF Sections
569 :name: amdgpu-elf-sections-table
570
571 ================== ================ =================================
572 Name Type Attributes
573 ================== ================ =================================
574 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
575 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
576 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
577 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
578 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
579 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
580 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
581 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
582 ``.note`` ``SHT_NOTE`` *none*
583 ``.rela``\ *name* ``SHT_RELA`` *none*
584 ``.rela.dyn`` ``SHT_RELA`` *none*
585 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
586 ``.shstrtab`` ``SHT_STRTAB`` *none*
587 ``.strtab`` ``SHT_STRTAB`` *none*
588 ``.symtab`` ``SHT_SYMTAB`` *none*
589 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
590 ================== ================ =================================
591
592These sections have their standard meanings (see [ELF]_) and are only generated
593if needed.
594
595``.debug``\ *\**
596 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
597 DWARF produced by the AMDGPU backend.
598
Tony Tye46d35762017-08-15 20:47:41 +0000599``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000600 The standard sections used by a dynamic loader.
601
602``.note``
603 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
604 backend.
605
606``.rela``\ *name*, ``.rela.dyn``
607 For relocatable code objects, *name* is the name of the section that the
608 relocation records apply. For example, ``.rela.text`` is the section name for
609 relocation records associated with the ``.text`` section.
610
611 For linked shared code objects, ``.rela.dyn`` contains all the relocation
612 records from each of the relocatable code object's ``.rela``\ *name* sections.
613
614 See :ref:`amdgpu-relocation-records` for the relocation records supported by
615 the AMDGPU backend.
616
617``.text``
618 The executable machine code for the kernels and functions they call. Generated
619 as position independent code. See :ref:`amdgpu-code-conventions` for
620 information on conventions used in the isa generation.
621
622.. _amdgpu-note-records:
623
624Note Records
625------------
626
Tony Tye07d9f102017-11-10 01:00:54 +0000627As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
628be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
629aligned. In addition, minimal zero byte padding must be generated to ensure the
630``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
631``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tyef16a45e2017-06-06 20:31:59 +0000632
633The AMDGPU backend code object uses the following ELF note records in the
634``.note`` section. The *Description* column specifies the layout of the note
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000635record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tyef16a45e2017-06-06 20:31:59 +0000636variable size strings have a corresponding ``*_size`` field that specifies the
637number of bytes, including the terminating null character, in the string. The
638string(s) come immediately after the preceding fields.
639
640Additional note records can be present.
641
642 .. table:: AMDGPU ELF Note Records
643 :name: amdgpu-elf-note-records-table
644
Tony Tye46d35762017-08-15 20:47:41 +0000645 ===== ============================== ======================================
646 Name Type Description
647 ===== ============================== ======================================
648 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye46d35762017-08-15 20:47:41 +0000649 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000650
651..
652
653 .. table:: AMDGPU ELF Note Record Enumeration Values
654 :name: amdgpu-elf-note-record-enumeration-values-table
655
Tony Tye46d35762017-08-15 20:47:41 +0000656 ============================== =====
657 Name Value
658 ============================== =====
659 *reserved* 0-9
660 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye07d9f102017-11-10 01:00:54 +0000661 *reserved* 11
Tony Tye46d35762017-08-15 20:47:41 +0000662 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000663
Tony Tye46d35762017-08-15 20:47:41 +0000664``NT_AMD_AMDGPU_HSA_METADATA``
665 Specifies extensible metadata associated with the code objects executed on HSA
666 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
667 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
668 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
669 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000670
Tony Tye46d35762017-08-15 20:47:41 +0000671.. _amdgpu-symbols:
672
673Symbols
674-------
675
676Symbols include the following:
677
678 .. table:: AMDGPU ELF Symbols
679 :name: amdgpu-elf-symbols-table
680
681 ===================== ============== ============= ==================
682 Name Type Section Description
683 ===================== ============== ============= ==================
684 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
685 - ``.rodata``
686 - ``.bss``
687 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
688 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
689 ===================== ============== ============= ==================
690
691Global variable
692 Global variables both used and defined by the compilation unit.
693
694 If the symbol is defined in the compilation unit then it is allocated in the
695 appropriate section according to if it has initialized data or is readonly.
696
697 If the symbol is external then its section is ``STN_UNDEF`` and the loader
698 will resolve relocations using the definition provided by another code object
699 or explicitly defined by the runtime.
700
701 All global symbols, whether defined in the compilation unit or external, are
702 accessed by the machine code indirectly through a GOT table entry. This
703 allows them to be preemptable. The GOT table is only supported when the target
704 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000705
706 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000707 Add description of linked shared object symbols. Seems undefined symbols
708 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000709
Tony Tye46d35762017-08-15 20:47:41 +0000710Kernel descriptor
711 Every HSA kernel has an associated kernel descriptor. It is the address of the
712 kernel descriptor that is used in the AQL dispatch packet used to invoke the
713 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
714 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
715
716Kernel entry point
717 Every HSA kernel also has a symbol for its machine code entry point.
718
719.. _amdgpu-relocation-records:
720
721Relocation Records
722------------------
723
724AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
725relocatable fields are:
726
727``word32``
728 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
729 alignment. These values use the same byte order as other word values in the
730 AMD GPU architecture.
731
732``word64``
733 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
734 alignment. These values use the same byte order as other word values in the
735 AMD GPU architecture.
736
737Following notations are used for specifying relocation calculations:
738
739**A**
740 Represents the addend used to compute the value of the relocatable field.
741
742**G**
743 Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000744 entry's symbol will reside during execution.
Tony Tye46d35762017-08-15 20:47:41 +0000745
746**GOT**
747 Represents the address of the global offset table.
748
749**P**
750 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
751 of the storage unit being relocated (computed using ``r_offset``).
752
753**S**
754 Represents the value of the symbol whose index resides in the relocation
Tony Tyed2884302017-10-16 20:44:29 +0000755 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
756
757**B**
758 Represents the base address of a loaded executable or shared object which is
759 the difference between the ELF address and the actual load address. Relocations
760 using this are only valid in executable or shared objects.
Tony Tye46d35762017-08-15 20:47:41 +0000761
762The following relocation types are supported:
763
764 .. table:: AMDGPU ELF Relocation Records
765 :name: amdgpu-elf-relocation-records-table
766
767 ========================== ===== ========== ==============================
768 Relocation Type Value Field Calculation
769 ========================== ===== ========== ==============================
770 ``R_AMDGPU_NONE`` 0 *none* *none*
771 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
772 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
773 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
774 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
775 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
776 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
777 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
778 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
779 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
780 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
781 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
Tony Tyed2884302017-10-16 20:44:29 +0000782 *reserved* 12
783 ``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
Tony Tye46d35762017-08-15 20:47:41 +0000784 ========================== ===== ========== ==============================
785
786.. _amdgpu-dwarf:
787
788DWARF
789-----
790
791Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
792information that maps the code object executable code and data to the source
793language constructs. It can be used by tools such as debuggers and profilers.
794
795Address Space Mapping
796~~~~~~~~~~~~~~~~~~~~~
797
798The following address space mapping is used:
799
800 .. table:: AMDGPU DWARF Address Space Mapping
801 :name: amdgpu-dwarf-address-space-mapping-table
802
803 =================== =================
804 DWARF Address Space Memory Space
805 =================== =================
806 1 Private (Scratch)
807 2 Local (group/LDS)
808 *omitted* Global
809 *omitted* Constant
810 *omitted* Generic (Flat)
811 *not supported* Region (GDS)
812 =================== =================
813
814See :ref:`amdgpu-address-spaces` for information on the memory space terminology
815used in the table.
816
817An ``address_class`` attribute is generated on pointer type DIEs to specify the
818DWARF address space of the value of the pointer when it is in the *private* or
819*local* address space. Otherwise the attribute is omitted.
820
821An ``XDEREF`` operation is generated in location list expressions for variables
822that are allocated in the *private* and *local* address space. Otherwise no
823``XDREF`` is omitted.
824
825Register Mapping
826~~~~~~~~~~~~~~~~
827
828*This section is WIP.*
829
830.. TODO
831 Define DWARF register enumeration.
832
833 If want to present a wavefront state then should expose vector registers as
834 64 wide (rather than per work-item view that LLVM uses). Either as separate
835 registers, or a 64x4 byte single register. In either case use a new LANE op
836 (akin to XDREF) to select the current lane usage in a location
837 expression. This would also allow scalar register spilling to vector register
838 lanes to be expressed (currently no debug information is being generated for
839 spilling). If choose a wide single register approach then use LANE in
840 conjunction with PIECE operation to select the dword part of the register for
841 the current lane. If the separate register approach then use LANE to select
842 the register.
843
844Source Text
845~~~~~~~~~~~
846
847*This section is WIP.*
848
849.. TODO
850 DWARF extension to include runtime generated source text.
851
852.. _amdgpu-code-conventions:
853
854Code Conventions
855================
856
857This section provides code conventions used for each supported target triple OS
858(see :ref:`amdgpu-target-triples`).
859
860AMDHSA
861------
862
863This section provides code conventions used when the target triple OS is
864``amdhsa`` (see :ref:`amdgpu-target-triples`).
865
866.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000867
868Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000869~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000870
Tony Tye46d35762017-08-15 20:47:41 +0000871The code object metadata specifies extensible metadata associated with the code
872objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
873[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
874(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
875``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
876information necessary to support the ROCM kernel queries. For example, the
877segment sizes needed in a dispatch packet. In addition, a high level language
878runtime may require other information to be included. For example, the AMD
879OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000880
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000881The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000882:doc:`YamlIO`).
883
Tony Tye46d35762017-08-15 20:47:41 +0000884.. TODO
885 Is the string null terminated? It probably should not if YAML allows it to
886 contain null characters, otherwise it should be.
887
Tony Tyef16a45e2017-06-06 20:31:59 +0000888The metadata is represented as a single YAML document comprised of the mapping
889defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
890referenced tables.
891
892For boolean values, the string values of ``false`` and ``true`` are used for
893false and true respectively.
894
895Additional information can be added to the mappings. To avoid conflicts, any
896non-AMD key names should be prefixed by "*vendor-name*.".
897
898 .. table:: AMDHSA Code Object Metadata Mapping
899 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
900
901 ========== ============== ========= =======================================
902 String Key Value Type Required? Description
903 ========== ============== ========= =======================================
904 "Version" sequence of Required - The first integer is the major
905 2 integers version. Currently 1.
906 - The second integer is the minor
907 version. Currently 0.
908 "Printf" sequence of Each string is encoded information
909 strings about a printf function call. The
910 encoded information is organized as
911 fields separated by colon (':'):
912
913 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
914
915 where:
916
917 ``ID``
918 A 32 bit integer as a unique id for
919 each printf function call
920
921 ``N``
922 A 32 bit integer equal to the number
923 of arguments of printf function call
924 minus 1
925
926 ``S[i]`` (where i = 0, 1, ... , N-1)
927 32 bit integers for the size in bytes
928 of the i-th FormatString argument of
929 the printf function call
930
931 FormatString
932 The format string passed to the
933 printf function call.
934 "Kernels" sequence of Required Sequence of the mappings for each
935 mapping kernel in the code object. See
936 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
937 for the definition of the mapping.
938 ========== ============== ========= =======================================
939
940..
941
942 .. table:: AMDHSA Code Object Kernel Metadata Mapping
943 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
944
945 ================= ============== ========= ================================
946 String Key Value Type Required? Description
947 ================= ============== ========= ================================
948 "Name" string Required Source name of the kernel.
949 "SymbolName" string Required Name of the kernel
950 descriptor ELF symbol.
951 "Language" string Source language of the kernel.
952 Values include:
953
954 - "OpenCL C"
955 - "OpenCL C++"
956 - "HCC"
957 - "OpenMP"
958
959 "LanguageVersion" sequence of - The first integer is the major
960 2 integers version.
961 - The second integer is the
962 minor version.
963 "Attrs" mapping Mapping of kernel attributes.
964 See
965 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
966 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +0000967 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +0000968 mapping kernel arguments. See
969 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
970 for the definition of the mapping.
971 "CodeProps" mapping Mapping of properties related to
972 the kernel code. See
973 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
974 for the mapping definition.
Tony Tyef16a45e2017-06-06 20:31:59 +0000975 ================= ============== ========= ================================
976
977..
978
979 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
980 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
981
982 =================== ============== ========= ==============================
983 String Key Value Type Required? Description
984 =================== ============== ========= ==============================
985 "ReqdWorkGroupSize" sequence of The dispatch work-group size
986 3 integers X, Y, Z must correspond to the
987 specified values.
988
989 Corresponds to the OpenCL
990 ``reqd_work_group_size``
991 attribute.
992 "WorkGroupSizeHint" sequence of The dispatch work-group size
993 3 integers X, Y, Z is likely to be the
994 specified values.
995
996 Corresponds to the OpenCL
997 ``work_group_size_hint``
998 attribute.
999 "VecTypeHint" string The name of a scalar or vector
1000 type.
1001
1002 Corresponds to the OpenCL
1003 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +00001004
1005 "RuntimeHandle" string The external symbol name
1006 associated with a kernel.
1007 OpenCL runtime allocates a
1008 global buffer for the symbol
1009 and saves the kernel's address
1010 to it, which is used for
1011 device side enqueueing. Only
1012 available for device side
1013 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +00001014 =================== ============== ========= ==============================
1015
1016..
1017
1018 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1019 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1020
1021 ================= ============== ========= ================================
1022 String Key Value Type Required? Description
1023 ================= ============== ========= ================================
1024 "Name" string Kernel argument name.
1025 "TypeName" string Kernel argument type name.
1026 "Size" integer Required Kernel argument size in bytes.
1027 "Align" integer Required Kernel argument alignment in
1028 bytes. Must be a power of two.
1029 "ValueKind" string Required Kernel argument kind that
1030 specifies how to set up the
1031 corresponding argument.
1032 Values include:
1033
1034 "ByValue"
1035 The argument is copied
1036 directly into the kernarg.
1037
1038 "GlobalBuffer"
1039 A global address space pointer
1040 to the buffer data is passed
1041 in the kernarg.
1042
1043 "DynamicSharedPointer"
1044 A group address space pointer
1045 to dynamically allocated LDS
1046 is passed in the kernarg.
1047
1048 "Sampler"
1049 A global address space
1050 pointer to a S# is passed in
1051 the kernarg.
1052
1053 "Image"
1054 A global address space
1055 pointer to a T# is passed in
1056 the kernarg.
1057
1058 "Pipe"
1059 A global address space pointer
1060 to an OpenCL pipe is passed in
1061 the kernarg.
1062
1063 "Queue"
1064 A global address space pointer
1065 to an OpenCL device enqueue
1066 queue is passed in the
1067 kernarg.
1068
1069 "HiddenGlobalOffsetX"
1070 The OpenCL grid dispatch
1071 global offset for the X
1072 dimension is passed in the
1073 kernarg.
1074
1075 "HiddenGlobalOffsetY"
1076 The OpenCL grid dispatch
1077 global offset for the Y
1078 dimension is passed in the
1079 kernarg.
1080
1081 "HiddenGlobalOffsetZ"
1082 The OpenCL grid dispatch
1083 global offset for the Z
1084 dimension is passed in the
1085 kernarg.
1086
1087 "HiddenNone"
1088 An argument that is not used
1089 by the kernel. Space needs to
1090 be left for it, but it does
1091 not need to be set up.
1092
1093 "HiddenPrintfBuffer"
1094 A global address space pointer
1095 to the runtime printf buffer
1096 is passed in kernarg.
1097
1098 "HiddenDefaultQueue"
1099 A global address space pointer
1100 to the OpenCL device enqueue
1101 queue that should be used by
1102 the kernel by default is
1103 passed in the kernarg.
1104
1105 "HiddenCompletionAction"
Yaxun Liuc928f2a2017-10-30 14:30:28 +00001106 A global address space pointer
1107 to help link enqueued kernels into
1108 the ancestor tree for determining
1109 when the parent kernel has finished.
Tony Tyef16a45e2017-06-06 20:31:59 +00001110
1111 "ValueType" string Required Kernel argument value type. Only
1112 present if "ValueKind" is
1113 "ByValue". For vector data
1114 types, the value is for the
1115 element type. Values include:
1116
1117 - "Struct"
1118 - "I8"
1119 - "U8"
1120 - "I16"
1121 - "U16"
1122 - "F16"
1123 - "I32"
1124 - "U32"
1125 - "F32"
1126 - "I64"
1127 - "U64"
1128 - "F64"
1129
1130 .. TODO
1131 How can it be determined if a
1132 vector type, and what size
1133 vector?
1134 "PointeeAlign" integer Alignment in bytes of pointee
1135 type for pointer type kernel
1136 argument. Must be a power
1137 of 2. Only present if
1138 "ValueKind" is
1139 "DynamicSharedPointer".
1140 "AddrSpaceQual" string Kernel argument address space
1141 qualifier. Only present if
1142 "ValueKind" is "GlobalBuffer" or
1143 "DynamicSharedPointer". Values
1144 are:
1145
1146 - "Private"
1147 - "Global"
1148 - "Constant"
1149 - "Local"
1150 - "Generic"
1151 - "Region"
1152
1153 .. TODO
1154 Is GlobalBuffer only Global
1155 or Constant? Is
1156 DynamicSharedPointer always
1157 Local? Can HCC allow Generic?
1158 How can Private or Region
1159 ever happen?
1160 "AccQual" string Kernel argument access
1161 qualifier. Only present if
1162 "ValueKind" is "Image" or
1163 "Pipe". Values
1164 are:
1165
1166 - "ReadOnly"
1167 - "WriteOnly"
1168 - "ReadWrite"
1169
1170 .. TODO
1171 Does this apply to
1172 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001173 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001174 performed by the kernel on the
1175 kernel argument. Only present if
1176 "ValueKind" is "GlobalBuffer",
1177 "Image", or "Pipe". This may be
1178 more restrictive than indicated
1179 by "AccQual" to reflect what the
1180 kernel actual does. If not
1181 present then the runtime must
1182 assume what is implied by
1183 "AccQual" and "IsConst". Values
1184 are:
1185
1186 - "ReadOnly"
1187 - "WriteOnly"
1188 - "ReadWrite"
1189
1190 "IsConst" boolean Indicates if the kernel argument
1191 is const qualified. Only present
1192 if "ValueKind" is
1193 "GlobalBuffer".
1194
1195 "IsRestrict" boolean Indicates if the kernel argument
1196 is restrict qualified. Only
1197 present if "ValueKind" is
1198 "GlobalBuffer".
1199
1200 "IsVolatile" boolean Indicates if the kernel argument
1201 is volatile qualified. Only
1202 present if "ValueKind" is
1203 "GlobalBuffer".
1204
1205 "IsPipe" boolean Indicates if the kernel argument
1206 is pipe qualified. Only present
1207 if "ValueKind" is "Pipe".
1208
1209 .. TODO
1210 Can GlobalBuffer be pipe
1211 qualified?
1212 ================= ============== ========= ================================
1213
1214..
1215
1216 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1217 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1218
1219 ============================ ============== ========= =====================
1220 String Key Value Type Required? Description
1221 ============================ ============== ========= =====================
1222 "KernargSegmentSize" integer Required The size in bytes of
1223 the kernarg segment
1224 that holds the values
1225 of the arguments to
1226 the kernel.
1227 "GroupSegmentFixedSize" integer Required The amount of group
1228 segment memory
1229 required by a
1230 work-group in
1231 bytes. This does not
1232 include any
1233 dynamically allocated
1234 group segment memory
1235 that may be added
1236 when the kernel is
1237 dispatched.
1238 "PrivateSegmentFixedSize" integer Required The amount of fixed
1239 private address space
1240 memory required for a
1241 work-item in
Tony Tye07d9f102017-11-10 01:00:54 +00001242 bytes. If the kernel
1243 uses a dynamic call
1244 stack then additional
Tony Tyef16a45e2017-06-06 20:31:59 +00001245 space must be added
1246 to this value for the
1247 call stack.
1248 "KernargSegmentAlign" integer Required The maximum byte
1249 alignment of
1250 arguments in the
1251 kernarg segment. Must
1252 be a power of 2.
1253 "WavefrontSize" integer Required Wavefront size. Must
1254 be a power of 2.
Tony Tye07d9f102017-11-10 01:00:54 +00001255 "NumSGPRs" integer Required Number of scalar
Tony Tyef16a45e2017-06-06 20:31:59 +00001256 registers used by a
1257 wavefront for
1258 GFX6-GFX9. This
1259 includes the special
1260 SGPRs for VCC, Flat
1261 Scratch (GFX7-GFX9)
1262 and XNACK (for
1263 GFX8-GFX9). It does
1264 not include the 16
1265 SGPR added if a trap
1266 handler is
1267 enabled. It is not
1268 rounded up to the
1269 allocation
1270 granularity.
Tony Tye07d9f102017-11-10 01:00:54 +00001271 "NumVGPRs" integer Required Number of vector
Tony Tyef16a45e2017-06-06 20:31:59 +00001272 registers used by
1273 each work-item for
1274 GFX6-GFX9
Tony Tye07d9f102017-11-10 01:00:54 +00001275 "MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tyef16a45e2017-06-06 20:31:59 +00001276 work-group size
1277 supported by the
1278 kernel in work-items.
Tony Tye07d9f102017-11-10 01:00:54 +00001279 Must be >=1 and
1280 consistent with any
1281 non-0 values in
1282 FixedWorkGroupSize.
1283 "FixedWorkGroupSize" sequence of Corresponds to the
1284 3 integers dispatch work-group
1285 size X, Y, Z. If
1286 omitted, defaults to
1287 0, 0, 0. If an
1288 element is non-0 then
1289 the kernel must only
1290 be launched with a
1291 matching corresponding
1292 work-group size.
Tony Tyef16a45e2017-06-06 20:31:59 +00001293 "IsXNACKEnabled" boolean Indicates if the
1294 generated machine
1295 code is capable of
Tony Tye07d9f102017-11-10 01:00:54 +00001296 supporting XNACK. See
1297 :ref:`amdgpu-target-features`.
Tony Tyef16a45e2017-06-06 20:31:59 +00001298 ============================ ============== ========= =====================
1299
1300..
1301
Tony Tyef16a45e2017-06-06 20:31:59 +00001302Kernel Dispatch
1303~~~~~~~~~~~~~~~
1304
1305The HSA architected queuing language (AQL) defines a user space memory interface
1306that can be used to control the dispatch of kernels, in an agent independent
1307way. An agent can have zero or more AQL queues created for it using the ROCm
1308runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1309*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1310mechanics and packet layouts.
1311
1312The packet processor of a kernel agent is responsible for detecting and
1313dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1314packet processor is implemented by the hardware command processor (CP),
1315asynchronous dispatch controller (ADC) and shader processor input controller
1316(SPI).
1317
1318The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1319mode driver to initialize and register the AQL queue with CP.
1320
1321To dispatch a kernel the following actions are performed. This can occur in the
1322CPU host program, or from an HSA kernel executing on a GPU.
1323
13241. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1325 executed is obtained.
13262. A pointer to the kernel descriptor (see
1327 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1328 obtained. It must be for a kernel that is contained in a code object that that
1329 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1330 associated.
13313. Space is allocated for the kernel arguments using the ROCm runtime allocator
1332 for a memory region with the kernarg property for the kernel agent that will
1333 execute the kernel. It must be at least 16 byte aligned.
13344. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00001335 allocation. The layout is defined in the *HSA Programmer's Language Reference*
Tony Tyef16a45e2017-06-06 20:31:59 +00001336 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1337 memory in the same way constant memory is accessed. (Note that the HSA
1338 specification allows an implementation to copy the kernel argument contents to
1339 another location that is accessed by the kernel.)
13405. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1341 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1342 packet. The packet must be set up, and the final write must use an atomic
1343 store release to set the packet kind to ensure the packet contents are
1344 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1345 notify the kernel agent that the AQL queue has been updated. These rules, and
1346 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1347 System Architecture Specification* [HSA]_.
13486. A kernel dispatch packet includes information about the actual dispatch,
1349 such as grid and work-group size, together with information from the code
1350 object about the kernel, such as segment sizes. The ROCm runtime queries on
1351 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001352 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000013537. CP executes micro-code and is responsible for detecting and setting up the
1354 GPU to execute the wavefronts of a kernel dispatch.
13558. CP ensures that when the a wavefront starts executing the kernel machine
1356 code, the scalar general purpose registers (SGPR) and vector general purpose
1357 registers (VGPR) are set up as required by the machine code. The required
1358 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1359 register state is defined in
1360 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13619. The prolog of the kernel machine code (see
1362 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1363 before continuing executing the machine code that corresponds to the kernel.
136410. When the kernel dispatch has completed execution, CP signals the completion
1365 signal specified in the kernel dispatch packet if not 0.
1366
1367.. _amdgpu-amdhsa-memory-spaces:
1368
1369Memory Spaces
1370~~~~~~~~~~~~~
1371
1372The memory space properties are:
1373
1374 .. table:: AMDHSA Memory Spaces
1375 :name: amdgpu-amdhsa-memory-spaces-table
1376
1377 ================= =========== ======== ======= ==================
1378 Memory Space Name HSA Segment Hardware Address NULL Value
1379 Name Name Size
1380 ================= =========== ======== ======= ==================
1381 Private private scratch 32 0x00000000
1382 Local group LDS 32 0xFFFFFFFF
1383 Global global global 64 0x0000000000000000
1384 Constant constant *same as 64 0x0000000000000000
1385 global*
1386 Generic flat flat 64 0x0000000000000000
1387 Region N/A GDS 32 *not implemented
1388 for AMDHSA*
1389 ================= =========== ======== ======= ==================
1390
1391The global and constant memory spaces both use global virtual addresses, which
1392are the same virtual address space used by the CPU. However, some virtual
1393addresses may only be accessible to the CPU, some only accessible by the GPU,
1394and some by both.
1395
1396Using the constant memory space indicates that the data will not change during
1397the execution of the kernel. This allows scalar read instructions to be
1398used. The vector and scalar L1 caches are invalidated of volatile data before
1399each kernel dispatch execution to allow constant memory to change values between
1400kernel dispatches.
1401
1402The local memory space uses the hardware Local Data Store (LDS) which is
1403automatically allocated when the hardware creates work-groups of wavefronts, and
1404freed when all the wavefronts of a work-group have terminated. The data store
1405(DS) instructions can be used to access it.
1406
1407The private memory space uses the hardware scratch memory support. If the kernel
1408uses scratch, then the hardware allocates memory that is accessed using
1409wavefront lane dword (4 byte) interleaving. The mapping used from private
1410address to physical address is:
1411
1412 ``wavefront-scratch-base +
1413 (private-address * wavefront-size * 4) +
1414 (wavefront-lane-id * 4)``
1415
1416There are different ways that the wavefront scratch base address is determined
1417by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1418memory can be accessed in an interleaved manner using buffer instruction with
1419the scratch buffer descriptor and per wave scratch offset, by the scratch
1420instructions, or by flat instructions. If each lane of a wavefront accesses the
1421same private address, the interleaving results in adjacent dwords being accessed
1422and hence requires fewer cache lines to be fetched. Multi-dword access is not
1423supported except by flat and scratch instructions in GFX9.
1424
1425The generic address space uses the hardware flat address support available in
1426GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1427local appertures), that are outside the range of addressible global memory, to
1428map from a flat address to a private or local address.
1429
1430FLAT instructions can take a flat address and access global, private (scratch)
1431and group (LDS) memory depending in if the address is within one of the
1432apperture ranges. Flat access to scratch requires hardware aperture setup and
1433setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1434access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1435(see :ref:`amdgpu-amdhsa-m0`).
1436
1437To convert between a segment address and a flat address the base address of the
1438appertures address can be used. For GFX7-GFX8 these are available in the
1439:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1440Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1441GFX9 the appature base addresses are directly available as inline constant
1442registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1443address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1444which makes it easier to convert from flat to segment or segment to flat.
1445
Tony Tye46d35762017-08-15 20:47:41 +00001446Image and Samplers
1447~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001448
1449Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1450hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1451HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1452enumeration values for the queries that are not trivially deducible from the S#
1453representation.
1454
1455HSA Signals
1456~~~~~~~~~~~
1457
Tony Tye46d35762017-08-15 20:47:41 +00001458HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1459structure allocated in memory accessible from both the CPU and GPU. The
1460structure is defined by the ROCm runtime and subject to change between releases
1461(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001462
1463.. _amdgpu-amdhsa-hsa-aql-queue:
1464
1465HSA AQL Queue
1466~~~~~~~~~~~~~
1467
Tony Tye46d35762017-08-15 20:47:41 +00001468The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001469between releases (see [AMD-ROCm-github]_). For some processors it contains
1470fields needed to implement certain language features such as the flat address
1471aperture bases. It also contains fields used by CP such as managing the
1472allocation of scratch memory.
1473
1474.. _amdgpu-amdhsa-kernel-descriptor:
1475
1476Kernel Descriptor
1477~~~~~~~~~~~~~~~~~
1478
1479A kernel descriptor consists of the information needed by CP to initiate the
1480execution of a kernel, including the entry point address of the machine code
1481that implements the kernel.
1482
1483Kernel Descriptor for GFX6-GFX9
1484+++++++++++++++++++++++++++++++
1485
1486CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1487
1488 .. table:: Kernel Descriptor for GFX6-GFX9
1489 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1490
Tony Tye6baa6d22017-10-18 22:16:55 +00001491 ======= ======= =============================== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001492 Bits Size Field Name Description
Tony Tye6baa6d22017-10-18 22:16:55 +00001493 ======= ======= =============================== ============================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001494 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tyef16a45e2017-06-06 20:31:59 +00001495 address space memory
1496 required for a work-group
1497 in bytes. This does not
1498 include any dynamically
1499 allocated local address
1500 space memory that may be
1501 added when the kernel is
1502 dispatched.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001503 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tyef16a45e2017-06-06 20:31:59 +00001504 private address space
1505 memory required for a
1506 work-item in bytes. If
1507 is_dynamic_callstack is 1
1508 then additional space must
1509 be added to this value for
1510 the call stack.
Tony Tye07d9f102017-11-10 01:00:54 +00001511 127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001512 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tyef16a45e2017-06-06 20:31:59 +00001513 negative) from base
1514 address of kernel
1515 descriptor to kernel's
1516 entry point instruction
1517 which must be 256 byte
1518 aligned.
Tony Tye07d9f102017-11-10 01:00:54 +00001519 223:192 4 bytes MaxFlatWorkGroupSize Maximum flat work-group
1520 size supported by the
1521 kernel in work-items. If
1522 an exact work-group size
1523 is required then must be
1524 omitted or 0 and
1525 ReqdWorkGroupSize* must
1526 be set to non-0.
1527 239:224 2 bytes ReqdWorkGroupSizeX If present and non-0 then
1528 the kernel
1529 must be executed with the
1530 specified work-group size
1531 for X.
1532 255:240 2 bytes ReqdWorkGroupSizeY If present and non-0 then
1533 the kernel
1534 must be executed with the
1535 specified work-group size
1536 for Y.
1537 271:256 2 bytes ReqdWorkGroupSizeZ If present and non-0 then
1538 the kernel
1539 must be executed with the
1540 specified work-group size
1541 for Z.
1542 383:271 14 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001543 bytes
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001544 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001545 program settings used by
1546 CP to set up
1547 ``COMPUTE_PGM_RSRC1``
1548 configuration
1549 register. See
Tony Tye6baa6d22017-10-18 22:16:55 +00001550 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001551 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001552 program settings used by
1553 CP to set up
1554 ``COMPUTE_PGM_RSRC2``
1555 configuration
1556 register. See
1557 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001558 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1559 SGPR user data registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001560 (see
1561 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1562
1563 The total number of SGPR
1564 user data registers
1565 requested must not exceed
1566 16 and match value in
1567 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1568 Any requests beyond 16
1569 will be ignored.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001570 449 1 bit EnableSGPRDispatchPtr *see above*
1571 450 1 bit EnableSGPRQueuePtr *see above*
1572 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1573 452 1 bit EnableSGPRDispatchID *see above*
1574 453 1 bit EnableSGPRFlatScratchInit *see above*
1575 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1576 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1577 should always be 0.
1578 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1579 should always be 0.
1580 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1581 should always be 0.
Tony Tye07d9f102017-11-10 01:00:54 +00001582 462:458 5 bits Reserved, must be 0.
1583 463 1 bit IsXNACKEnabled Indicates if the generated
1584 machine code is capable of
1585 supporting XNACK.
Tony Tye6baa6d22017-10-18 22:16:55 +00001586 511:464 6 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001587 bytes
1588 512 **Total size 64 bytes.**
Tony Tye6baa6d22017-10-18 22:16:55 +00001589 ======= ====================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001590
1591..
1592
1593 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001594 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tyef16a45e2017-06-06 20:31:59 +00001595
Tony Tye3b340612017-06-07 00:46:08 +00001596 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001597 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001598 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001599 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001600 used by each work-item,
1601 granularity is device
1602 specific:
1603
Tony Tye07d9f102017-11-10 01:00:54 +00001604 GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001605 - max_vgpr 1..256
1606 - roundup((max_vgpg + 1)
1607 / 4) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001608
1609 Used by CP to set up
1610 ``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001611 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001612 used by a wavefront,
1613 granularity is device
1614 specific:
1615
Tony Tye07d9f102017-11-10 01:00:54 +00001616 GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001617 - max_sgpr 1..112
1618 - roundup((max_sgpg + 1)
1619 / 8) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001620 GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001621 - max_sgpr 1..112
1622 - roundup((max_sgpg + 1)
1623 / 16) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001624
1625 Includes the special SGPRs
1626 for VCC, Flat Scratch (for
1627 GFX7 onwards) and XNACK
1628 (for GFX8 onwards). It does
1629 not include the 16 SGPR
1630 added if a trap handler is
1631 enabled.
1632
1633 Used by CP to set up
1634 ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001635 11:10 2 bits PRIORITY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001636
1637 Start executing wavefront
1638 at the specified priority.
1639
1640 CP is responsible for
1641 filling in
1642 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001643 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001644 with specified rounding
1645 mode for single (32
1646 bit) floating point
1647 precision floating point
1648 operations.
1649
1650 Floating point rounding
1651 mode values are defined in
1652 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1653
1654 Used by CP to set up
1655 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001656 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001657 with specified rounding
1658 denorm mode for half/double (16
1659 and 64 bit) floating point
1660 precision floating point
1661 operations.
1662
1663 Floating point rounding
1664 mode values are defined in
1665 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1666
1667 Used by CP to set up
1668 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001669 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001670 with specified denorm mode
1671 for single (32
1672 bit) floating point
1673 precision floating point
1674 operations.
1675
1676 Floating point denorm mode
1677 values are defined in
1678 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1679
1680 Used by CP to set up
1681 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001682 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001683 with specified denorm mode
1684 for half/double (16
1685 and 64 bit) floating point
1686 precision floating point
1687 operations.
1688
1689 Floating point denorm mode
1690 values are defined in
1691 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1692
1693 Used by CP to set up
1694 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001695 20 1 bit PRIV Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001696
1697 Start executing wavefront
1698 in privilege trap handler
1699 mode.
1700
1701 CP is responsible for
1702 filling in
1703 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001704 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001705 with DX10 clamp mode
1706 enabled. Used by the vector
Tony Tye6baa6d22017-10-18 22:16:55 +00001707 ALU to force DX10 style
Tony Tyef16a45e2017-06-06 20:31:59 +00001708 treatment of NaN's (when
1709 set, clamp NaN to zero,
1710 otherwise pass NaN
1711 through).
1712
1713 Used by CP to set up
1714 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001715 22 1 bit DEBUG_MODE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001716
1717 Start executing wavefront
1718 in single step mode.
1719
1720 CP is responsible for
1721 filling in
1722 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001723 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001724 with IEEE mode
1725 enabled. Floating point
1726 opcodes that support
1727 exception flag gathering
1728 will quiet and propagate
1729 signaling-NaN inputs per
1730 IEEE 754-2008. Min_dx10 and
1731 max_dx10 become IEEE
1732 754-2008 compliant due to
1733 signaling-NaN propagation
1734 and quieting.
1735
1736 Used by CP to set up
1737 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001738 24 1 bit BULKY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001739
1740 Only one work-group allowed
1741 to execute on a compute
1742 unit.
1743
1744 CP is responsible for
1745 filling in
1746 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001747 25 1 bit CDBG_USER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001748
1749 Flag that can be used to
1750 control debugging code.
1751
1752 CP is responsible for
1753 filling in
1754 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye07d9f102017-11-10 01:00:54 +00001755 26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001756 Reserved, must be 0.
1757 GFX9
1758 Wavefront starts execution
1759 with specified fp16 overflow
1760 mode.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001761
Tony Tye6baa6d22017-10-18 22:16:55 +00001762 - If 0, fp16 overflow generates
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001763 +/-INF values.
Tony Tye6baa6d22017-10-18 22:16:55 +00001764 - If 1, fp16 overflow that is the
1765 result of an +/-INF input value
1766 or divide by 0 produces a +/-INF,
1767 otherwise clamps computed
1768 overflow to +/-MAX_FP16 as
1769 appropriate.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001770
1771 Used by CP to set up
1772 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye6baa6d22017-10-18 22:16:55 +00001773 31:27 5 bits Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001774 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001775 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001776
1777..
1778
1779 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1780 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1781
Tony Tye3b340612017-06-07 00:46:08 +00001782 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001783 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001784 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001785 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1786 _WAVE_OFFSET SGPR wave scratch offset
Tony Tyef16a45e2017-06-06 20:31:59 +00001787 system register (see
1788 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1789
1790 Used by CP to set up
1791 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001792 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tyef16a45e2017-06-06 20:31:59 +00001793 user data registers
1794 requested. This number must
1795 match the number of user
1796 data registers enabled.
1797
1798 Used by CP to set up
1799 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001800 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tyef16a45e2017-06-06 20:31:59 +00001801 TRAP instruction which
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00001802 requires a trap handler to
Tony Tyef16a45e2017-06-06 20:31:59 +00001803 be enabled.
1804
1805 CP sets
1806 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1807 if the runtime has
1808 installed a trap handler
1809 regardless of the setting
1810 of this field.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001811 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001812 system SGPR register for
1813 the work-group id in the X
1814 dimension (see
1815 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1816
1817 Used by CP to set up
1818 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001819 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001820 system SGPR register for
1821 the work-group id in the Y
1822 dimension (see
1823 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1824
1825 Used by CP to set up
1826 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001827 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001828 system SGPR register for
1829 the work-group id in the Z
1830 dimension (see
1831 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1832
1833 Used by CP to set up
1834 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001835 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001836 system SGPR register for
1837 work-group information (see
1838 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1839
1840 Used by CP to set up
1841 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001842 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001843 VGPR system registers used
1844 for the work-item ID.
1845 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1846 defines the values.
1847
1848 Used by CP to set up
1849 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001850 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001851
1852 Wavefront starts execution
1853 with address watch
1854 exceptions enabled which
1855 are generated when L1 has
1856 witnessed a thread access
1857 an *address of
1858 interest*.
1859
1860 CP is responsible for
1861 filling in the address
1862 watch bit in
1863 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1864 according to what the
1865 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001866 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001867
1868 Wavefront starts execution
1869 with memory violation
1870 exceptions exceptions
1871 enabled which are generated
1872 when a memory violation has
1873 occurred for this wave from
1874 L1 or LDS
1875 (write-to-read-only-memory,
1876 mis-aligned atomic, LDS
1877 address out of range,
1878 illegal address, etc.).
1879
1880 CP sets the memory
1881 violation bit in
1882 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1883 according to what the
1884 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001885 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001886
1887 CP uses the rounded value
1888 from the dispatch packet,
1889 not this value, as the
1890 dispatch may contain
1891 dynamically allocated group
1892 segment memory. CP writes
1893 directly to
1894 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1895
1896 Amount of group segment
1897 (LDS) to allocate for each
1898 work-group. Granularity is
1899 device specific:
1900
1901 GFX6:
1902 roundup(lds-size / (64 * 4))
1903 GFX7-GFX9:
1904 roundup(lds-size / (128 * 4))
1905
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001906 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1907 _INVALID_OPERATION with specified exceptions
Tony Tyef16a45e2017-06-06 20:31:59 +00001908 enabled.
1909
1910 Used by CP to set up
1911 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1912 (set from bits 0..6).
1913
1914 IEEE 754 FP Invalid
1915 Operation
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001916 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1917 _SOURCE input operands is a
Tony Tyef16a45e2017-06-06 20:31:59 +00001918 denormal number
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001919 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1920 _DIVISION_BY_ZERO Zero
1921 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1922 _OVERFLOW
1923 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1924 _UNDERFLOW
1925 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1926 _INEXACT
1927 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1928 _ZERO (rcp_iflag_f32 instruction
Tony Tyef16a45e2017-06-06 20:31:59 +00001929 only)
Tony Tye6baa6d22017-10-18 22:16:55 +00001930 31 1 bit Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001931 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001932 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001933
1934..
1935
1936 .. table:: Floating Point Rounding Mode Enumeration Values
1937 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1938
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001939 ====================================== ===== ==============================
1940 Enumeration Name Value Description
1941 ====================================== ===== ==============================
1942 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1943 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1944 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1945 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1946 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001947
1948..
1949
1950 .. table:: Floating Point Denorm Mode Enumeration Values
1951 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1952
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001953 ====================================== ===== ==============================
1954 Enumeration Name Value Description
1955 ====================================== ===== ==============================
1956 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1957 Denorms
1958 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1959 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1960 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1961 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001962
1963..
1964
1965 .. table:: System VGPR Work-Item ID Enumeration Values
1966 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1967
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001968 ======================================== ===== ============================
1969 Enumeration Name Value Description
1970 ======================================== ===== ============================
1971 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1972 ID.
1973 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1974 dimensions ID.
1975 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1976 dimensions ID.
1977 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1978 ======================================== ===== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001979
1980.. _amdgpu-amdhsa-initial-kernel-execution-state:
1981
1982Initial Kernel Execution State
1983~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1984
1985This section defines the register state that will be set up by the packet
1986processor prior to the start of execution of every wavefront. This is limited by
1987the constraints of the hardware controllers of CP/ADC/SPI.
1988
1989The order of the SGPR registers is defined, but the compiler can specify which
1990ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1991fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1992for enabled registers are dense starting at SGPR0: the first enabled register is
1993SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1994an SGPR number.
1995
1996The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1997all waves of the grid. It is possible to specify more than 16 User SGPRs using
1998the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1999initialized. These are then immediately followed by the System SGPRs that are
2000set up by ADC/SPI and can have different values for each wave of the grid
2001dispatch.
2002
2003SGPR register initial state is defined in
2004:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2005
2006 .. table:: SGPR Register Set Up Order
2007 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2008
2009 ========== ========================== ====== ==============================
2010 SGPR Order Name Number Description
2011 (kernel descriptor enable of
2012 field) SGPRs
2013 ========== ========================== ====== ==============================
2014 First Private Segment Buffer 4 V# that can be used, together
2015 (enable_sgpr_private with Scratch Wave Offset as an
2016 _segment_buffer) offset, to access the private
2017 memory space using a segment
2018 address.
2019
2020 CP uses the value provided by
2021 the runtime.
2022 then Dispatch Ptr 2 64 bit address of AQL dispatch
2023 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2024 actually executing.
2025 then Queue Ptr 2 64 bit address of amd_queue_t
2026 (enable_sgpr_queue_ptr) object for AQL queue on which
2027 the dispatch packet was
2028 queued.
2029 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2030 (enable_sgpr_kernarg segment. This is directly
2031 _segment_ptr) copied from the
2032 kernarg_address in the kernel
2033 dispatch packet.
2034
2035 Having CP load it once avoids
2036 loading it at the beginning of
2037 every wavefront.
2038 then Dispatch Id 2 64 bit Dispatch ID of the
2039 (enable_sgpr_dispatch_id) dispatch packet being
2040 executed.
2041 then Flat Scratch Init 2 This is 2 SGPRs:
2042 (enable_sgpr_flat_scratch
2043 _init) GFX6
2044 Not supported.
2045 GFX7-GFX8
2046 The first SGPR is a 32 bit
2047 byte offset from
2048 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2049 to per SPI base of memory
2050 for scratch for the queue
2051 executing the kernel
2052 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00002053 from the runtime. (The
2054 Scratch Segment Buffer base
2055 address is
2056 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2057 plus this offset.) The value
2058 of Scratch Wave Offset must
2059 be added to this offset by
2060 the kernel machine code,
2061 right shifted by 8, and
2062 moved to the FLAT_SCRATCH_HI
2063 SGPR register.
2064 FLAT_SCRATCH_HI corresponds
2065 to SGPRn-4 on GFX7, and
2066 SGPRn-6 on GFX8 (where SGPRn
2067 is the highest numbered SGPR
2068 allocated to the wave).
2069 FLAT_SCRATCH_HI is
2070 multiplied by 256 (as it is
2071 in units of 256 bytes) and
2072 added to
2073 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2074 to calculate the per wave
2075 FLAT SCRATCH BASE in flat
2076 memory instructions that
2077 access the scratch
2078 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002079
2080 The second SGPR is 32 bit
2081 byte size of a single
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002082 work-item's scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002083 usage. CP obtains this from
2084 the runtime, and it is
2085 always a multiple of DWORD.
2086 CP checks that the value in
2087 the kernel dispatch packet
2088 Private Segment Byte Size is
2089 not larger, and requests the
2090 runtime to increase the
2091 queue's scratch size if
2092 necessary. The kernel code
2093 must move it to
2094 FLAT_SCRATCH_LO which is
2095 SGPRn-3 on GFX7 and SGPRn-5
2096 on GFX8. FLAT_SCRATCH_LO is
2097 used as the FLAT SCRATCH
2098 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002099 instructions. Having CP load
2100 it once avoids loading it at
2101 the beginning of every
Tony Tyef59d0712017-11-10 20:51:43 +00002102 wavefront.
2103 GFX9
2104 This is the
Tony Tye46d35762017-08-15 20:47:41 +00002105 64 bit base address of the
2106 per SPI scratch backing
2107 memory managed by SPI for
2108 the queue executing the
2109 kernel dispatch. CP obtains
2110 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002111 divides it if there are
2112 multiple Shader Arrays each
2113 with its own SPI). The value
2114 of Scratch Wave Offset must
2115 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002116 machine code and the result
2117 moved to the FLAT_SCRATCH
2118 SGPR which is SGPRn-6 and
2119 SGPRn-5. It is used as the
2120 FLAT SCRATCH BASE in flat
Tony Tyef59d0712017-11-10 20:51:43 +00002121 memory instructions.
2122 then Private Segment Size 1 The 32 bit byte size of a
2123 (enable_sgpr_private single
2124 work-item's
2125 scratch_segment_size) memory
2126 allocation. This is the
2127 value from the kernel
2128 dispatch packet Private
2129 Segment Byte Size rounded up
2130 by CP to a multiple of
2131 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002132
2133 Having CP load it once avoids
2134 loading it at the beginning of
2135 every wavefront.
2136
2137 This is not used for
2138 GFX7-GFX8 since it is the same
2139 value as the second SGPR of
2140 Flat Scratch Init. However, it
2141 may be needed for GFX9 which
2142 changes the meaning of the
2143 Flat Scratch Init value.
2144 then Grid Work-Group Count X 1 32 bit count of the number of
2145 (enable_sgpr_grid work-groups in the X dimension
2146 _workgroup_count_X) for the grid being
2147 executed. Computed from the
2148 fields in the kernel dispatch
2149 packet as ((grid_size.x +
2150 workgroup_size.x - 1) /
2151 workgroup_size.x).
2152 then Grid Work-Group Count Y 1 32 bit count of the number of
2153 (enable_sgpr_grid work-groups in the Y dimension
2154 _workgroup_count_Y && for the grid being
2155 less than 16 previous executed. Computed from the
2156 SGPRs) fields in the kernel dispatch
2157 packet as ((grid_size.y +
2158 workgroup_size.y - 1) /
2159 workgroupSize.y).
2160
2161 Only initialized if <16
2162 previous SGPRs initialized.
2163 then Grid Work-Group Count Z 1 32 bit count of the number of
2164 (enable_sgpr_grid work-groups in the Z dimension
2165 _workgroup_count_Z && for the grid being
2166 less than 16 previous executed. Computed from the
2167 SGPRs) fields in the kernel dispatch
2168 packet as ((grid_size.z +
2169 workgroup_size.z - 1) /
2170 workgroupSize.z).
2171
2172 Only initialized if <16
2173 previous SGPRs initialized.
2174 then Work-Group Id X 1 32 bit work-group id in X
2175 (enable_sgpr_workgroup_id dimension of grid for
2176 _X) wavefront.
2177 then Work-Group Id Y 1 32 bit work-group id in Y
2178 (enable_sgpr_workgroup_id dimension of grid for
2179 _Y) wavefront.
2180 then Work-Group Id Z 1 32 bit work-group id in Z
2181 (enable_sgpr_workgroup_id dimension of grid for
2182 _Z) wavefront.
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002183 then Work-Group Info 1 {first_wave, 14'b0000,
Tony Tyef16a45e2017-06-06 20:31:59 +00002184 (enable_sgpr_workgroup ordered_append_term[10:0],
2185 _info) threadgroup_size_in_waves[5:0]}
2186 then Scratch Wave Offset 1 32 bit byte offset from base
2187 (enable_sgpr_private of scratch base of queue
2188 _segment_wave_offset) executing the kernel
2189 dispatch. Must be used as an
2190 offset with Private
2191 segment address when using
2192 Scratch Segment Buffer. It
2193 must be used to set up FLAT
2194 SCRATCH for flat addressing
2195 (see
2196 :ref:`amdgpu-amdhsa-flat-scratch`).
2197 ========== ========================== ====== ==============================
2198
2199The order of the VGPR registers is defined, but the compiler can specify which
2200ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2201fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2202for enabled registers are dense starting at VGPR0: the first enabled register is
2203VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2204VGPR number.
2205
2206VGPR register initial state is defined in
2207:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2208
2209 .. table:: VGPR Register Set Up Order
2210 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2211
2212 ========== ========================== ====== ==============================
2213 VGPR Order Name Number Description
2214 (kernel descriptor enable of
2215 field) VGPRs
2216 ========== ========================== ====== ==============================
2217 First Work-Item Id X 1 32 bit work item id in X
2218 (Always initialized) dimension of work-group for
2219 wavefront lane.
2220 then Work-Item Id Y 1 32 bit work item id in Y
2221 (enable_vgpr_workitem_id dimension of work-group for
2222 > 0) wavefront lane.
2223 then Work-Item Id Z 1 32 bit work item id in Z
2224 (enable_vgpr_workitem_id dimension of work-group for
2225 > 1) wavefront lane.
2226 ========== ========================== ====== ==============================
2227
2228The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2229
22301. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2231 registers.
22322. Work-group Id registers X, Y, Z are set by ADC which supports any
2233 combination including none.
22343. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2235 cannot included with the flat scratch init value which is per queue.
22364. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2237 or (X, Y, Z).
2238
2239Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2240value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2241
2242The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye07d9f102017-11-10 01:00:54 +00002243has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tyef16a45e2017-06-06 20:31:59 +00002244instructions (GFX9).
2245
2246If buffer operations are used then the compiler can generate a V# with the
2247following properties:
2248
2249* base address of 0
2250* no swizzle
2251* ATC: 1 if IOMMU present (such as APU)
2252* ptr64: 1
2253* MTYPE set to support memory coherence that matches the runtime (such as CC for
2254 APU and NC for dGPU).
2255
2256.. _amdgpu-amdhsa-kernel-prolog:
2257
2258Kernel Prolog
2259~~~~~~~~~~~~~
2260
2261.. _amdgpu-amdhsa-m0:
2262
2263M0
2264++
2265
2266GFX6-GFX8
2267 The M0 register must be initialized with a value at least the total LDS size
2268 if the kernel may access LDS via DS or flat operations. Total LDS size is
2269 available in dispatch packet. For M0, it is also possible to use maximum
2270 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2271 GFX7-GFX8).
2272GFX9
2273 The M0 register is not used for range checking LDS accesses and so does not
2274 need to be initialized in the prolog.
2275
2276.. _amdgpu-amdhsa-flat-scratch:
2277
2278Flat Scratch
2279++++++++++++
2280
2281If the kernel may use flat operations to access scratch memory, the prolog code
2282must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2283are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2284Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2285
2286GFX6
2287 Flat scratch is not supported.
2288
Tony Tye07d9f102017-11-10 01:00:54 +00002289GFX7-GFX8
Tony Tyef16a45e2017-06-06 20:31:59 +00002290 1. The low word of Flat Scratch Init is 32 bit byte offset from
2291 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2292 being managed by SPI for the queue executing the kernel dispatch. This is
2293 the same value used in the Scratch Segment Buffer V# base address. The
2294 prolog must add the value of Scratch Wave Offset to get the wave's byte
2295 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2296 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2297 by 8 before moving into FLAT_SCRATCH_LO.
2298 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2299 work-items scratch memory usage. This is directly loaded from the kernel
2300 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2301 DWORD. Having CP load it once avoids loading it at the beginning of every
2302 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2303 SIZE.
Tony Tyef59d0712017-11-10 20:51:43 +00002304
Tony Tyef16a45e2017-06-06 20:31:59 +00002305GFX9
2306 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2307 memory being managed by SPI for the queue executing the kernel dispatch. The
2308 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2309 pair for use as the flat scratch base in flat memory instructions.
2310
2311.. _amdgpu-amdhsa-memory-model:
2312
2313Memory Model
2314~~~~~~~~~~~~
2315
2316This section describes the mapping of LLVM memory model onto AMDGPU machine code
2317(see :ref:`memmodel`). *The implementation is WIP.*
2318
2319.. TODO
2320 Update when implementation complete.
2321
Tony Tyef16a45e2017-06-06 20:31:59 +00002322The AMDGPU backend supports the memory synchronization scopes specified in
2323:ref:`amdgpu-memory-scopes`.
2324
2325The code sequences used to implement the memory model are defined in table
2326:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2327
2328The sequences specify the order of instructions that a single thread must
2329execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2330to other memory instructions executed by the same thread. This allows them to be
2331moved earlier or later which can allow them to be combined with other instances
2332of the same instruction, or hoisted/sunk out of loops to improve
2333performance. Only the instructions related to the memory model are given;
2334additional ``s_waitcnt`` instructions are required to ensure registers are
2335defined before being used. These may be able to be combined with the memory
2336model ``s_waitcnt`` instructions as described above.
2337
Tony Tye6baa6d22017-10-18 22:16:55 +00002338The AMDGPU backend supports the following memory models:
2339
2340 HSA Memory Model [HSA]_
2341 The HSA memory model uses a single happens-before relation for all address
2342 spaces (see :ref:`amdgpu-address-spaces`).
2343 OpenCL Memory Model [OpenCL]_
2344 The OpenCL memory model which has separate happens-before relations for the
2345 global and local address spaces. Only a fence specifying both global and
2346 local address space, and seq_cst instructions join the relationships. Since
2347 the LLVM ``memfence`` instruction does not allow an address space to be
2348 specified the OpenCL fence has to convervatively assume both local and
2349 global address space was specified. However, optimizations can often be
2350 done to eliminate the additional ``s_waitcnt`` instructions when there are
2351 no intervening memory instructions which access the corresponding address
2352 space. The code sequences in the table indicate what can be omitted for the
2353 OpenCL memory. The target triple environment is used to determine if the
2354 source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tyef16a45e2017-06-06 20:31:59 +00002355
2356``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2357operations.
2358
2359``buffer/global/flat_load/store/atomic`` instructions to global memory are
2360termed vector memory operations.
2361
2362For GFX6-GFX9:
2363
2364* Each agent has multiple compute units (CU).
2365* Each CU has multiple SIMDs that execute wavefronts.
2366* The wavefronts for a single work-group are executed in the same CU but may be
2367 executed by different SIMDs.
2368* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2369 executing on it.
2370* All LDS operations of a CU are performed as wavefront wide operations in a
2371 global order and involve no caching. Completion is reported to a wavefront in
2372 execution order.
2373* The LDS memory has multiple request queues shared by the SIMDs of a
2374 CU. Therefore, the LDS operations performed by different waves of a work-group
2375 can be reordered relative to each other, which can result in reordering the
2376 visibility of vector memory operations with respect to LDS operations of other
2377 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002378 ensure synchronization between LDS operations and vector memory operations
Tony Tyef16a45e2017-06-06 20:31:59 +00002379 between waves of a work-group, but not between operations performed by the
2380 same wavefront.
2381* The vector memory operations are performed as wavefront wide operations and
2382 completion is reported to a wavefront in execution order. The exception is
Tony Tye07d9f102017-11-10 01:00:54 +00002383 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tyef16a45e2017-06-06 20:31:59 +00002384 vector memory order if they access LDS memory, and out of LDS operation order
2385 if they access global memory.
Tony Tye6baa6d22017-10-18 22:16:55 +00002386* The vector memory operations access a single vector L1 cache shared by all
2387 SIMDs a CU. Therefore, no special action is required for coherence between the
2388 lanes of a single wavefront, or for coherence between wavefronts in the same
2389 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2390 executing in different work-groups as they may be executing on different CUs.
Tony Tyef16a45e2017-06-06 20:31:59 +00002391* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2392 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2393 scalar operations are used in a restricted way so do not impact the memory
2394 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2395* The vector and scalar memory operations use an L2 cache shared by all CUs on
2396 the same agent.
2397* The L2 cache has independent channels to service disjoint ranges of virtual
2398 addresses.
2399* Each CU has a separate request queue per channel. Therefore, the vector and
2400 scalar memory operations performed by waves executing in different work-groups
2401 (which may be executing on different CUs) of an agent can be reordered
2402 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002403 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002404 previous vector memory operation has completed before executing a subsequent
2405 vector memory or LDS operation and so can be used to meet the requirements of
2406 acquire and release.
2407* The L2 cache can be kept coherent with other agents on some targets, or ranges
2408 of virtual addresses can be set up to bypass it to ensure system coherence.
2409
Tony Tye07d9f102017-11-10 01:00:54 +00002410Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tyef16a45e2017-06-06 20:31:59 +00002411or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2412memory, atomic memory orderings are not meaningful and all accesses are treated
2413as non-atomic.
2414
2415Constant address space uses ``buffer/global_load`` instructions (or equivalent
2416scalar memory instructions). Since the constant address space contents do not
2417change during the execution of a kernel dispatch it is not legal to perform
2418stores, and atomic memory orderings are not meaningful and all access are
2419treated as non-atomic.
2420
2421A memory synchronization scope wider than work-group is not meaningful for the
2422group (LDS) address space and is treated as work-group.
2423
2424The memory model does not support the region address space which is treated as
2425non-atomic.
2426
2427Acquire memory ordering is not meaningful on store atomic instructions and is
2428treated as non-atomic.
2429
2430Release memory ordering is not meaningful on load atomic instructions and is
2431treated a non-atomic.
2432
2433Acquire-release memory ordering is not meaningful on load or store atomic
2434instructions and is treated as acquire and release respectively.
2435
2436AMDGPU backend only uses scalar memory operations to access memory that is
2437proven to not change during the execution of the kernel dispatch. This includes
2438constant address space and global address space for program scope const
2439variables. Therefore the kernel machine code does not have to maintain the
2440scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2441and vector L1 caches are invalidated between kernel dispatches by CP since
2442constant address space data may change between kernel dispatch executions. See
2443:ref:`amdgpu-amdhsa-memory-spaces`.
2444
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002445The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002446case the AMDGPU backend ensures the memory location used to spill is never
2447accessed by vector memory operations at the same time. If scalar writes are used
2448then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2449return since the locations may be used for vector memory instructions by a
2450future wave that uses the same scratch area, or a function call that creates a
2451frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2452as all scalar writes are write-before-read in the same thread.
2453
Tony Tye6baa6d22017-10-18 22:16:55 +00002454Scratch backing memory (which is used for the private address space)
2455is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2456address space is only accessed by a single thread, and is always
2457write-before-read, there is never a need to invalidate these entries from the L1
2458cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2459volatile cache lines.
Tony Tyef16a45e2017-06-06 20:31:59 +00002460
2461On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye6baa6d22017-10-18 22:16:55 +00002462to invalidate the L2 cache. This also causes it to be treated as
2463non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2464(cache coherent) and so the L2 cache will coherent with the CPU and other
2465agents.
Tony Tyef16a45e2017-06-06 20:31:59 +00002466
2467 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2468 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2469
Tony Tye6baa6d22017-10-18 22:16:55 +00002470 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002471 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2472 Ordering Sync Scope Address
2473 Space
Tony Tye6baa6d22017-10-18 22:16:55 +00002474 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002475 **Non-Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002476 -----------------------------------------------------------------------------------
2477 load *none* *none* - global - !volatile & !nontemporal
2478 - generic
2479 - private 1. buffer/global/flat_load
2480 - constant
2481 - volatile & !nontemporal
2482
Tony Tyef16a45e2017-06-06 20:31:59 +00002483 1. buffer/global/flat_load
2484 glc=1
Tony Tye6baa6d22017-10-18 22:16:55 +00002485
2486 - nontemporal
2487
2488 1. buffer/global/flat_load
2489 glc=1 slc=1
2490
Tony Tyef16a45e2017-06-06 20:31:59 +00002491 load *none* *none* - local 1. ds_load
Tony Tye6baa6d22017-10-18 22:16:55 +00002492 store *none* *none* - global - !nontemporal
Tony Tyef16a45e2017-06-06 20:31:59 +00002493 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002494 - private 1. buffer/global/flat_store
2495 - constant
2496 - nontemporal
2497
2498 1. buffer/global/flat_stote
2499 glc=1 slc=1
2500
Tony Tyef16a45e2017-06-06 20:31:59 +00002501 store *none* *none* - local 1. ds_store
2502 **Unordered Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002503 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002504 load atomic unordered *any* *any* *Same as non-atomic*.
2505 store atomic unordered *any* *any* *Same as non-atomic*.
2506 atomicrmw unordered *any* *any* *Same as monotonic
2507 atomic*.
2508 **Monotonic Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002509 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002510 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2511 - wavefront - generic
2512 - workgroup
2513 load atomic monotonic - singlethread - local 1. ds_load
2514 - wavefront
2515 - workgroup
2516 load atomic monotonic - agent - global 1. buffer/global/flat_load
2517 - system - generic glc=1
2518 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2519 - wavefront - generic
2520 - workgroup
2521 - agent
2522 - system
2523 store atomic monotonic - singlethread - local 1. ds_store
2524 - wavefront
2525 - workgroup
2526 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2527 - wavefront - generic
2528 - workgroup
2529 - agent
2530 - system
2531 atomicrmw monotonic - singlethread - local 1. ds_atomic
2532 - wavefront
2533 - workgroup
2534 **Acquire Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002535 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002536 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2537 - wavefront - local
2538 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002539 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2540 load atomic acquire - workgroup - local 1. ds_load
2541 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002542
Tony Tye6baa6d22017-10-18 22:16:55 +00002543 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002544 - Must happen before
2545 any following
2546 global/generic
2547 load/load
2548 atomic/store/store
2549 atomic/atomicrmw.
2550 - Ensures any
2551 following global
2552 data read is no
2553 older than the load
2554 atomic value being
2555 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00002556 load atomic acquire - workgroup - generic 1. flat_load
2557 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002558
Tony Tye6baa6d22017-10-18 22:16:55 +00002559 - If OpenCL, omit.
2560 - Must happen before
2561 any following
2562 global/generic
2563 load/load
2564 atomic/store/store
2565 atomic/atomicrmw.
2566 - Ensures any
2567 following global
2568 data read is no
2569 older than the load
2570 atomic value being
2571 acquired.
2572 load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tyef16a45e2017-06-06 20:31:59 +00002573 - system glc=1
2574 2. s_waitcnt vmcnt(0)
2575
2576 - Must happen before
2577 following
2578 buffer_wbinvl1_vol.
2579 - Ensures the load
2580 has completed
2581 before invalidating
2582 the cache.
2583
2584 3. buffer_wbinvl1_vol
2585
2586 - Must happen before
2587 any following
2588 global/generic
2589 load/load
2590 atomic/atomicrmw.
2591 - Ensures that
2592 following
2593 loads will not see
2594 stale global data.
2595
2596 load atomic acquire - agent - generic 1. flat_load glc=1
2597 - system 2. s_waitcnt vmcnt(0) &
2598 lgkmcnt(0)
2599
2600 - If OpenCL omit
2601 lgkmcnt(0).
2602 - Must happen before
2603 following
2604 buffer_wbinvl1_vol.
2605 - Ensures the flat_load
2606 has completed
2607 before invalidating
2608 the cache.
2609
2610 3. buffer_wbinvl1_vol
2611
2612 - Must happen before
2613 any following
2614 global/generic
2615 load/load
2616 atomic/atomicrmw.
2617 - Ensures that
2618 following loads
2619 will not see stale
2620 global data.
2621
2622 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2623 - wavefront - local
2624 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002625 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2626 atomicrmw acquire - workgroup - local 1. ds_atomic
2627 2. waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002628
Tony Tye6baa6d22017-10-18 22:16:55 +00002629 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002630 - Must happen before
2631 any following
2632 global/generic
2633 load/load
2634 atomic/store/store
2635 atomic/atomicrmw.
2636 - Ensures any
2637 following global
2638 data read is no
2639 older than the
2640 atomicrmw value
2641 being acquired.
2642
Tony Tye6baa6d22017-10-18 22:16:55 +00002643 atomicrmw acquire - workgroup - generic 1. flat_atomic
2644 2. waitcnt lgkmcnt(0)
2645
2646 - If OpenCL, omit.
2647 - Must happen before
2648 any following
2649 global/generic
2650 load/load
2651 atomic/store/store
2652 atomic/atomicrmw.
2653 - Ensures any
2654 following global
2655 data read is no
2656 older than the
2657 atomicrmw value
2658 being acquired.
2659
2660 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002661 - system 2. s_waitcnt vmcnt(0)
2662
2663 - Must happen before
2664 following
2665 buffer_wbinvl1_vol.
2666 - Ensures the
2667 atomicrmw has
2668 completed before
2669 invalidating the
2670 cache.
2671
2672 3. buffer_wbinvl1_vol
2673
2674 - Must happen before
2675 any following
2676 global/generic
2677 load/load
2678 atomic/atomicrmw.
2679 - Ensures that
2680 following loads
2681 will not see stale
2682 global data.
2683
2684 atomicrmw acquire - agent - generic 1. flat_atomic
2685 - system 2. s_waitcnt vmcnt(0) &
2686 lgkmcnt(0)
2687
2688 - If OpenCL, omit
2689 lgkmcnt(0).
2690 - Must happen before
2691 following
2692 buffer_wbinvl1_vol.
2693 - Ensures the
2694 atomicrmw has
2695 completed before
2696 invalidating the
2697 cache.
2698
2699 3. buffer_wbinvl1_vol
2700
2701 - Must happen before
2702 any following
2703 global/generic
2704 load/load
2705 atomic/atomicrmw.
2706 - Ensures that
2707 following loads
2708 will not see stale
2709 global data.
2710
2711 fence acquire - singlethread *none* *none*
2712 - wavefront
2713 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2714
2715 - If OpenCL and
2716 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002717 not generic, omit.
2718 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002719 currently has no
2720 address space on
2721 the fence need to
2722 conservatively
2723 always generate. If
2724 fence had an
2725 address space then
2726 set to address
2727 space of OpenCL
2728 fence flag, or to
2729 generic if both
2730 local and global
2731 flags are
2732 specified.
2733 - Must happen after
2734 any preceding
2735 local/generic load
2736 atomic/atomicrmw
2737 with an equal or
2738 wider sync scope
2739 and memory ordering
2740 stronger than
2741 unordered (this is
2742 termed the
2743 fence-paired-atomic).
2744 - Must happen before
2745 any following
2746 global/generic
2747 load/load
2748 atomic/store/store
2749 atomic/atomicrmw.
2750 - Ensures any
2751 following global
2752 data read is no
2753 older than the
2754 value read by the
2755 fence-paired-atomic.
2756
Tony Tye6baa6d22017-10-18 22:16:55 +00002757 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2758 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002759
2760 - If OpenCL and
2761 address space is
2762 not generic, omit
2763 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00002764 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002765 currently has no
2766 address space on
2767 the fence need to
2768 conservatively
2769 always generate
2770 (see comment for
2771 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002772 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002773 separate s_waitcnt
2774 vmcnt(0) and
2775 s_waitcnt
2776 lgkmcnt(0) to allow
2777 them to be
2778 independently moved
2779 according to the
2780 following rules.
2781 - s_waitcnt vmcnt(0)
2782 must happen after
2783 any preceding
2784 global/generic load
2785 atomic/atomicrmw
2786 with an equal or
2787 wider sync scope
2788 and memory ordering
2789 stronger than
2790 unordered (this is
2791 termed the
2792 fence-paired-atomic).
2793 - s_waitcnt lgkmcnt(0)
2794 must happen after
2795 any preceding
Tony Tye6baa6d22017-10-18 22:16:55 +00002796 local/generic load
Tony Tyef16a45e2017-06-06 20:31:59 +00002797 atomic/atomicrmw
2798 with an equal or
2799 wider sync scope
2800 and memory ordering
2801 stronger than
2802 unordered (this is
2803 termed the
2804 fence-paired-atomic).
2805 - Must happen before
2806 the following
2807 buffer_wbinvl1_vol.
2808 - Ensures that the
2809 fence-paired atomic
2810 has completed
2811 before invalidating
2812 the
2813 cache. Therefore
2814 any following
2815 locations read must
2816 be no older than
2817 the value read by
2818 the
2819 fence-paired-atomic.
2820
2821 2. buffer_wbinvl1_vol
2822
Tony Tye6baa6d22017-10-18 22:16:55 +00002823 - Must happen before any
2824 following global/generic
Tony Tyef16a45e2017-06-06 20:31:59 +00002825 load/load
2826 atomic/store/store
2827 atomic/atomicrmw.
2828 - Ensures that
2829 following loads
2830 will not see stale
2831 global data.
2832
2833 **Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002834 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002835 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2836 - wavefront - local
2837 - generic
2838 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002839
2840 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002841 - Must happen after
2842 any preceding
2843 local/generic
2844 load/store/load
2845 atomic/store
2846 atomic/atomicrmw.
2847 - Must happen before
2848 the following
2849 store.
2850 - Ensures that all
2851 memory operations
2852 to local have
2853 completed before
2854 performing the
2855 store that is being
2856 released.
2857
2858 2. buffer/global/flat_store
2859 store atomic release - workgroup - local 1. ds_store
Tony Tye6baa6d22017-10-18 22:16:55 +00002860 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2861
2862 - If OpenCL, omit.
2863 - Must happen after
2864 any preceding
2865 local/generic
2866 load/store/load
2867 atomic/store
2868 atomic/atomicrmw.
2869 - Must happen before
2870 the following
2871 store.
2872 - Ensures that all
2873 memory operations
2874 to local have
2875 completed before
2876 performing the
2877 store that is being
2878 released.
2879
2880 2. flat_store
2881 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2882 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002883
2884 - If OpenCL, omit
2885 lgkmcnt(0).
2886 - Could be split into
2887 separate s_waitcnt
2888 vmcnt(0) and
2889 s_waitcnt
2890 lgkmcnt(0) to allow
2891 them to be
2892 independently moved
2893 according to the
2894 following rules.
2895 - s_waitcnt vmcnt(0)
2896 must happen after
2897 any preceding
2898 global/generic
2899 load/store/load
2900 atomic/store
2901 atomic/atomicrmw.
2902 - s_waitcnt lgkmcnt(0)
2903 must happen after
2904 any preceding
2905 local/generic
2906 load/store/load
2907 atomic/store
2908 atomic/atomicrmw.
2909 - Must happen before
2910 the following
2911 store.
2912 - Ensures that all
2913 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00002914 to memory have
Tony Tyef16a45e2017-06-06 20:31:59 +00002915 completed before
2916 performing the
2917 store that is being
2918 released.
2919
2920 2. buffer/global/ds/flat_store
2921 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2922 - wavefront - local
2923 - generic
2924 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002925
2926 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002927 - Must happen after
2928 any preceding
2929 local/generic
2930 load/store/load
2931 atomic/store
2932 atomic/atomicrmw.
2933 - Must happen before
2934 the following
2935 atomicrmw.
2936 - Ensures that all
2937 memory operations
2938 to local have
2939 completed before
2940 performing the
2941 atomicrmw that is
2942 being released.
2943
2944 2. buffer/global/flat_atomic
2945 atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye6baa6d22017-10-18 22:16:55 +00002946 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2947
2948 - If OpenCL, omit.
2949 - Must happen after
2950 any preceding
2951 local/generic
2952 load/store/load
2953 atomic/store
2954 atomic/atomicrmw.
2955 - Must happen before
2956 the following
2957 atomicrmw.
2958 - Ensures that all
2959 memory operations
2960 to local have
2961 completed before
2962 performing the
2963 atomicrmw that is
2964 being released.
2965
2966 2. flat_atomic
2967 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
2968 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002969
2970 - If OpenCL, omit
2971 lgkmcnt(0).
2972 - Could be split into
2973 separate s_waitcnt
2974 vmcnt(0) and
2975 s_waitcnt
2976 lgkmcnt(0) to allow
2977 them to be
2978 independently moved
2979 according to the
2980 following rules.
2981 - s_waitcnt vmcnt(0)
2982 must happen after
2983 any preceding
2984 global/generic
2985 load/store/load
2986 atomic/store
2987 atomic/atomicrmw.
2988 - s_waitcnt lgkmcnt(0)
2989 must happen after
2990 any preceding
2991 local/generic
2992 load/store/load
2993 atomic/store
2994 atomic/atomicrmw.
2995 - Must happen before
2996 the following
2997 atomicrmw.
2998 - Ensures that all
2999 memory operations
3000 to global and local
3001 have completed
3002 before performing
3003 the atomicrmw that
3004 is being released.
3005
Tony Tye6baa6d22017-10-18 22:16:55 +00003006 2. buffer/global/ds/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003007 fence release - singlethread *none* *none*
3008 - wavefront
3009 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3010
3011 - If OpenCL and
3012 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003013 not generic, omit.
3014 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003015 currently has no
3016 address space on
3017 the fence need to
3018 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003019 always generate. If
3020 fence had an
3021 address space then
3022 set to address
3023 space of OpenCL
3024 fence flag, or to
3025 generic if both
3026 local and global
3027 flags are
3028 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003029 - Must happen after
3030 any preceding
3031 local/generic
3032 load/load
3033 atomic/store/store
3034 atomic/atomicrmw.
3035 - Must happen before
3036 any following store
3037 atomic/atomicrmw
3038 with an equal or
3039 wider sync scope
3040 and memory ordering
3041 stronger than
3042 unordered (this is
3043 termed the
3044 fence-paired-atomic).
3045 - Ensures that all
3046 memory operations
3047 to local have
3048 completed before
3049 performing the
3050 following
3051 fence-paired-atomic.
3052
Tony Tye6baa6d22017-10-18 22:16:55 +00003053 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3054 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003055
3056 - If OpenCL and
3057 address space is
3058 not generic, omit
3059 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003060 - If OpenCL and
3061 address space is
3062 local, omit
3063 vmcnt(0).
3064 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003065 currently has no
3066 address space on
3067 the fence need to
3068 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003069 always generate. If
3070 fence had an
3071 address space then
3072 set to address
3073 space of OpenCL
3074 fence flag, or to
3075 generic if both
3076 local and global
3077 flags are
3078 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003079 - Could be split into
3080 separate s_waitcnt
3081 vmcnt(0) and
3082 s_waitcnt
3083 lgkmcnt(0) to allow
3084 them to be
3085 independently moved
3086 according to the
3087 following rules.
3088 - s_waitcnt vmcnt(0)
3089 must happen after
3090 any preceding
3091 global/generic
3092 load/store/load
3093 atomic/store
3094 atomic/atomicrmw.
3095 - s_waitcnt lgkmcnt(0)
3096 must happen after
3097 any preceding
3098 local/generic
3099 load/store/load
3100 atomic/store
3101 atomic/atomicrmw.
3102 - Must happen before
3103 any following store
3104 atomic/atomicrmw
3105 with an equal or
3106 wider sync scope
3107 and memory ordering
3108 stronger than
3109 unordered (this is
3110 termed the
3111 fence-paired-atomic).
3112 - Ensures that all
3113 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00003114 have
Tony Tyef16a45e2017-06-06 20:31:59 +00003115 completed before
3116 performing the
3117 following
3118 fence-paired-atomic.
3119
3120 **Acquire-Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003121 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003122 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3123 - wavefront - local
3124 - generic
3125 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3126
Tony Tye6baa6d22017-10-18 22:16:55 +00003127 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003128 - Must happen after
3129 any preceding
3130 local/generic
3131 load/store/load
3132 atomic/store
3133 atomic/atomicrmw.
3134 - Must happen before
3135 the following
3136 atomicrmw.
3137 - Ensures that all
3138 memory operations
3139 to local have
3140 completed before
3141 performing the
3142 atomicrmw that is
3143 being released.
3144
Tony Tye6baa6d22017-10-18 22:16:55 +00003145 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003146 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3147 2. s_waitcnt lgkmcnt(0)
3148
Tony Tye6baa6d22017-10-18 22:16:55 +00003149 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003150 - Must happen before
3151 any following
3152 global/generic
3153 load/load
3154 atomic/store/store
3155 atomic/atomicrmw.
3156 - Ensures any
3157 following global
3158 data read is no
3159 older than the load
3160 atomic value being
3161 acquired.
3162
3163 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3164
Tony Tye6baa6d22017-10-18 22:16:55 +00003165 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003166 - Must happen after
3167 any preceding
3168 local/generic
3169 load/store/load
3170 atomic/store
3171 atomic/atomicrmw.
3172 - Must happen before
3173 the following
3174 atomicrmw.
3175 - Ensures that all
3176 memory operations
3177 to local have
3178 completed before
3179 performing the
3180 atomicrmw that is
3181 being released.
3182
3183 2. flat_atomic
3184 3. s_waitcnt lgkmcnt(0)
3185
Tony Tye6baa6d22017-10-18 22:16:55 +00003186 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003187 - Must happen before
3188 any following
3189 global/generic
3190 load/load
3191 atomic/store/store
3192 atomic/atomicrmw.
3193 - Ensures any
3194 following global
3195 data read is no
3196 older than the load
3197 atomic value being
3198 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00003199
3200 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3201 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003202
3203 - If OpenCL, omit
3204 lgkmcnt(0).
3205 - Could be split into
3206 separate s_waitcnt
3207 vmcnt(0) and
3208 s_waitcnt
3209 lgkmcnt(0) to allow
3210 them to be
3211 independently moved
3212 according to the
3213 following rules.
3214 - s_waitcnt vmcnt(0)
3215 must happen after
3216 any preceding
3217 global/generic
3218 load/store/load
3219 atomic/store
3220 atomic/atomicrmw.
3221 - s_waitcnt lgkmcnt(0)
3222 must happen after
3223 any preceding
3224 local/generic
3225 load/store/load
3226 atomic/store
3227 atomic/atomicrmw.
3228 - Must happen before
3229 the following
3230 atomicrmw.
3231 - Ensures that all
3232 memory operations
3233 to global have
3234 completed before
3235 performing the
3236 atomicrmw that is
3237 being released.
3238
Tony Tye6baa6d22017-10-18 22:16:55 +00003239 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003240 3. s_waitcnt vmcnt(0)
3241
3242 - Must happen before
3243 following
3244 buffer_wbinvl1_vol.
3245 - Ensures the
3246 atomicrmw has
3247 completed before
3248 invalidating the
3249 cache.
3250
3251 4. buffer_wbinvl1_vol
3252
3253 - Must happen before
3254 any following
3255 global/generic
3256 load/load
3257 atomic/atomicrmw.
3258 - Ensures that
3259 following loads
3260 will not see stale
3261 global data.
3262
Tony Tye6baa6d22017-10-18 22:16:55 +00003263 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3264 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003265
3266 - If OpenCL, omit
3267 lgkmcnt(0).
3268 - Could be split into
3269 separate s_waitcnt
3270 vmcnt(0) and
3271 s_waitcnt
3272 lgkmcnt(0) to allow
3273 them to be
3274 independently moved
3275 according to the
3276 following rules.
3277 - s_waitcnt vmcnt(0)
3278 must happen after
3279 any preceding
3280 global/generic
3281 load/store/load
3282 atomic/store
3283 atomic/atomicrmw.
3284 - s_waitcnt lgkmcnt(0)
3285 must happen after
3286 any preceding
3287 local/generic
3288 load/store/load
3289 atomic/store
3290 atomic/atomicrmw.
3291 - Must happen before
3292 the following
3293 atomicrmw.
3294 - Ensures that all
3295 memory operations
3296 to global have
3297 completed before
3298 performing the
3299 atomicrmw that is
3300 being released.
3301
3302 2. flat_atomic
3303 3. s_waitcnt vmcnt(0) &
3304 lgkmcnt(0)
3305
3306 - If OpenCL, omit
3307 lgkmcnt(0).
3308 - Must happen before
3309 following
3310 buffer_wbinvl1_vol.
3311 - Ensures the
3312 atomicrmw has
3313 completed before
3314 invalidating the
3315 cache.
3316
3317 4. buffer_wbinvl1_vol
3318
3319 - Must happen before
3320 any following
3321 global/generic
3322 load/load
3323 atomic/atomicrmw.
3324 - Ensures that
3325 following loads
3326 will not see stale
3327 global data.
3328
3329 fence acq_rel - singlethread *none* *none*
3330 - wavefront
3331 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3332
3333 - If OpenCL and
3334 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003335 not generic, omit.
3336 - However,
Tony Tyef16a45e2017-06-06 20:31:59 +00003337 since LLVM
3338 currently has no
3339 address space on
3340 the fence need to
3341 conservatively
3342 always generate
3343 (see comment for
3344 previous fence).
3345 - Must happen after
3346 any preceding
3347 local/generic
3348 load/load
3349 atomic/store/store
3350 atomic/atomicrmw.
3351 - Must happen before
3352 any following
3353 global/generic
3354 load/load
3355 atomic/store/store
3356 atomic/atomicrmw.
3357 - Ensures that all
3358 memory operations
3359 to local have
3360 completed before
3361 performing any
3362 following global
3363 memory operations.
3364 - Ensures that the
3365 preceding
3366 local/generic load
3367 atomic/atomicrmw
3368 with an equal or
3369 wider sync scope
3370 and memory ordering
3371 stronger than
3372 unordered (this is
3373 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003374 acquire-fence-paired-atomic
3375 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003376 before following
3377 global memory
3378 operations. This
3379 satisfies the
3380 requirements of
3381 acquire.
3382 - Ensures that all
3383 previous memory
3384 operations have
3385 completed before a
3386 following
3387 local/generic store
3388 atomic/atomicrmw
3389 with an equal or
3390 wider sync scope
3391 and memory ordering
3392 stronger than
3393 unordered (this is
3394 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003395 release-fence-paired-atomic
3396 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003397 requirements of
3398 release.
3399
Tony Tye6baa6d22017-10-18 22:16:55 +00003400 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3401 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003402
3403 - If OpenCL and
3404 address space is
3405 not generic, omit
3406 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003407 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003408 currently has no
3409 address space on
3410 the fence need to
3411 conservatively
3412 always generate
3413 (see comment for
3414 previous fence).
3415 - Could be split into
3416 separate s_waitcnt
3417 vmcnt(0) and
3418 s_waitcnt
3419 lgkmcnt(0) to allow
3420 them to be
3421 independently moved
3422 according to the
3423 following rules.
3424 - s_waitcnt vmcnt(0)
3425 must happen after
3426 any preceding
3427 global/generic
3428 load/store/load
3429 atomic/store
3430 atomic/atomicrmw.
3431 - s_waitcnt lgkmcnt(0)
3432 must happen after
3433 any preceding
3434 local/generic
3435 load/store/load
3436 atomic/store
3437 atomic/atomicrmw.
3438 - Must happen before
3439 the following
3440 buffer_wbinvl1_vol.
3441 - Ensures that the
3442 preceding
3443 global/local/generic
3444 load
3445 atomic/atomicrmw
3446 with an equal or
3447 wider sync scope
3448 and memory ordering
3449 stronger than
3450 unordered (this is
3451 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003452 acquire-fence-paired-atomic
3453 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003454 before invalidating
3455 the cache. This
3456 satisfies the
3457 requirements of
3458 acquire.
3459 - Ensures that all
3460 previous memory
3461 operations have
3462 completed before a
3463 following
3464 global/local/generic
3465 store
3466 atomic/atomicrmw
3467 with an equal or
3468 wider sync scope
3469 and memory ordering
3470 stronger than
3471 unordered (this is
3472 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003473 release-fence-paired-atomic
3474 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003475 requirements of
3476 release.
3477
3478 2. buffer_wbinvl1_vol
3479
3480 - Must happen before
3481 any following
3482 global/generic
3483 load/load
3484 atomic/store/store
3485 atomic/atomicrmw.
3486 - Ensures that
3487 following loads
3488 will not see stale
3489 global data. This
3490 satisfies the
3491 requirements of
3492 acquire.
3493
3494 **Sequential Consistent Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003495 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003496 load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003497 - wavefront - local load atomic acquire,
3498 - generic except must generated
3499 all instructions even
3500 for OpenCL.*
3501 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3502 - generic
3503 - Must
3504 happen after
3505 preceding
3506 global/generic load
3507 atomic/store
3508 atomic/atomicrmw
3509 with memory
3510 ordering of seq_cst
3511 and with equal or
3512 wider sync scope.
3513 (Note that seq_cst
3514 fences have their
3515 own s_waitcnt
3516 lgkmcnt(0) and so do
3517 not need to be
3518 considered.)
3519 - Ensures any
3520 preceding
3521 sequential
3522 consistent local
3523 memory instructions
3524 have completed
3525 before executing
3526 this sequentially
3527 consistent
3528 instruction. This
3529 prevents reordering
3530 a seq_cst store
3531 followed by a
3532 seq_cst load. (Note
3533 that seq_cst is
3534 stronger than
3535 acquire/release as
3536 the reordering of
3537 load acquire
3538 followed by a store
3539 release is
3540 prevented by the
3541 waitcnt of
3542 the release, but
3543 there is nothing
3544 preventing a store
3545 release followed by
3546 load acquire from
3547 competing out of
3548 order.)
3549
3550 2. *Following
3551 instructions same as
3552 corresponding load
3553 atomic acquire,
3554 except must generated
3555 all instructions even
3556 for OpenCL.*
3557 load atomic seq_cst - workgroup - local *Same as corresponding
3558 load atomic acquire,
3559 except must generated
3560 all instructions even
3561 for OpenCL.*
3562 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3563 - system - generic vmcnt(0)
3564
3565 - Could be split into
3566 separate s_waitcnt
3567 vmcnt(0)
3568 and s_waitcnt
3569 lgkmcnt(0) to allow
3570 them to be
3571 independently moved
3572 according to the
3573 following rules.
3574 - waitcnt lgkmcnt(0)
3575 must happen after
3576 preceding
3577 global/generic load
3578 atomic/store
3579 atomic/atomicrmw
3580 with memory
3581 ordering of seq_cst
3582 and with equal or
3583 wider sync scope.
3584 (Note that seq_cst
3585 fences have their
3586 own s_waitcnt
3587 lgkmcnt(0) and so do
3588 not need to be
3589 considered.)
3590 - waitcnt vmcnt(0)
3591 must happen after
Tony Tyef16a45e2017-06-06 20:31:59 +00003592 preceding
3593 global/generic load
3594 atomic/store
3595 atomic/atomicrmw
3596 with memory
3597 ordering of seq_cst
3598 and with equal or
3599 wider sync scope.
3600 (Note that seq_cst
3601 fences have their
3602 own s_waitcnt
3603 vmcnt(0) and so do
3604 not need to be
3605 considered.)
3606 - Ensures any
3607 preceding
3608 sequential
3609 consistent global
3610 memory instructions
3611 have completed
3612 before executing
3613 this sequentially
3614 consistent
3615 instruction. This
3616 prevents reordering
3617 a seq_cst store
3618 followed by a
Tony Tye6baa6d22017-10-18 22:16:55 +00003619 seq_cst load. (Note
Tony Tyef16a45e2017-06-06 20:31:59 +00003620 that seq_cst is
3621 stronger than
3622 acquire/release as
3623 the reordering of
3624 load acquire
3625 followed by a store
3626 release is
3627 prevented by the
Tony Tye6baa6d22017-10-18 22:16:55 +00003628 waitcnt of
Tony Tyef16a45e2017-06-06 20:31:59 +00003629 the release, but
3630 there is nothing
3631 preventing a store
3632 release followed by
3633 load acquire from
3634 competing out of
3635 order.)
3636
3637 2. *Following
3638 instructions same as
3639 corresponding load
Tony Tye6baa6d22017-10-18 22:16:55 +00003640 atomic acquire,
3641 except must generated
3642 all instructions even
3643 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003644 store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003645 - wavefront - local store atomic release,
3646 - workgroup - generic except must generated
3647 all instructions even
3648 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003649 store atomic seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003650 - system - generic store atomic release,
3651 except must generated
3652 all instructions even
3653 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003654 atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003655 - wavefront - local atomicrmw acq_rel,
3656 - workgroup - generic except must generated
3657 all instructions even
3658 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003659 atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003660 - system - generic atomicrmw acq_rel,
3661 except must generated
3662 all instructions even
3663 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003664 fence seq_cst - singlethread *none* *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003665 - wavefront fence acq_rel,
3666 - workgroup except must generated
3667 - agent all instructions even
3668 - system for OpenCL.*
3669 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00003670
3671The memory order also adds the single thread optimization constrains defined in
3672table
3673:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3674
3675 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3676 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3677
3678 ============ ==============================================================
3679 LLVM Memory Optimization Constraints
3680 Ordering
3681 ============ ==============================================================
3682 unordered *none*
3683 monotonic *none*
3684 acquire - If a load atomic/atomicrmw then no following load/load
3685 atomic/store/ store atomic/atomicrmw/fence instruction can
3686 be moved before the acquire.
3687 - If a fence then same as load atomic, plus no preceding
3688 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003689 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003690 atomic/store/ store atomic/atomicrmw/fence instruction can
3691 be moved after the release.
3692 - If a fence then same as store atomic, plus no following
3693 associated fence-paired-atomic can be moved before the
3694 fence.
3695 acq_rel Same constraints as both acquire and release.
3696 seq_cst - If a load atomic then same constraints as acquire, plus no
3697 preceding sequentially consistent load atomic/store
3698 atomic/atomicrmw/fence instruction can be moved after the
3699 seq_cst.
3700 - If a store atomic then the same constraints as release, plus
3701 no following sequentially consistent load atomic/store
3702 atomic/atomicrmw/fence instruction can be moved before the
3703 seq_cst.
3704 - If an atomicrmw/fence then same constraints as acq_rel.
3705 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003706
Wei Ding16289cf2017-02-21 18:48:01 +00003707Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003708~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003709
Tony Tyef16a45e2017-06-06 20:31:59 +00003710For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3711(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3712the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003713
Tony Tyef16a45e2017-06-06 20:31:59 +00003714 .. table:: AMDGPU Trap Handler for AMDHSA OS
3715 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003716
Tony Tyef16a45e2017-06-06 20:31:59 +00003717 =================== =============== =============== =======================
3718 Usage Code Sequence Trap Handler Description
3719 Inputs
3720 =================== =============== =============== =======================
3721 reserved ``s_trap 0x00`` Reserved by hardware.
3722 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3723 ``queue_ptr`` ``debugtrap``
3724 ``VGPR0``: intrinsic (not
3725 ``arg`` implemented).
3726 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3727 ``queue_ptr`` terminated and its
3728 associated queue put
3729 into the error state.
3730 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3731 ``queue_ptr`` installed handled
3732 same as ``llvm.trap``.
3733 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3734 breakpoints.
3735 debugger ``s_trap 0x08`` Reserved for debugger.
3736 debugger ``s_trap 0xfe`` Reserved for debugger.
3737 debugger ``s_trap 0xff`` Reserved for debugger.
3738 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003739
Tony Tye46d35762017-08-15 20:47:41 +00003740Unspecified OS
3741--------------
3742
3743This section provides code conventions used when the target triple OS is
3744empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003745
3746Trap Handler ABI
3747~~~~~~~~~~~~~~~~
3748
3749For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3750not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3751instructions are handled as follows:
3752
3753 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3754 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3755
3756 =============== =============== ===========================================
3757 Usage Code Sequence Description
3758 =============== =============== ===========================================
3759 llvm.trap s_endpgm Causes wavefront to be terminated.
3760 llvm.debugtrap *none* Compiler warning given that there is no
3761 trap handler installed.
3762 =============== =============== ===========================================
3763
3764Source Languages
3765================
3766
3767.. _amdgpu-opencl:
3768
3769OpenCL
3770------
3771
3772When generating code for the OpenCL language the target triple environment
3773should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3774
3775When the language is OpenCL the following differences occur:
3776
37771. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
37782. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye46d35762017-08-15 20:47:41 +000037793. Additional metadata is generated
3780 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003781
3782.. TODO
3783 Specify what affect this has. Hidden arguments added. Additional metadata
3784 generated.
3785
3786.. _amdgpu-hcc:
3787
3788HCC
3789---
3790
3791When generating code for the OpenCL language the target triple environment
3792should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3793
3794When the language is OpenCL the following differences occur:
3795
37961. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3797
3798.. TODO
3799 Specify what affect this has.
Tom Stellard3ec09e62016-04-06 01:29:19 +00003800
Tom Stellard45bb48e2015-06-13 03:28:10 +00003801Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00003802---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00003803
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003804AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef59d0712017-11-10 20:51:43 +00003805It supports AMDGCN GFX6-GFX9.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003806
Tony Tyef16a45e2017-06-06 20:31:59 +00003807This section describes general syntax for instructions and operands. For more
3808information about instructions, their semantics and supported combinations of
3809operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00003810[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003811
Tony Tyef16a45e2017-06-06 20:31:59 +00003812An instruction has the following syntax (register operands are normally
3813comma-separated while extra operands are space-separated):
Tom Stellard45bb48e2015-06-13 03:28:10 +00003814
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003815*<opcode> <register_operand0>, ... <extra_operand0> ...*
Tom Stellard45bb48e2015-06-13 03:28:10 +00003816
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003817Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00003818~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00003819
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003820The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003821
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003822* SGPR registers: s0, ... or s[0], ...
3823* VGPR registers: v0, ... or v[0], ...
3824* TTMP registers: ttmp0, ... or ttmp[0], ...
3825* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3826* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3827* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3828* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3829* Register index expressions: v[2*2], s[1-1:2-1]
3830* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00003831
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003832The following extra operands are supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003833
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003834* offset, offset0, offset1
3835* idxen, offen bits
3836* glc, slc, tfe bits
3837* waitcnt: integer or combination of counter values
3838* VOP3 modifiers:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003839
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003840 - abs (\| \|), neg (\-)
Tom Stellard45bb48e2015-06-13 03:28:10 +00003841
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003842* DPP modifiers:
3843
3844 - row_shl, row_shr, row_ror, row_rol
3845 - row_mirror, row_half_mirror, row_bcast
3846 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3847 - row_mask, bank_mask, bound_ctrl
3848
3849* SDWA modifiers:
3850
3851 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3852 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3853 - abs, neg, sext
3854
Tony Tyef16a45e2017-06-06 20:31:59 +00003855Instruction Examples
3856~~~~~~~~~~~~~~~~~~~~
3857
3858DS
3859~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003860
3861.. code-block:: nasm
3862
3863 ds_add_u32 v2, v4 offset:16
3864 ds_write_src2_b64 v2 offset0:4 offset1:8
3865 ds_cmpst_f32 v2, v4, v6
3866 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3867
3868
3869For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3870
Tony Tyef16a45e2017-06-06 20:31:59 +00003871FLAT
3872++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003873
3874.. code-block:: nasm
3875
3876 flat_load_dword v1, v[3:4]
3877 flat_store_dwordx3 v[3:4], v[5:7]
3878 flat_atomic_swap v1, v[3:4], v5 glc
3879 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3880 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3881
3882For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3883
Tony Tyef16a45e2017-06-06 20:31:59 +00003884MUBUF
3885+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003886
3887.. code-block:: nasm
3888
3889 buffer_load_dword v1, off, s[4:7], s1
3890 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3891 buffer_store_format_xy v[1:2], off, s[4:7], s1
3892 buffer_wbinvl1
3893 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3894
3895For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3896
Tony Tyef16a45e2017-06-06 20:31:59 +00003897SMRD/SMEM
3898+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003899
3900.. code-block:: nasm
3901
3902 s_load_dword s1, s[2:3], 0xfc
3903 s_load_dwordx8 s[8:15], s[2:3], s4
3904 s_load_dwordx16 s[88:103], s[2:3], s4
3905 s_dcache_inv_vol
3906 s_memtime s[4:5]
3907
3908For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3909
Tony Tyef16a45e2017-06-06 20:31:59 +00003910SOP1
3911++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003912
3913.. code-block:: nasm
3914
3915 s_mov_b32 s1, s2
3916 s_mov_b64 s[0:1], 0x80000000
3917 s_cmov_b32 s1, 200
3918 s_wqm_b64 s[2:3], s[4:5]
3919 s_bcnt0_i32_b64 s1, s[2:3]
3920 s_swappc_b64 s[2:3], s[4:5]
3921 s_cbranch_join s[4:5]
3922
3923For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3924
Tony Tyef16a45e2017-06-06 20:31:59 +00003925SOP2
3926++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003927
3928.. code-block:: nasm
3929
3930 s_add_u32 s1, s2, s3
3931 s_and_b64 s[2:3], s[4:5], s[6:7]
3932 s_cselect_b32 s1, s2, s3
3933 s_andn2_b32 s2, s4, s6
3934 s_lshr_b64 s[2:3], s[4:5], s6
3935 s_ashr_i32 s2, s4, s6
3936 s_bfm_b64 s[2:3], s4, s6
3937 s_bfe_i64 s[2:3], s[4:5], s6
3938 s_cbranch_g_fork s[4:5], s[6:7]
3939
3940For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3941
Tony Tyef16a45e2017-06-06 20:31:59 +00003942SOPC
3943++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003944
3945.. code-block:: nasm
3946
3947 s_cmp_eq_i32 s1, s2
3948 s_bitcmp1_b32 s1, s2
3949 s_bitcmp0_b64 s[2:3], s4
3950 s_setvskip s3, s5
3951
3952For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3953
Tony Tyef16a45e2017-06-06 20:31:59 +00003954SOPP
3955++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003956
3957.. code-block:: nasm
3958
3959 s_barrier
3960 s_nop 2
3961 s_endpgm
3962 s_waitcnt 0 ; Wait for all counters to be 0
3963 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3964 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3965 s_sethalt 9
3966 s_sleep 10
3967 s_sendmsg 0x1
3968 s_sendmsg sendmsg(MSG_INTERRUPT)
3969 s_trap 1
3970
3971For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3972
3973Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00003974of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00003975range or acceptable values.
3976
Tony Tyef16a45e2017-06-06 20:31:59 +00003977VALU
3978++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00003979
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003980For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3981the assembler will automatically use optimal encoding based on its operands.
3982To force specific encoding, one can add a suffix to the opcode of the instruction:
3983
3984* _e32 for 32-bit VOP1/VOP2/VOPC
3985* _e64 for 64-bit VOP3
3986* _dpp for VOP_DPP
3987* _sdwa for VOP_SDWA
3988
3989VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003990
3991.. code-block:: nasm
3992
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003993 v_mov_b32 v1, v2
3994 v_mov_b32_e32 v1, v2
3995 v_nop
3996 v_cvt_f64_i32_e32 v[1:2], v2
3997 v_floor_f32_e32 v1, v2
3998 v_bfrev_b32_e32 v1, v2
3999 v_add_f32_e32 v1, v2, v3
4000 v_mul_i32_i24_e64 v1, v2, 3
4001 v_mul_i32_i24_e32 v1, -3, v3
4002 v_mul_i32_i24_e32 v1, -100, v3
4003 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4004 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00004005
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004006VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004007
4008.. code-block:: nasm
4009
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004010 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4011 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4012 v_mov_b32 v0, v0 wave_shl:1
4013 v_mov_b32 v0, v0 row_mirror
4014 v_mov_b32 v0, v0 row_bcast:31
4015 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4016 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4017 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00004018
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004019VOP_SDWA examples:
4020
4021.. code-block:: nasm
4022
4023 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4024 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4025 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4026 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4027 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4028
4029For full list of supported instructions, refer to "Vector ALU instructions".
4030
4031HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00004032~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004033
4034AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4035one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00004036
4037.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00004038+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00004039
4040*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004041object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004042
4043.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00004044+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4045
Tom Stellard347ac792015-06-26 21:15:07 +00004046
4047*major*, *minor*, and *stepping* are all integers that describe the instruction
4048set architecture (ISA) version of the assembly program.
4049
4050*vendor* and *arch* are quoted strings. *vendor* should always be equal to
4051"AMD" and *arch* should always be equal to "AMDGPU".
4052
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004053By default, the assembler will derive the ISA version, *vendor*, and *arch*
4054from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004055
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004056.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00004057+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004058
4059This directives specifies that the symbol with given name is a kernel entry point
4060(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00004061
4062.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00004063++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00004064
4065This directive marks the beginning of a list of key / value pairs that are used
4066to specify the amd_kernel_code_t object that will be emitted by the assembler.
4067The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4068any amd_kernel_code_t values that are unspecified a default value will be
4069used. The default value for all keys is 0, with the following exceptions:
4070
4071- *kernel_code_version_major* defaults to 1.
4072- *machine_kind* defaults to 1.
4073- *machine_version_major*, *machine_version_minor*, and
4074 *machine_version_stepping* are derived from the value of the -mcpu option
4075 that is passed to the assembler.
4076- *kernel_code_entry_byte_offset* defaults to 256.
4077- *wavefront_size* defaults to 6.
4078- *kernarg_segment_alignment*, *group_segment_alignment*, and
Tony Tye6baa6d22017-10-18 22:16:55 +00004079 *private_segment_alignment* default to 4. Note that alignments are specified
Tom Stellardff7416b2015-06-26 21:58:31 +00004080 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4081
4082The *.amd_kernel_code_t* directive must be placed immediately after the
4083function label and before any instructions.
4084
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004085For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4086comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00004087
4088Here is an example of a minimal amd_kernel_code_t specification:
4089
Aaron Ballman887ad0e2016-07-19 17:46:55 +00004090.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00004091
4092 .hsa_code_object_version 1,0
4093 .hsa_code_object_isa
4094
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004095 .hsatext
4096 .globl hello_world
4097 .p2align 8
4098 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00004099
4100 hello_world:
4101
4102 .amd_kernel_code_t
4103 enable_sgpr_kernarg_segment_ptr = 1
4104 is_ptr64 = 1
4105 compute_pgm_rsrc1_vgprs = 0
4106 compute_pgm_rsrc1_sgprs = 0
4107 compute_pgm_rsrc2_user_sgpr = 2
4108 kernarg_segment_byte_size = 8
4109 wavefront_sgpr_count = 2
4110 workitem_vgpr_count = 3
4111 .end_amd_kernel_code_t
4112
4113 s_load_dwordx2 s[0:1], s[0:1] 0x0
4114 v_mov_b32 v0, 3.14159
4115 s_waitcnt lgkmcnt(0)
4116 v_mov_b32 v1, s0
4117 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004118 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00004119 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00004120 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004121 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00004122
4123Additional Documentation
4124========================
4125
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004126.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4127.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4128.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4129.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4130.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4131.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4132.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4133.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004134.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
4135.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
4136.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4137.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4138.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4139.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4140.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00004141.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004142.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4143.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
4144.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__